## use this code to access the data that contains missing values
bikedat.miss <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_miss.csv", row.names = 1)
## compare results to the model that uses the full dataset (use the same variables)
bikedat <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_full.csv")
summary(glm(bikers ~ windspeed + hum + temp, family="poisson", data=bikedat))
11.7: Missing Data
Learning objectives
Identify 3 mechanisms of missingness
Describe 5 imputation methods
The Problem of Missingness
- Missing data is a common problem
- Always start by exploring the missingness!
  - Number/proportion of observations with any missing data?
  - Missing in one or more variables?
  - Missing in outcome, predictors, or both?
  - Pattern of missingness? (e.g., is \(X_1\) always missing when \(X_2=1\)?)
- Even the best analysis cannot make up for poor study design/data collection
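The exploration questions above can be answered with a few base R calls. The following is a minimal sketch on a made-up data frame (the names dat, y, x1, and x2 are hypothetical and not part of the course data); it is one possible way to summarize missingness, not a prescribed workflow.

```r
# Toy data frame with missing values (made up for illustration)
dat <- data.frame(
  y  = c(1.2, NA, 3.1, 4.0, NA, 2.2),
  x1 = c(NA, 5, 3, NA, 2, 1),
  x2 = c(0, 1, 0, 1, 1, 0)
)

# Number and proportion of observations with any missing value
n_incomplete    <- sum(!complete.cases(dat))
prop_incomplete <- mean(!complete.cases(dat))

# Number of missing values in each variable
na_per_var <- colSums(is.na(dat))

# Missingness patterns: each distinct combination of missing/observed, with counts
pattern <- apply(is.na(dat), 1, function(r) {
  paste(ifelse(r, "NA", "obs"), collapse = " | ")
})
pattern_counts <- table(pattern)

# Does missingness in x1 line up with the value of x2?
x1_by_x2 <- table(x1_missing = is.na(dat$x1), x2 = dat$x2)

n_incomplete
na_per_var
pattern_counts
x1_by_x2
```

The cross-tabulation in the last step is the kind of check that would reveal a pattern such as \(X_1\) always being missing when \(X_2=1\).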
Missing Data Mechanisms
- Missing completely at random (MCAR)
  - Reason for missingness does not depend on the values of the data, observed or unobserved
  - Each observation has an equal probability of missingness
  - Example: A survey has a front and a back page, and some participants have missing data for the questions on the back page
- Missing at random (MAR)
  - Reason for missingness depends only on the values of the observed data
  - Probability of missingness is the same for each participant conditional on the observed variables
  - Example: Some responses to a survey question about income are missing, but the survey also asks about age, and younger people are less likely than older people to answer income questions
- Missing not at random (MNAR)
  - Reason for missingness depends on the actual values of the missing data, or on unobserved predictors
  - Example: People with higher incomes are less likely to respond to income-related questions
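To make the three mechanisms concrete, here is a small simulation sketch that mirrors the income and age examples. Everything in it is an assumption chosen for illustration: the variable names, the linear relationship between age and income, and the coefficients inside plogis() are all made up.

```r
set.seed(1)
n <- 5000
age    <- runif(n, 20, 70)
income <- 20 + 0.8 * age + rnorm(n, sd = 10)  # hypothetical units

# MCAR: every observation has the same 30% chance of being missing
miss_mcar <- rbinom(n, 1, 0.3) == 1

# MAR: missingness depends only on observed age (younger -> more likely missing)
miss_mar <- rbinom(n, 1, plogis(2 - 0.07 * age)) == 1

# MNAR: missingness depends on income itself (higher income -> more likely missing)
miss_mnar <- rbinom(n, 1, plogis(-4 + 0.06 * income)) == 1

# Mean income among the observed values under each mechanism
means <- c(
  full = mean(income),
  mcar = mean(income[!miss_mcar]),
  mar  = mean(income[!miss_mar]),
  mnar = mean(income[!miss_mnar])
)
round(means, 1)
```

In this setup the observed mean under MCAR stays close to the full-data mean, while the MAR and MNAR versions drift away from it. The difference is that the MAR bias could be corrected by conditioning on age, whereas the MNAR bias cannot be corrected from the observed data alone.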
How can we tell which mechanism is present?
In general, we don’t know, and there is no statistical test that can distinguish MAR from MNAR using the observed data alone
Data are rarely MCAR, though we sometimes assume they are
It is possible, perhaps likely, that data are MNAR, but this is difficult to account for
We generally assume (or hope) that data are MAR
Types of Imputation
In your group, research your assigned imputation method and create 1-2 slides in this Google Slides deck with the following information:
- Explain the method(s) conceptually
- Address the pros and cons of your method(s)
- Show how you can implement the method in R using the bikeshare dataset that contains simulated missingness (the code at the top of this page loads the data). After imputing the missing values with your assigned method(s), compare the results of the model fit to the complete data (the Poisson model at the top of this page) with the results obtained from your imputed dataset(s).
NOTE: Implement the methods with the basic R functions that we’ve used in the course, such as lm. Do not use the mice package to implement the methods.
Model using the complete dataset without missingness: see the summary(glm(...)) call at the top of this page.
Imputation methods
- Mean/median imputation, complete case analysis
- Hot deck imputation, cold deck imputation (you do not need to implement these in R. Focus on the concepts and brainstorm ways that you might implement them)
- Regression imputation - deterministic
- Stochastic regression imputation - random
- Multiple imputation (you do not need to implement this in R. Focus on clearly describing the steps)
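As a rough illustration of how three of the single-imputation methods in this list differ, the sketch below applies mean imputation, deterministic regression imputation, and stochastic regression imputation to simulated data using only lm and other base R functions. The data, variable names, and MCAR missingness are assumptions made for the example; it does not use the bikeshare data.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Remove 150 values of y completely at random (MCAR, for simplicity)
y_miss <- y
y_miss[sample(n, 150)] <- NA
dat <- data.frame(x = x, y = y_miss)
is_miss <- is.na(dat$y)

# Mean imputation: replace every missing y with the observed mean
y_mean <- ifelse(is_miss, mean(dat$y, na.rm = TRUE), dat$y)

# Deterministic regression imputation: replace with the fitted value
fit  <- lm(y ~ x, data = dat)           # lm drops incomplete rows by default
pred <- predict(fit, newdata = dat)     # predictions for all rows
y_det <- ifelse(is_miss, pred, dat$y)

# Stochastic regression imputation: fitted value plus random residual noise
y_sto <- ifelse(is_miss, pred + rnorm(n, sd = sigma(fit)), dat$y)

# Compare the spread of y under each approach
round(c(observed      = sd(dat$y, na.rm = TRUE),
        mean          = sd(y_mean),
        deterministic = sd(y_det),
        stochastic    = sd(y_sto)), 2)
```

Mean imputation shrinks the variability of the imputed variable, and deterministic regression imputation places every imputed value exactly on the regression line. Adding residual noise restores some of the natural scatter, which is the motivation for the stochastic version.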
Textbook
For more information about missing data, see Chapter 3 of Regression Modeling Strategies by Frank E. Harrell Jr.
Application exercise
For this exercise, we will use the titanic3 dataset, which can be loaded with the getHdata function in the Hmisc package. The dataset contains information about passengers on the Titanic, including whether or not the passenger survived.
library(tidyverse)
library(Hmisc)
getHdata(titanic3)
First, fit a logistic regression model regressing the survived outcome on the predictors pclass, sex, parch, and age using complete case analysis (the default in R: glm drops any row with a missing value in a model variable). Note that parch is the number of parents/children the passenger has aboard. What sample size does the model use?
Now, let’s explore the missing data. The Hmisc package has some useful functions for exploring missing data. First, we can generate a plot to show the fraction of missing values for each variable.
na.patterns <- naclus(titanic3)
naplot(na.patterns, 'na per var')
Next, we can generate a plot that shows any hierarchical structure of the missing data.
plot(na.patterns)
This plot indicates that the variable age is missing when body and/or home.dest are also missing. Confirm that this is true at least for the body variable.
It is also useful to examine the characteristics of passengers that have missing data for the age variable. Fit a logistic regression model where the outcome is an indicator for whether or not the age variable is missing. Regress this outcome on sex, pclass, survived, and parch. Focusing on the p-values and the signs of the coefficient estimates, what do you observe about the pattern of missingness in the age variable?
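The general technique, regressing a missingness indicator on other variables, can be sketched on simulated data. The variables group, outcome, and score below are hypothetical stand-ins rather than the titanic3 columns, and the missingness is built to depend on group so that the model has something to detect.

```r
set.seed(7)
n <- 1000

# Hypothetical stand-ins for a categorical predictor, a binary outcome,
# and a continuous variable that will have missing values
group   <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
outcome <- rbinom(n, 1, 0.4)
score   <- rnorm(n, 30, 10)

# Make score more likely to be missing in group "c"
p_miss <- plogis(-2 + 1.5 * (group == "c"))
score[rbinom(n, 1, p_miss) == 1] <- NA
dat <- data.frame(group, outcome, score)

# Indicator: 1 if score is missing, 0 if observed
dat$score_missing <- as.integer(is.na(dat$score))

# Logistic regression of the missingness indicator on the other variables
miss_fit <- glm(score_missing ~ group + outcome, family = "binomial", data = dat)
round(coef(summary(miss_fit)), 3)
```

A positive, significant coefficient (here for group "c") means that level is associated with higher odds of a missing value. That is evidence against MCAR, although it cannot by itself separate MAR from MNAR.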
Multiple imputation with the mice package
Recall the general 3-step process of multiple imputation:
- Replicate the dataset \(m\) times and impute the missing values on each of the \(m\) datasets. Note that the imputation method must involve some degree of randomness so that the \(m\) complete datasets are not all the same.
- Perform the analysis on each of the \(m\) datasets
- Combine the analysis results across the \(m\) datasets
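The combining step is usually done with Rubin's rules: the pooled estimate is the average of the \(m\) estimates, and its variance is the average within-imputation variance plus \((1 + 1/m)\) times the between-imputation variance. The function below is a minimal sketch of that calculation for a single coefficient. The name pool_rubin and the input numbers are made up for illustration, and the sketch omits the degrees-of-freedom calculation that a full implementation would include.

```r
# est: vector of m point estimates, one per imputed dataset
# se:  vector of the m corresponding standard errors
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  u_bar <- mean(se^2)                 # within-imputation variance
  b     <- var(est)                   # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Made-up estimates and standard errors from m = 5 imputed datasets
est <- c(0.50, 0.55, 0.45, 0.52, 0.48)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)
pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects how much the estimates vary across imputations, which is the uncertainty due to the missing data.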
The options in the mice function indicate the number \(m\) of replicated datasets (5 is the default) and the imputation method to use. “pmm” refers to predictive mean matching. Predictive mean matching combines the regression method and the hot deck imputation that we covered last week: for each observation with missing data, it takes a donor value from an observed case whose predicted value is closest to the predicted value of the observation with missing data. This is beneficial because every imputed value is an actually observed value, so the imputations stay in a plausible range.
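The matching idea described above can be sketched in base R. This is a deliberately simplified, deterministic version that always takes the single closest donor, on simulated data with made-up variable names. It is not the algorithm that mice runs, which adds randomness (for example, by drawing the donor from several close candidates) so that the \(m\) imputed datasets differ.

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 50)] <- NA                 # 50 values missing completely at random
is_miss <- is.na(y)

# Regression fitted on the observed cases (lm drops rows with missing y)
fit <- lm(y ~ x)

# Predicted values for every observation, observed or missing
pred_all <- predict(fit, newdata = data.frame(x = x))
pred_obs <- pred_all[!is_miss]
y_obs    <- y[!is_miss]

# For each missing case, find the observed case with the closest predicted value
donor <- vapply(pred_all[is_miss],
                function(p) which.min(abs(pred_obs - p)),
                integer(1))

# Impute with the donor's observed y value
y_imp <- y
y_imp[is_miss] <- y_obs[donor]

range(y_obs)
range(y_imp[is_miss])
```

Because every imputed value is borrowed from an observed case, the imputed values cannot fall outside the observed range of y.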
Let’s use a subset that only contains the predictors age, pclass, parch, sex, and the outcome survived.
library(mice)
library(sjlabelled)
library(tidyverse)
titanic.sub <- titanic3 |> select(c("age","pclass","parch","sex","survived"))
titanic.sub <- unlabel(titanic.sub) #unlabel the data (labels cause problem for the mice function)
titanic.imp <- mice(titanic.sub, m=5, method="pmm", print=FALSE)
The titanic.imp object (of class mids) contains a lot of information about the imputed values. The imp element contains the actual imputed values, stored separately for each variable, so titanic.imp$imp$age holds the imputations for age. Age is the only variable in our set with missing values. Note the dimensions of this table: each row represents an observation with a missing age, and each column contains one of the \(m=5\) imputed values.
dim(titanic.imp$imp$age)
Now let’s visualize the distribution of the observed and imputed values of age. What do you observe in this plot?
titanic.comp <- complete(titanic.imp, "long", include=TRUE) #stack the original data (.imp = 0) and the m imputed datasets (.imp = 1,...,5)
titanic.comp$age.NA <- cci(titanic3$age) #TRUE if age was observed, FALSE if it was missing (recycled across the stacked datasets)
ggplot(titanic.comp, aes(x= .imp, y=age, col=age.NA))+
  geom_jitter()
Now we can use the with function to fit the regression model on each of the imputed datasets.
with(titanic.imp, glm(survived ~ pclass + age + sex + parch, family="binomial"))
Finally, we can combine the results from these models using the pool function. How do the results compare to the complete case model?
imp.mods <- with(titanic.imp, glm(survived ~ pclass + age + sex + parch, family="binomial"))
summary(pool(imp.mods))
References
Regression Modeling Strategies by Frank E Harrell Jr (Book linked above)