Missing Data

Dos and Don’ts

Do:

Recognize that there is missingness
Investigate why if possible
Address in a suitable way

Don’t:

Eliminate observations with missing data without investigating
Address missingness without understanding how the method works and its potential consequences

Types/Mechanisms of Missing Data

Missing completely at random (MCAR)

Reason for missingness does not depend on the values of the data, whether observed or missing
Each observation has an equal probability of missingness
Example: A survey has a front and back, and some participants have missing data for the questions on the back page

Missing at random (MAR)

Reason for missingness depends on the values of the observed data
Probability of missingness is the same for each participant conditional on observed variables
Example: Some responses to a survey question about income are missing. The survey also asks about age, and younger people are less likely than older people to answer income questions

Missing not at random (MNAR)

Reason for missingness depends on the actual values of the missing data, or on unobserved predictors
Example: People with higher incomes are less likely to respond to income-related questions
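
The three mechanisms can be compared in a small simulation. The sketch below is illustrative only: the variables age and income, the coefficients used for the missingness probabilities, and the helper make_missing() are all made up for this example and do not come from a real survey.

```r
set.seed(611)  # arbitrary seed, used only for reproducibility
n <- 5000

# Hypothetical survey: age is always observed, income may be missing
age    <- round(runif(n, 18, 80))
income <- 20 + 0.5 * age + rnorm(n, sd = 10)  # in thousands; made-up relationship

# MCAR: every respondent has the same 30% chance of a missing income
p_mcar <- rep(0.3, n)

# MAR: younger respondents are more likely to skip the question;
# the probability depends only on age, which is observed
p_mar <- plogis(2 - 0.06 * age)

# MNAR: higher earners are more likely to skip the question;
# the probability depends on the income value that ends up missing
p_mnar <- plogis(-4 + 0.07 * income)

# Hypothetical helper: set x to NA with probability p (element-wise)
make_missing <- function(x, p) {
  x[runif(length(x)) < p] <- NA
  x
}

sim <- data.frame(
  age         = age,
  income_full = income,
  income_mcar = make_missing(income, p_mcar),
  income_mar  = make_missing(income, p_mar),
  income_mnar = make_missing(income, p_mnar)
)

# Mean of the observed incomes under each mechanism vs. the full-data mean
round(colMeans(sim[, -1], na.rm = TRUE), 2)

# Under MAR, a regression on the observed variable (age) fit to the
# observed cases still estimates the made-up slope of 0.5
coef(lm(income_mar ~ age, data = sim))
```

In this setup the MCAR mean should sit close to the full-data mean, while the MAR and MNAR means shift. Under MAR the shift can be addressed using the observed age variable; under MNAR the observed data alone do not tell us how to correct it.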

Investigating Missingness

Always start by exploring the missingness!
- Number/proportion of observations with any missing data?
- Missing in one or more variables?
- Missing in outcome, predictors, or both?
- Pattern of missingness? (e.g., is \(X_1\) always missing when \(X_2 = 1\)?)
Even the best analysis cannot make up for poor study design/data collection!
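
The exploration questions above can be answered with a few base R calls. This is a minimal sketch that uses the built-in airquality dataset as a stand-in (an assumption made so the code runs without downloads); the object names dat, n_miss, miss_ind, and pattern_counts are just illustrative.

```r
# airquality ships with base R (datasets package) and has real missing values
dat <- airquality

# Number and proportion of missing values in each variable
n_miss <- colSums(is.na(dat))
data.frame(n_missing = n_miss, prop_missing = round(n_miss / nrow(dat), 3))

# Number and proportion of observations with any missing value
n_incomplete <- sum(!complete.cases(dat))
c(n_incomplete = n_incomplete, prop_incomplete = n_incomplete / nrow(dat))

# Missingness patterns: one row per observed combination of missing variables
miss_ind <- as.data.frame(is.na(dat[, n_miss > 0, drop = FALSE]))
pattern_counts <- as.data.frame(table(miss_ind))
pattern_counts[pattern_counts$Freq > 0, ]

# Does missingness in one variable relate to an observed variable?
# Here: proportion of missing Ozone readings by month
tapply(is.na(dat$Ozone), dat$Month, mean)
```

The same calls should work on any data frame, so swapping in another dataset only requires changing the first assignment and the variable names in the last line.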

Missing Data Imputation

Definition: Replacing missing values with values that can be used in analysis

Many analysis methods/packages automatically deal with missing data in some way. Make sure you understand how the missing data are handled!
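
As one example of automatic handling, lm() and glm() drop incomplete rows by default without any warning. The sketch below uses the built-in airquality dataset as a stand-in (an assumption for illustration) to show how to check what a fit actually used.

```r
dat <- airquality  # built-in dataset with missing Ozone and Solar.R values

# Summary functions propagate NA unless told otherwise
mean(dat$Ozone)                # NA
mean(dat$Ozone, na.rm = TRUE)  # mean of the observed values only

# lm() and glm() use the na.action set in options(), which is na.omit by
# default, so rows with a missing value in any model variable are dropped
getOption("na.action")
fit <- lm(Ozone ~ Solar.R + Wind, data = dat)

nrow(dat)              # rows supplied
nobs(fit)              # rows actually used in the fit
length(fit$na.action)  # rows dropped because of missing values

# na.action = na.fail turns silent dropping into an error
strict <- try(lm(Ozone ~ Solar.R + Wind, data = dat, na.action = na.fail),
              silent = TRUE)
inherits(strict, "try-error")
```

Comparing nobs() with nrow() is a quick check that a model was fit to the observations you expected.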

Your turn!

In your group, research your assigned imputation method and create 1–2 slides in this Google Slides deck with the following information:

Explain the method(s) conceptually, include pros and cons
Show how to implement the method in R using the bikeshare dataset, which contains simulated missingness. Compare the results of the model fit to the full dataset (shown on the next slide) with the results obtained from your imputed dataset. Implement the methods with the basic R functions that we’ve used in the course, such as lm/glm. Do not use the mice package to implement the methods.

Bikeshare model without missingness

## use this code to access the data that contain missing values
bikedat.miss <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_miss.csv", row.names = 1)

## compare results to the model that uses the full dataset (use the same variables)
## the full dataset can also be downloaded from https://github.com/anlane611/datasets/blob/main/bike_full.csv
bikedat <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_full.csv")

summary(glm(bikers ~ windspeed + hum + temp, 
            family="poisson", data=bikedat))


Call:
glm(formula = bikers ~ windspeed + hum + temp, family = "poisson", 
    data = bikedat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.45017    0.02643  168.40   <2e-16 ***
windspeed   -0.38077    0.04459   -8.54   <2e-16 ***
hum         -0.87922    0.02676  -32.85   <2e-16 ***
temp         2.10498    0.02852   73.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 32460  on 249  degrees of freedom
Residual deviance: 25631  on 246  degrees of freedom
AIC: 27131

Number of Fisher Scoring iterations: 5

Imputation Methods

1. Complete case analysis and mean/median imputation
2. Hot deck imputation
3. Cold deck imputation
4. Regression imputation (deterministic)
5. Stochastic regression imputation (random)
6. Multiple imputation

Methods 2, 3, and 6: You don’t need to implement these in R. Focus on clearly describing the steps and brainstorm ways to implement them.
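
For the methods that lend themselves to basic R functions, a rough sketch of mean imputation and the two regression imputations follows. It uses the built-in airquality dataset rather than the bikeshare data, and it assumes a normal linear model for Ozone given Wind and Temp; both choices are for illustration only, and a different outcome type (such as a count) would call for a different imputation model.

```r
set.seed(611)       # arbitrary seed so the stochastic step is reproducible
dat  <- airquality  # built-in data; Ozone has missing values, Wind and Temp do not
miss <- is.na(dat$Ozone)

# Mean imputation: every missing value gets the observed mean
ozone_mean <- dat$Ozone
ozone_mean[miss] <- mean(dat$Ozone, na.rm = TRUE)

# Deterministic regression imputation: fit on the observed rows, then
# fill each missing value with its predicted value
imp_fit <- lm(Ozone ~ Wind + Temp, data = dat)  # lm() drops rows with missing Ozone
pred    <- predict(imp_fit, newdata = dat[miss, ])
ozone_reg <- dat$Ozone
ozone_reg[miss] <- pred

# Stochastic regression imputation: prediction plus a random draw from the
# estimated residual distribution (normal errors are an assumption here)
ozone_sto <- dat$Ozone
ozone_sto[miss] <- pred + rnorm(sum(miss), mean = 0, sd = sigma(imp_fit))

# Compare the mean and standard deviation of Ozone under each approach
imputed <- data.frame(observed  = dat$Ozone,
                      mean_imp  = ozone_mean,
                      reg_imp   = ozone_reg,
                      stoch_imp = ozone_sto)
round(sapply(imputed, function(x) c(mean = mean(x, na.rm = TRUE),
                                    sd   = sd(x, na.rm = TRUE))), 2)
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, since every filled-in value sits at the center of the distribution. The stochastic version adds noise back in, at the cost of results that change with the random seed.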