## use this code to access the data that contains missing values
bikedat.miss <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_miss.csv", row.names = 1)
## download from https://github.com/anlane611/datasets/blob/main/bike_full.csv
## compare results to the model that uses the full dataset (use the same variables)
bikedat <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_full.csv")

Missing Data
Dos and Don’ts
Do:
Recognize that there is missingness
Investigate why if possible
Address in a suitable way
Don’t:
Eliminate without investigating
Address without understanding the method used and its potential consequences
Types/Mechanisms of Missing Data
- Missing completely at random (MCAR)
Reason for missingness does not depend on the values of the observed data or of the missing data
Each observation has an equal probability of missingness
Example: A survey has a front and back, and some participants have missing data for the questions on the back page
- Missing at random (MAR)
Reason for missingness depends only on the values of the observed data, not on the missing values themselves
Probability of missingness is the same for each participant conditional on observed variables
Example: Some responses for a survey question about income are missing, but the survey includes questions about age, and younger people are less likely to answer income questions than older people
- Missing not at random (MNAR)
Reason for missingness depends on the actual values of the missing data, or unobserved predictors
Example: People who have higher income are less likely to respond to income-related questions
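The three mechanisms can be compared in a small simulation. This is a minimal sketch based on the survey examples above: the variables `age` and `income`, the sample size, and every coefficient are made up for illustration and are not taken from the bikeshare data.

```r
set.seed(1)
n <- 1000
age    <- round(runif(n, 18, 80))               # always observed
income <- 20 + 0.5 * age + rnorm(n, sd = 10)    # variable that will lose values

# MCAR: every observation has the same 20% chance of being missing
miss_mcar <- rbinom(n, 1, 0.20) == 1

# MAR: missingness depends only on observed age (younger = more likely missing)
miss_mar <- rbinom(n, 1, plogis(2 - 0.08 * age)) == 1

# MNAR: missingness depends on income itself (higher = more likely missing)
miss_mnar <- rbinom(n, 1, plogis(-4 + 0.07 * income)) == 1

income_mcar <- replace(income, miss_mcar, NA)
income_mar  <- replace(income, miss_mar,  NA)
income_mnar <- replace(income, miss_mnar, NA)

# Compare the mean of the observed values with the true mean
obs_means <- c(
  true = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar,  na.rm = TRUE),
  mnar = mean(income_mnar, na.rm = TRUE)
)
round(obs_means, 2)
```

Under MCAR the observed mean should stay close to the true mean. Under MAR and MNAR it will generally drift, because the rows that remain are no longer a random subset.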
Investigating Missingness
Always start by exploring the missingness!
Number/proportion of observations with any missing data?
Missing in one or more variables?
Missing in outcome, predictors, or both?
Pattern of missingness? (e.g., is \(X_1\) always missing when \(X_2 = 1\)?)
Even the best analysis cannot make up for poor study design/data collection!
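The exploration questions above can be answered with a few base R calls. This sketch uses a toy data frame with hypothetical columns `y`, `x1`, and `x2`; to apply it to the course data, substitute `bikedat.miss` and its actual variable names.

```r
# Toy data frame (hypothetical columns); replace with bikedat.miss in practice
dat <- data.frame(
  y  = c(10, NA, 13, 15, 9, NA),
  x1 = c(1.2, 0.7, NA, 2.1, NA, 1.5),
  x2 = c(0, 1, 1, 0, 1, 0)
)

# Number and proportion of observations with any missing value
n_incomplete    <- sum(!complete.cases(dat))
prop_incomplete <- mean(!complete.cases(dat))

# Missing values per variable (outcome and predictors)
na_per_var <- colSums(is.na(dat))

# Missingness patterns: one 0/1 flag per column, counted across rows
pattern_counts <- table(
  apply(is.na(dat), 1, function(r) paste(as.integer(r), collapse = ""))
)

# Is x1 missing more often at one level of x2?
x1_by_x2 <- tapply(is.na(dat$x1), dat$x2, mean)

n_incomplete
na_per_var
pattern_counts
x1_by_x2
```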
Missing Data Imputation
Definition: Replacing missing values with values that can be used in analysis
Many analysis methods/packages automatically deal with missing data in some way. Make sure you understand how the missing data are handled!
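As one illustration of automatic handling, `lm` with R's default settings silently drops any row that has a missing value in a model variable. The sketch below uses a toy data frame with hypothetical columns to show how to check how many rows were actually used.

```r
# Toy data (hypothetical columns) with missing values in outcome and predictor
dat <- data.frame(
  y  = c(10, NA, 13, 15, 9, 12, 14, 11),
  x1 = c(1.2, 0.7, NA, 2.1, 0.4, 1.5, 1.9, NA)
)

fit <- lm(y ~ x1, data = dat)

nrow(dat)               # rows in the data
nobs(fit)               # rows actually used in the fit
fit$na.action           # indices of the rows that were dropped
getOption("na.action")  # session-wide default that lm() follows
```

Comparing `nrow(dat)` with `nobs(fit)` is a quick way to notice that a model was fitted on complete cases only.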
Your turn!
In your group, research your assigned imputation method and create 1-2 slides in this Google Slides deck with the following information:
- Explain the method(s) conceptually, including pros and cons
- Show how you can implement the method in R using the bikeshare dataset that contains simulated missingness. Compare the results of the model without missingness (shown on next slide) to the results obtained with your imputed dataset. Implement the methods with the basic R functions that we’ve used in the course, such as lm/glm. Do not use the MICE package to implement the methods.
Imputation Methods
1. Complete case analysis and mean/median imputation
2. Hot deck imputation
3. Cold deck imputation
4. Regression imputation - deterministic
5. Stochastic regression imputation - random
6. Multiple imputation
Methods 2, 3, and 6: No need to implement in R; focus on clearly describing the steps and brainstorm ways to implement them.
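One possible way to approach complete case analysis, mean imputation, and the two regression imputation methods with `lm` alone is sketched below. It runs on simulated data with hypothetical variables `y`, `x`, and `z`, not on the bikeshare data. The missingness is generated completely at random, and using only `x` in the imputation model is a simplifying assumption of this sketch.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
z <- 2 + 1.5 * x + rnorm(n)              # predictor that will lose values
y <- 1 + 0.8 * z + 0.5 * x + rnorm(n)    # outcome
full <- data.frame(y, x, z)

miss <- full
miss$z[sample(n, 40)] <- NA              # 40 values removed completely at random
is_na <- is.na(miss$z)

# Complete case analysis: lm() drops incomplete rows by default
fit_cc <- lm(y ~ x + z, data = miss)

# Mean imputation: replace each missing z with the observed mean
mean_imp <- miss
mean_imp$z[is_na] <- mean(miss$z, na.rm = TRUE)
fit_mean <- lm(y ~ x + z, data = mean_imp)

# Deterministic regression imputation: predict z from x using observed rows
imp_model <- lm(z ~ x, data = miss)
det_imp <- miss
det_imp$z[is_na] <- predict(imp_model, newdata = miss[is_na, ])
fit_det <- lm(y ~ x + z, data = det_imp)

# Stochastic regression imputation: same prediction plus random residual noise
sto_imp <- miss
sto_imp$z[is_na] <- predict(imp_model, newdata = miss[is_na, ]) +
  rnorm(sum(is_na), mean = 0, sd = sigma(imp_model))
fit_sto <- lm(y ~ x + z, data = sto_imp)

# Coefficients side by side with the fit on the data without missingness
comparison <- cbind(
  full       = coef(lm(y ~ x + z, data = full)),
  complete   = coef(fit_cc),
  mean       = coef(fit_mean),
  regression = coef(fit_det),
  stochastic = coef(fit_sto)
)
round(comparison, 3)
```

The same structure should carry over to `bikedat.miss` once the model formula and the variable with missing values are swapped in. Comparing standard errors with `summary()` as well as coefficients may help show how each method treats uncertainty.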