IDS 702 - Fall 2025

Missing Data

Dos and Don’ts

Do:

  • Recognize that there is missingness

  • Investigate why if possible

  • Address in a suitable way

Don’t:

  • Eliminate without investigating

  • Address without understanding how and potential consequences

Types/Mechanisms of Missing Data

  1. Missing completely at random (MCAR)
  • Reason for missingness does not depend on the values of the data, observed or unobserved

  • Each observation has an equal probability of missingness

  • Example: A survey has a front and back, and some participants have missing data for the questions on the back page

  2. Missing at random (MAR)
  • Reason for missingness depends on the values of the observed data, but not on the missing values themselves

  • Probability of missingness is the same for each participant conditional on observed variables

  • Example: Some responses to a survey question about income are missing. The survey also asks about age, and younger people are less likely than older people to answer income questions

  3. Missing not at random (MNAR)
  • Reason for missingness depends on the actual values of the missing data, or on unobserved predictors

  • Example: People who have higher income are less likely to respond to income-related questions
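The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `age` and `income`, the sample size, and all coefficients are hypothetical and chosen purely for illustration.

```r
set.seed(702)
n <- 1000
age <- round(runif(n, 18, 80))
income <- 20 + 0.8 * age + rnorm(n, sd = 10)  # hypothetical income (in $1000s)

# MCAR: every observation has the same 20% chance of being missing
miss_mcar <- rbinom(n, 1, 0.2) == 1

# MAR: missingness depends only on the observed variable age
# (younger respondents are more likely to skip the income question)
miss_mar <- rbinom(n, 1, plogis(2 - 0.08 * age)) == 1

# MNAR: missingness depends on the (unobserved) income value itself
miss_mnar <- rbinom(n, 1, plogis(-6 + 0.07 * income)) == 1

income_mcar <- ifelse(miss_mcar, NA, income)
income_mar  <- ifelse(miss_mar,  NA, income)
income_mnar <- ifelse(miss_mnar, NA, income)

# Compare the mean of the observed incomes with the true mean
c(truth = mean(income),
  mcar  = mean(income_mcar, na.rm = TRUE),
  mar   = mean(income_mar,  na.rm = TRUE),
  mnar  = mean(income_mnar, na.rm = TRUE))
```

In this setup the observed mean should stay close to the truth under MCAR, while it shifts under MAR and MNAR. Under MAR the shift can in principle be corrected using `age`; under MNAR it cannot be corrected from the observed data alone.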

Investigating Missingness

  • Always start by exploring the missingness!

    • Number/proportion of observations with any missing data?

    • Missing in one or more variables?

    • Missing in outcome, predictors, or both?

    • Pattern of missingness? (e.g., \(X_1\) always missing when \(X_2=1\) ?)

  • Even the best analysis cannot make up for poor study design/data collection!
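The exploration questions above can be answered with a few base R calls. The data frame below is a small hypothetical example (its values and the names `y`, `x1`, `x2` are made up); the same calls should work on any data frame.

```r
# Hypothetical data: x1 is always missing when x2 == 1
dat <- data.frame(
  y  = c(10, 12, NA, 15, 9, 11),
  x1 = c(NA, 2.1, 3.4, NA, 1.8, 2.9),
  x2 = c(1, 0, 0, 1, 0, 0)
)

# Number and proportion of observations with any missing data
n_incomplete    <- sum(!complete.cases(dat))
prop_incomplete <- mean(!complete.cases(dat))

# Missing values per variable (outcome vs. predictors)
na_per_var <- colSums(is.na(dat))

# Pattern check: is x1 missing whenever x2 == 1?
pattern <- table(x1_missing = is.na(dat$x1), x2 = dat$x2)

n_incomplete
prop_incomplete
na_per_var
pattern
```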

Missing Data Imputation

Definition: Replacing missing values with values that can be used in analysis

Many analysis methods/packages automatically deal with missing data in some way. Make sure you understand how the missing data are handled!
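As one illustration of automatic handling, `lm()` and `glm()` drop incomplete rows by default (a complete case analysis) when the `na.action` option is at its default value. The sketch below uses simulated data with hypothetical variables `x` and `y` to show how to check how many rows a fitted model actually used.

```r
set.seed(1)
dat <- data.frame(x = rnorm(50))
dat$y <- rpois(50, lambda = exp(0.5 + 0.3 * dat$x))
dat$x[1:10] <- NA  # introduce missing predictor values

fit <- glm(y ~ x, family = "poisson", data = dat)

getOption("na.action")  # "na.omit" unless the option has been changed
nrow(dat)               # rows supplied to glm()
nobs(fit)               # rows actually used in the fit
length(fit$na.action)   # rows dropped because of missing values
```

`summary(fit)` also reports the number of observations deleted due to missingness, which is easy to overlook in long output.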

Your turn!

In your group, research your assigned imputation method and create 1-2 slides in this Google Slides deck with the following information:

  1. Explain the method(s) conceptually, including pros and cons
  2. Show how you can implement the method in R using the bikeshare dataset that contains simulated missingness. Compare the results of the model without missingness (shown on next slide) to the results obtained with your imputed dataset. Implement the methods with the basic R functions that we’ve used in the course, such as lm/glm. Do not use the mice package to implement the methods.

Bikeshare model without missingness

## use this code to access the data that contains missing values
bikedat.miss <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_miss.csv", row.names = 1)

## compare results to the model that uses the full dataset (use the same variables)
## the full dataset can also be downloaded from https://github.com/anlane611/datasets/blob/main/bike_full.csv
bikedat <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/main/bike_full.csv")

summary(glm(bikers ~ windspeed + hum + temp, 
            family="poisson", data=bikedat))

Call:
glm(formula = bikers ~ windspeed + hum + temp, family = "poisson", 
    data = bikedat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.45017    0.02643  168.40   <2e-16 ***
windspeed   -0.38077    0.04459   -8.54   <2e-16 ***
hum         -0.87922    0.02676  -32.85   <2e-16 ***
temp         2.10498    0.02852   73.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 32460  on 249  degrees of freedom
Residual deviance: 25631  on 246  degrees of freedom
AIC: 27131

Number of Fisher Scoring iterations: 5

Imputation Methods

  1. Complete case analysis and mean/median imputation

  2. Hot deck imputation

  3. Cold deck imputation

  4. Regression imputation - deterministic

  5. Stochastic regression imputation - random

  6. Multiple imputation

Methods 2, 3, and 6: You don’t need to implement these in R. Focus on clearly describing the steps and brainstorming ways to implement them.
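To make the difference between a few of these methods concrete, here is a minimal sketch of mean imputation, deterministic regression imputation, and stochastic regression imputation using only `lm()`. It uses simulated data, not the bikeshare data; the variables `x` and `y`, the MCAR missingness in `y`, and all parameter values are assumptions made for illustration.

```r
set.seed(702)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y_miss <- y
y_miss[sample(n, 50)] <- NA  # assumed MCAR missingness in y
dat <- data.frame(x = x, y = y_miss)
is_miss <- is.na(dat$y)

# Mean imputation: every missing value gets the observed mean
y_mean <- ifelse(is_miss, mean(dat$y, na.rm = TRUE), dat$y)

# Deterministic regression imputation: fill in the fitted value
fit  <- lm(y ~ x, data = dat)  # lm() drops the incomplete rows
pred <- predict(fit, newdata = dat[is_miss, , drop = FALSE])
y_det <- dat$y
y_det[is_miss] <- pred

# Stochastic regression imputation: fitted value plus random noise
# drawn with the residual standard deviation of the fitted model
y_sto <- dat$y
y_sto[is_miss] <- pred + rnorm(sum(is_miss), mean = 0, sd = sigma(fit))

# Compare the spread of each completed variable with the true one
c(truth = sd(y), mean_imp = sd(y_mean), det = sd(y_det), sto = sd(y_sto))
```

Mean imputation shrinks the variability of the completed variable, and deterministic regression imputation places every imputed value exactly on the fitted line. The stochastic version adds noise to restore some of that variability, but a single imputed dataset still does not reflect the uncertainty about the imputed values.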