Do:
Recognize that there is missingness
Investigate why if possible
Address in a suitable way
Don’t:
Eliminate missing values without investigating
Address missingness without understanding how the method works and its potential consequences
Definition of imputation: Replacing missing values with values that can be used in analysis
Many analysis methods/packages automatically deal with missing data in some way. Make sure you understand how the missing data are handled!
Only use observations without any missing data in the relevant variables
Many R modeling functions (e.g., lm()) do this by default, silently dropping incomplete rows
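As a minimal sketch of using only complete observations, the Python snippet below filters out any record with a missing value in the variables of interest and reports how many rows were lost. The toy records, the variable names, and the helper complete_cases are hypothetical and used only for illustration; None stands in for a missing value.

```python
# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def complete_cases(records, variables):
    """Keep only records with no missing value in the listed variables."""
    return [r for r in records if all(r[v] is not None for v in variables)]


kept = complete_cases(rows, ["age", "income"])
n_dropped = len(rows) - len(kept)

# Always report how much data was discarded before analysing what is left.
print(f"kept {len(kept)} of {len(rows)} rows, dropped {n_dropped}")
```

Checking the number of dropped rows is the point of the exercise: software that removes incomplete rows silently can shrink the sample far more than expected.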
Replace the missing value with the mean/median for continuous variables or the mode for categorical variables
Pros: Easy to implement
Cons: Can bias the results, because it understates variability and weakens relationships between variables
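The following sketch illustrates mean and mode imputation and one way the bias can show up: filling gaps with the mean pulls the variance of the variable downward. The data and the helper impute_constant are hypothetical; only the Python standard library is used.

```python
from statistics import mean, mode, pvariance


def impute_constant(values, fill):
    """Replace every missing value (None) with the same fill value."""
    return [fill if v is None else v for v in values]


# Continuous variable: fill with the mean of the observed values.
heights = [150.0, 160.0, None, 170.0, None, 180.0]
observed = [v for v in heights if v is not None]
heights_mean = impute_constant(heights, mean(observed))

# Categorical variable: fill with the most common observed category.
colors = ["red", None, "blue", "red", None]
colors_mode = impute_constant(colors, mode([c for c in colors if c is not None]))

# The imputed series is less spread out than the observed values alone.
print("variance of observed values:", pvariance(observed))
print("variance after mean imputation:", pvariance(heights_mean))
```

The median can be substituted for the mean by swapping in statistics.median, which is less sensitive to outliers but shrinks the variance in the same way.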
Replace the missing value with a value from an external source (e.g., population size for a certain year)
Pros: Can be confident in accuracy
Cons: Often difficult to find a suitable external source
Replace the missing value with the value from a similar observation
Pros: Does not rely on model fitting, only plausible values can be imputed
Cons: Can be difficult to select “similar” observations
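Borrowing a value from a similar observation is sometimes called hot-deck imputation. The sketch below assumes that "similar" means closest on a single fully observed variable, which is only one of many possible definitions; the records, the variable names, and the helper nearest_donor_impute are hypothetical.

```python
def nearest_donor_impute(records, target, match_on):
    """Fill missing `target` values with the value from the closest donor.

    Assumption: similarity is the absolute difference on one fully
    observed variable, `match_on`. Real applications often match on
    several variables or draw at random among close donors.
    """
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no observed values available to donate")

    filled = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_on] - r[match_on]))
            r = {**r, target: donor[target]}  # copy; leave the input untouched
        filled.append(r)
    return filled


people = [
    {"age": 25, "income": 30000},
    {"age": 40, "income": 55000},
    {"age": 27, "income": None},
    {"age": 60, "income": 70000},
]
filled = nearest_donor_impute(people, "income", "age")
print(filled[2])
```

Because every imputed value is copied from a real observation, the result is always a value that actually occurs in the data, which is the "only plausible values" advantage listed above.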
Build a model for the variable with missingness based on observed predictors
Replace the missing value with the predicted value from the model
Pros: Uses the relationships in the data, relatively easy to implement, suitable for continuous or categorical data
Cons: Relies on model assumptions, may not perform well for rare categories
Deterministic: Take the predicted value as is
Stochastic: Add an element of randomness to the predicted value to preserve variability and reduce bias
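The difference between deterministic and stochastic model-based imputation can be sketched with a simple linear regression. This is an illustration under stated assumptions, not a definitive implementation: the model is a one-predictor least-squares line, the noise is drawn from a normal distribution with the residual standard deviation, and the data and the helper names fit_line and regression_impute are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_impute(xs, ys, stochastic=False, seed=0):
    """Impute missing ys (None) from a line fitted to the complete pairs.

    Needs at least three complete pairs so the residual SD is defined.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in pairs], [y for _, y in pairs])

    # Residual standard deviation (n - 2 degrees of freedom).
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = (sum(e ** 2 for e in residuals) / (len(pairs) - 2)) ** 0.5

    rng = random.Random(seed)  # seeded so the result is reproducible
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x          # deterministic prediction
            if stochastic:
                y += rng.gauss(0.0, sigma)     # add residual-scale noise
        filled.append(y)
    return filled


xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, 9.8, None]
deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, stochastic=True, seed=1)
print("deterministic:", deterministic)
print("stochastic:   ", stochastic)
```

Deterministic imputations fall exactly on the fitted line, so they make the relationship look stronger than it is; the stochastic version scatters the imputed points around the line in the same way the observed points are scattered.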