9.07: Addressing the assumptions

Learning objectives

  • list the four assumptions of MLR

  • explain why we use residual plots to assess the assumptions

  • interpret residual plots

Classwise videos

If you have questions as you watch the videos, feel free to send me an email or Slack message! I will address common questions at the beginning of class.

I know many students are having trouble with Classwise recognizing videos as completed. I will not count Classwise completion grades until we sort out the technical issues.

Textbook

ISLR sections 3.1-3.3 (specifically 3.3.3)

Application exercise

Access the Auto dataset:

library(tidyverse)
library(ISLR2)
data(Auto)
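
Once the data is loaded, it can help to check how each column is stored before deciding what should be a factor. The following is a minimal sketch, not part of the original exercise: the name col_summary is made up for illustration, and a small number of distinct values is only a hint that a numeric column may be categorical, so the data dictionary remains the authority.

```r
library(tidyverse)
library(ISLR2)
data(Auto)

# One row per column: storage class and number of distinct values.
# col_summary is a hypothetical helper name used only in this sketch.
col_summary <- tibble(
  variable   = names(Auto),
  class      = map_chr(Auto, ~ class(.x)[1]),
  n_distinct = map_int(Auto, n_distinct)
)

# Numeric columns with very few distinct values are candidates for factors.
print(col_summary)
```
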
  1. Use this page as a data dictionary. Identify which variables are categorical. Are they stored correctly in R? If not, create factor variables.
  2. Fit a model regressing mpg on horsepower, displacement, acceleration, cylinders, and origin.
    • What do you notice about the standard errors of the cylinders levels compared to the other predictors?
    • Use the count() function to generate a table showing how many cars are in each cylinder level. Having only a few observations for a particular level of a categorical variable causes problems for the model. We can either 1) exclude those observations, or 2) create a new variable that combines levels. Decide which option you would like to use here and clean the data accordingly. (Note: if you choose to exclude the observations, you can use the filter() function to subset the data.)
    • Look at the residual and qq plots for this model. What do you observe? Do any of the regression assumptions appear to be violated here? If so, which one(s)? Note that you can generate a specific diagnostic plot using the which argument, e.g., plot(model_object, which=1) generates the residuals vs. fitted plot, and plot(model_object, which=2) generates the qq plot.
  3. Generate a plot of mpg and displacement. Do you think a transformation of the displacement variable might be appropriate? If so, which one and why? Then, add color to your plot for cylinders. Is an interaction term appropriate? Finally, color by origin instead of cylinders. Is an interaction term appropriate?
  4. Repeat #3 for the other predictor variables: acceleration and horsepower.
  5. Decide if you want to use a transformation on displacement, acceleration, and/or horsepower, or interaction term(s). Consider the implications for interpreting the estimates. Fit the model with the additional terms that you choose, and generate the residual and qq plots. What do you notice compared to what you saw in #2?
  6. Fit the model using log(mpg) as the outcome. Generate the residual and qq plots and comment on the difference(s) that you observe. Consider the implications for model interpretation. Do you think the transformation of the outcome variable is useful here?
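
One way steps 1 and 2 could be carried out is sketched below. It rests on several assumptions that are not stated in the exercise: that cylinders and origin are the categorical variables stored as numbers, that sparse levels are excluded rather than combined, and that "sparse" means fewer than 10 cars (min_n is an arbitrary cutoff chosen for illustration). The object names auto, auto_clean, and fit_base are also made up for this sketch.

```r
library(tidyverse)
library(ISLR2)
data(Auto)

# Step 1 (assumption: cylinders and origin are categorical but stored as
# numbers; confirm against the data dictionary before relying on this).
auto <- Auto %>%
  mutate(
    cylinders = factor(cylinders),
    origin    = factor(origin)
  )

# Step 2: count the cars in each cylinder level.
cyl_counts <- auto %>% count(cylinders)
print(cyl_counts)

# Option 1 from the exercise: exclude sparse levels.
# min_n is a hypothetical cutoff, not a rule from the course.
min_n <- 10
keep_levels <- cyl_counts %>% filter(n >= min_n) %>% pull(cylinders)

auto_clean <- auto %>%
  filter(cylinders %in% keep_levels) %>%
  mutate(cylinders = droplevels(cylinders))  # remove the now-empty levels

# Fit the model from step 2 on the cleaned data.
fit_base <- lm(
  mpg ~ horsepower + displacement + acceleration + cylinders + origin,
  data = auto_clean
)
print(summary(fit_base))

# Diagnostic plots: which = 1 is residuals vs. fitted, which = 2 is the qq plot.
plot(fit_base, which = 1)
plot(fit_base, which = 2)
```

Combining levels instead (option 2) would replace the filter() step with a mutate() that recodes the sparse levels into a neighboring group; which grouping makes sense is a judgment call.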

Key takeaways

  • There is an argument to be made for treating the cylinders variable as either numeric or categorical. To me, it seemed better suited as a categorical variable. A takeaway here is that when a categorical variable has very few observations for a particular level, the standard error of that level's coefficient is inflated.

  • The residual plot suggests that the linearity assumption is violated. Upon further exploratory inspection, we see a non-linear relationship between mpg and both displacement and horsepower. The shape of that relationship suggests that a log transformation of these predictor variables would be appropriate. However, when we color the plots by cylinders and origin, we see that an interaction term could also be used. Either one will improve the residual plot, so at this point we might consider which one is more interpretable. Since we are interested in inference, I would lean toward using an interaction term, which is easier to interpret than a log transformation.

  • A log transformation of the outcome variable will improve the qq plot. If we do this, the coefficient estimates are interpreted as multiplicative instead of additive (see this link for helpful info on interpreting coefficients after log transformations). Since the violation isn’t too bad, and linear regression is fairly robust to violations of normality, I would probably forgo using the transformation here. However, it’s not incorrect to use it.
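
The options discussed in these takeaways can be compared side by side. The sketch below is illustrative only: it assumes the sparse cylinder levels were excluded (keeping 4, 6, and 8), and it picks displacement by origin as the interaction, which is one of several choices the plots might support. The object names fit_logx, fit_int, fit_logy, and mult_effects are made up for this sketch.

```r
library(tidyverse)
library(ISLR2)
data(Auto)

# Assumption: sparse cylinder levels are dropped and the categorical
# variables are converted to factors, as in the earlier steps.
auto <- Auto %>%
  filter(cylinders %in% c(4, 6, 8)) %>%
  mutate(cylinders = factor(cylinders), origin = factor(origin))

# Option A: log-transform the predictors that looked non-linear.
fit_logx <- lm(
  mpg ~ log(horsepower) + log(displacement) + acceleration + cylinders + origin,
  data = auto
)

# Option B: keep predictors on their original scale and add an interaction.
# displacement * origin is a hypothetical choice for illustration.
fit_int <- lm(
  mpg ~ horsepower + displacement * origin + acceleration + cylinders,
  data = auto
)

# Log-transformed outcome, same predictors as the original model.
fit_logy <- lm(
  log(mpg) ~ horsepower + displacement + acceleration + cylinders + origin,
  data = auto
)

# Residuals vs. fitted for each candidate, side by side.
old_par <- par(mfrow = c(1, 3))
plot(fit_logx, which = 1)
plot(fit_int,  which = 1)
plot(fit_logy, which = 1)
par(old_par)

# With log(mpg) as the outcome, exp(coefficient) is a multiplicative effect:
# a one-unit increase in a predictor multiplies the typical mpg by this
# factor, holding the other predictors fixed.
mult_effects <- exp(coef(fit_logy))
print(round(mult_effects, 4))
```

For example, a multiplicative effect of 0.99 for a predictor would correspond to roughly a 1% decrease in mpg per one-unit increase, which is the kind of interpretation shift to weigh against the improvement in the qq plot.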