9.12: Problems that can arise

Learning objectives

interpret plots of leverage and Cook’s distance
understand the problem of multicollinearity

Classwise videos

If you have questions as you watch the videos, feel free to send me an email or a Slack message! I will address common questions at the beginning of class.

Textbook

ISLR 3.3.3

Application exercise

For this exercise, we will use the Hitters dataset from the ISLR2 package:

library(ISLR2)
data("Hitters")
?Hitters #run this for the data dictionary

We want to understand how hitter statistics are associated with salaries. More specifically, we want to assess the model regressing Salary on the following predictors: Hits, HmRun, Runs, RBI, Years, CHits, CHmRun, CRuns, and CRBI. Let’s first create a correlation plot of the 9 predictors. Comment on what you observe in the correlation plot.
```
library(corrplot) #you will need to install this package
corrplot(cor(Hitters[,c("Hits", "HmRun", "Runs", "RBI", "Years", 
                        "CHits", "CHmRun","CRuns","CRBI")]))
```
Now, fit the model. Take note of the output.
Use the vif() function in the car package to calculate VIF values for each predictor. Do any of the predictors have high VIF values? If so, which one(s)?
Fit another model but without the career variables (i.e., using Hits, HmRun, Runs, RBI, and Years). Note differences in the output for these predictors between this model and the full model you fit above, in terms of estimates, standard errors, and p-values.
Look at the model diagnostic plots.
- Are there any influential points? If so, fit the model without those observation(s) and note any differences in the summary output and/or the diagnostic plots.
- Do you notice any other potential violations in model assumptions? If so, what adjustments could be made?

Key takeaways

From the correlation plot, you should see that the single-season variables and the career variables are highly correlated with each other.
We see very high VIF values for the career variables. I recommend removing them because their VIF values are so high, and much higher than those of the other predictors. I encourage you not to rely strictly on the >10 threshold.
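To make the VIF values concrete, here is a minimal sketch that computes them with vif() and then reproduces one of them by hand as 1 / (1 - R²), where R² comes from regressing that predictor on the other eight. The object names (full_fit, vif_values, manual_vif) are my own choices for illustration, and the choice of CHits for the hand calculation is arbitrary.

```r
library(ISLR2)
library(car)  # provides vif(); you will need to install this package

data("Hitters")

# Full model with all nine predictors. Rows with a missing Salary are
# dropped by lm() by default.
full_fit <- lm(Salary ~ Hits + HmRun + Runs + RBI + Years +
                 CHits + CHmRun + CRuns + CRBI, data = Hitters)

# One VIF per predictor, sorted so the largest values come first
vif_values <- vif(full_fit)
sort(vif_values, decreasing = TRUE)

# Hand calculation for CHits: regress it on the other predictors,
# using only the rows that were used to fit full_fit
predictors <- model.frame(full_fit)[, -1]  # drop the Salary column
r2_chits <- summary(lm(CHits ~ ., data = predictors))$r.squared
manual_vif <- 1 / (1 - r2_chits)
manual_vif
```

The hand-calculated value should match the CHits entry from vif(). A large VIF just means that predictor is almost perfectly explained by the others.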
After removing the career variables, we see that the standard errors for the single-season variables generally decrease. Perhaps the most notable and intuitive difference in the output is in the Years variable. We would expect salaries to generally increase as years in the league increase. In the first model, the Years variable was highly collinear with the career variables, and its estimate was negative. After removing the career variables, the standard error for Years is quite a bit lower and the estimate is positive (and statistically significant).
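One way to line up the two fits is to put the estimates and standard errors for the shared predictors side by side. This is only a sketch; the object names (full_fit, reduced_fit, comparison) are my own, not something you need to match.

```r
library(ISLR2)
data("Hitters")

# Full model (single-season + career variables) and reduced model
# (single-season variables and Years only)
full_fit <- lm(Salary ~ Hits + HmRun + Runs + RBI + Years +
                 CHits + CHmRun + CRuns + CRBI, data = Hitters)
reduced_fit <- lm(Salary ~ Hits + HmRun + Runs + RBI + Years,
                  data = Hitters)

# Predictors that appear in both models
shared <- c("Hits", "HmRun", "Runs", "RBI", "Years")

# Coefficient tables from summary(): columns include "Estimate",
# "Std. Error", and "Pr(>|t|)"
full_tab <- coef(summary(full_fit))[shared, ]
reduced_tab <- coef(summary(reduced_fit))[shared, ]

comparison <- cbind(
  full_est    = full_tab[, "Estimate"],
  reduced_est = reduced_tab[, "Estimate"],
  full_se     = full_tab[, "Std. Error"],
  reduced_se  = reduced_tab[, "Std. Error"]
)
round(comparison, 3)
```

Reading across the Years row is the quickest way to see the change in sign and standard error described above.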
For the model diagnostic plots, we see potential violations of homoscedasticity and normality. If we log-transform the outcome, the model diagnostic plots are more “cloud-like.” Log-transformations are often necessary when outcome variables have to do with money (salary, price, income, etc.).
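A short sketch of the log-transformed fit and its diagnostic plots is below. I am assuming here that the transformation is applied to the reduced model (without the career variables); the name log_fit is my own.

```r
library(ISLR2)
data("Hitters")

# Same predictors as the reduced model, but with log(Salary) as the outcome.
# Assumption: the reduced model is the one being diagnosed.
log_fit <- lm(log(Salary) ~ Hits + HmRun + Runs + RBI + Years,
              data = Hitters)

# The four default diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(log_fit)
par(mfrow = c(1, 1))  # reset the plotting layout
```

Compare the residuals vs. fitted and normal Q-Q panels with the same panels from the untransformed model to judge whether the transformation helped.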
We see a few outliers and high-leverage points, though no influential points are identified with Cook’s distance. If we remove Mike Schmidt, Rickey Henderson, and Terry Kennedy, the model output is slightly different, but we don’t see substantial differences (i.e., conclusions drawn from p-values are not different).
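The check for influential points and the refit without those three players could look something like the sketch below. I am assuming the reduced model on the original Salary scale and that the player names appear in the row names of Hitters; the object names (cooks, lev, drop_rows, trimmed_fit) are my own.

```r
library(ISLR2)
data("Hitters")

reduced_fit <- lm(Salary ~ Hits + HmRun + Runs + RBI + Years,
                  data = Hitters)

# Cook's distance and leverage (hat values) for every observation in the fit
cooks <- cooks.distance(reduced_fit)
lev <- hatvalues(reduced_fit)

# Largest values of each; Cook's distances above roughly 0.5 or 1 are a
# common rule of thumb for flagging influential points
head(sort(cooks, decreasing = TRUE), 5)
head(sort(lev, decreasing = TRUE), 5)

# Residuals vs. leverage plot with Cook's distance contours
plot(reduced_fit, which = 5)

# Refit without the three players, matched by name in the row names
drop_rows <- grepl("Mike Schmidt|Rickey Henderson|Terry Kennedy",
                   rownames(Hitters))
trimmed_fit <- update(reduced_fit, data = Hitters[!drop_rows, ])

# Compare coefficient tables with and without those observations
round(coef(summary(reduced_fit)), 3)
round(coef(summary(trimmed_fit)), 3)
```

If the two coefficient tables lead to the same conclusions, there is little reason to drop the observations.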