9.14: Model selection

Learning objectives

  • understand forward and backward selection

  • understand the argument against forward and backward selection

Classwise videos

If you have questions as you watch the videos, feel free to send me an email or slack message! I will address common questions at the beginning of class.

There are no chat sessions for this lecture.

The first video explains the forward and backward model selection procedures.

The reading below presents an argument against using forward and backward selection (these methods are also known as stepwise selection). Read the full introduction (pages 1-5) and the conclusion. You can skip the methods and results sections.

Textbook

ISLR 6.1.2

Application exercise

Class exercise vote

Groups for today’s exercise

In your group, discuss the following based on the video and article:

  1. Summarize the forward and backward stepwise selection procedures.
  2. Why are forward and backward selection appealing?
  3. What statistical arguments does Smith make against stepwise selection?
  4. Consider this sentence: “The standard errors of the coefficient estimates are underestimated, which makes the confidence intervals too narrow, the t statistics too high, and the p-values too low…” Think about the relationship between standard errors and the other quantities mentioned. Why would underestimated standard errors lead to the outcomes listed?
  5. Explain how big data has renewed the interest in stepwise selection and why Smith argues that big data exacerbates the problems with stepwise selection
  6. What alternative approaches to model selection does Smith recommend?
  7. Imagine yourself working on a data science team in the future. A colleague recommends using the following two approaches. Based on the reading, which of these approaches seems reasonable? If none of them seem reasonable, how do you explain this to your colleague and how do you recommend moving forward?
    • Fit a model with pre-specified predictors and use backward selection to arrive at a “final model.” Interpret the output and draw conclusions based on the final model.
    • Fit a model with pre-specified predictors. Remove any variables with p-values that are above 0.05. Re-fit the model and interpret the output.
    • Fit simple linear regression models with each predictor individually. Then, fit a multiple linear regression model with the predictors that had significant p-values in the SLR model. Interpret the MLR output.