9.19 and 9.21: Prediction

Learning objectives

  • distinguish between inference and predictive modeling

  • use cross validation to assess predictive models

Classwise videos

If you have questions as you watch the videos, feel free to send me an email or slack message! I will address common questions at the beginning of class.

Having technical issues, but you can click the links to watch the videos and download the annotated notes.

Video 1: Prediction intro

Video 2: Prediction modeling metrics

Video 3: Cross validation

Notes:

Video 1
Video 2
Video 3

Textbook

ISLR 2.1.1, 2.1.3, 2.2

Application Exercise (9/19)

Groups for today’s exercise

Today’s exercise will help you to solidify your understanding of (R)MSE and k-fold cross validation by “manually” implementing them in R.

You should be familiar with the following coding concepts: writing functions, for loops, and if-else statements. If you are not familiar with how to write these things in R, consult the R for Data Science resource (Functions, Iteration).

Write a function that performs \(K\)-fold cross validation. The function should take 3 inputs: a dataset, a value of \(K\), and an evaluation metric. The options for the evaluation metric should be adjusted \(R^2\) , MSE, and RMSE.

The end goal is to be able to compare predictive ability of different models. For example, you may want to use your function with the Auto dataset to do the following:

library(ISLR2)
data("Auto")

#you can name your function whatever you'd like. Note that to compare different models, we want to use the same metric and value of K. You can specify the models you want to compare within the function. For example, you may want to compare three models with the Auto dataset: mpg regressed on displacement, weight, and acceleration; mpg regressed on displacement, weight, acceleration, and a factor variable for cylinder; and mpg regressed on displacement, weight, acceleration and interaction terms for displacement and factored cylinders and weight and factored cylinders. You can also take the extra step of making the function more general for comparing models, but that is more complicated.
crossval(Auto, K=10, metric="RMSE")

You are welcome to make the function more complex (additional inputs, etc). For example, maybe you’d like your function to take multiple models and return the model with the best metric (lowest MSE/RMSE or highest Adj \(R^2\) )

You should consider using the following:

  • the predict() function takes a model object and “newdata” and returns all of the predicted values. You can use this to calculate test MSE and RMSE. Use the formula for MSE and RMSE and nested functions to calculate each of these in a single line of code. Run ?predict in the console to learn more.

  • Adjusted \(R^2\) can be accessed in the summary(lm()) output, i.e., summary(mod_object)$adj.r.squared

  • The caret package has a function called createFolds() that you can use in your function to create the \(K\) folds in the given dataset. You can also try doing this manually with the sample(), cut(), or split() functions.

Going Further

If you have extra time, you can consider some extensions:

  • Add function inputs to make it more general. For example, the user could specify models to compare

  • Use the trainControl() and train() functions from the caret package to perform k-fold cross validation. Compare the results to what you get using your function.

  • Simulate data and use your function to better understand the effects of sample size and value of \(k\). The rnorm() function generates a vector of random values from the normal distribution. You can specify the mean and standard deviation of the distribution from which you take random draws. You can then add outliers to your predictors to see the effect of small samples and extreme values in the cross validation splits.

Application Exercise (9/21)

Use the same groups that you were in on Tuesday.

Note: Focus on #1-4. Particularly for folks who are new to stats/R, skip the “go further” parts and come back to them if you have time.

For this exercise, we will use the Boston dataset from the ISLR2 package.

Click here for the data dictionary

We want to build a model to predict the median home value from various predictors.

  1. Start by using the data dictionary to determine if any data cleaning needs to be done. Is there any missing data?

  2. Fit the full model. The code below uses the caret package to perform k-fold cross validation and produces RMSE, \(R^2\) , and MAE. The trainControl function establishes the method you will use to evaluate your model. In this case, we use the cross validation option for method (“cv”) and set number=10 for 10-fold cross validation. Then, the train function performs 10-fold CV on the indicated model. Note that the model syntax is the same as what we’ve used in lm (the “.” indicates that we want to regress on all variables in the dataset except the outcome). Then we specify method="lm" for a linear model, and trControl=train_control to use the 10-fold CV method we stored in the train_control object. Finally, the print function gives us the model metrics.

    • Go further: Run ?R2 in the console to find out how caret calculates \(R^2\) here. This may help to provide a more intuitive interpretation of out-of-sample \(R^2\)

      #note that cross validation involves a random sampling component, so we should use a seed for reproducibility
      set.seed(921) 
      
      train_control <- trainControl(method = "cv",
                                                number = 10)
      
      mod_full <- train(medv ~., data = Boston,
                           method = "lm",
                           trControl = train_control)
      
      #get model metrics
      print(mod_full)
  3. Look at the diagnostics for the full model. Which assumptions appear to be violated (if any)? Would it help to transform the outcome or predictor variable(s)? Would an interaction term help? If our focus is prediction, how much should we prioritize model interpretability when we’re considering variable transformations?

    • Go further: There are several R packages that can help efficiently visualize your data. As we’ve seen before, visualizing can help determine if variable transformations/interaction terms may improve the model. You can explore packages like DataExplorer and SmartEDA.
  4. After you determine which transformations/interaction terms improve the model diagnostics, perform cross validation again on your new model. Do you notice a reduction in RMSE?

  5. We can also use cross validation to perform variable selection. This can help to determine which combination of features leads to the best predictions. The freControl function establishes that we will use linear regression and cross validation. The rfe function performs the variable selection procedure. Using cross validation, we will assess the out-of-sample RMSE for a 1-variable model, 2-variable model, etc, and select the optimal predictors based on the lowest RMSE.

    set.seed(921)
    
    ctrl <- rfeControl(functions = lmFuncs,
                       method = "cv")
    
    # standardize the predictors so that they are comparable
    # You will need to edit this code if you transformed any variables or added interaction terms
    boston_standardize <- as.data.frame(scale(Boston))
    boston_standardize <- boston_standardize |> select(-c("medv"))
    
    lmProfile <- rfe(boston_standardize, Boston$medv, #specify the X and Y variables
                     sizes=c(1:12), # specify the number of variables you want to compare. Here we are assessing all possible numbers from 1-12 predictor variables
                     rfeControl = ctrl)
    
    lmProfile #this gives the model metrics and shows which number of variables has been selected as optimal
    predictors(lmProfile) #this lists the predictors that were selected

Key Takeaways

  • The first exercise is designed to help you think through the steps of cross validation. Below is an example function (provided by one of your classmates 😃)
#first, a function is written to calculate MSE
MSE <- function(real_value, estimated_value){
    result = (1/length(real_value))*sum((real_value-estimated_value)^2)
    return(result)
}

crossval <- function(df, k) { #CV function takes the data & # folds
  metrics = c()
  split_lst = df %>% #split the dataset into folds (could also use createFolds)
    group_by((row_number()-1) %/% (n()/k)) %>%
    nest %>%
    pull(data)
  for (i in 1:k) { #loop through folds
    test_df <- split_lst[[i]]
    train_df <- bind_rows(split_lst[-i]) #designates train & test set
    model <- lm(data=train_df, mpg ~ acceleration + horsepower)
    #calculate predicted values for MSE
    preds = predict(model, test_df[c('acceleration', 'horsepower')])
    mse = MSE(test_df[c('mpg')], preds) #use MSE function 
    metrics <- c(metrics, mse) #store test MSE for all values of k
  }
  return(meanMSE=mean(metrics)) #output the mean test MSE
}

crossval(df=Auto, k=10) #run the function
  • The second exericse walks through the implementation of cross validation using the caret package. Note that with prediction problems, we don’t prioritize interpretation of the model. So, there is more flexibility to add variable transformations to improve model diagnostics. The biggest takeaway here was that adding transformations improve model diagnostics and reduce RMSE, thereby improving prediction accuracy of the model.

  • We can also perform variable selection with cross validation, which is a different process than the model selection procedures we introduced last week. Forward/backward/stepwise selection base inclusion decisions on p-values, which is frowned upon in the statistics community. The procedure in this exercise bases variable selection on cross validation and RMSE to improve predictions.