9.05: Estimating coefficients and assessing the model

Learning objectives

  • define “sum of squared errors”

  • describe the concept of ordinary least squares

  • describe (adjusted) \(R^2\) and how it relates to model assessment

Classwise videos

If you have questions as you watch the videos, feel free to send me an email or Slack message! I will address common questions at the beginning of class.

I know many students are having trouble with Classwise recognizing videos as completed, so I will not count Classwise completion grades until the technical issues are sorted out.

The third video, which shows the derivation of the OLS estimators, is optional.
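
For reference, the sum of squared errors and the estimators that such a derivation arrives at can be written as follows for simple linear regression (one predictor). This is a sketch in ISLR-style notation; the symbols \(b_0\), \(b_1\), \(\bar{x}\), and \(\bar{y}\) are assumptions here, and the video's notation may differ.

\[
\mathrm{SSE}(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
\]

Ordinary least squares chooses the values of \(b_0\) and \(b_1\) that minimize this quantity, which gives

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means of the predictor and the outcome.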

Textbook

ISLR sections 3.1-3.3

Application exercise (to complete during the class meeting)

Palmer penguins data

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

library(tidymodels)
library(tidyverse)
library(palmerpenguins)
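
With these packages loaded, a first look at the data might go roughly as follows. This is a minimal sketch rather than the required approach; it assumes the `penguins` data frame that ships with palmerpenguins, and the missing-value count and factor check are just one way to answer the questions in step 1.

```r
library(tidyverse)
library(palmerpenguins)

# Compact overview: one line per variable, with its type and first few values
glimpse(penguins)

# Count the missing values in each column
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# List the variables that are stored as factors
penguins %>%
  select(where(is.factor)) %>%
  names()
```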
  1. Use the glimpse() output to determine the following:

    • sample size and number of variables

    • Which variables are categorical and which are numeric? Is there any missing data? Are the categorical variables stored appropriately?

  2. Select an appropriate outcome and primary (continuous) predictor of interest.

    • Generate a scatter plot for the two variables (remember to label your axes!)

    • Color the scatter plot by sex. Does an interaction term seem appropriate?

    • Color the scatter plot by species. Does an interaction term seem appropriate?

  3. Fit a model regressing the outcome variable you selected on the primary predictor of interest, sex, and species.

    • Write interpretations for the coefficient estimates, p-values, and confidence intervals

    • Is the species variable statistically significant? Conduct the appropriate test.

      • Note: The anova function computes a nested F test:
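mod_reduced <- NULL  # replace NULL with the reduced model object (fit without species)
mod_full <- NULL     # replace NULL with the full model object (fit with species)

# Compare the two nested models with an F test
anova(mod_reduced, mod_full, test = "F")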
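
As a concrete illustration of the template, here is a sketch that assumes `body_mass_g` as the outcome and `flipper_length_mm` as the primary predictor. Your own choice of variables may differ, and the object name `penguins_cc` is hypothetical.

```r
library(palmerpenguins)

# Keep complete cases so both models are fit to exactly the same rows;
# anova() stops with an error if the models were fit to different numbers of rows
penguins_cc <- na.omit(penguins)

# Reduced model: primary predictor and sex, without species
mod_reduced <- lm(body_mass_g ~ flipper_length_mm + sex, data = penguins_cc)

# Full model: adds species (two indicator terms for the three species)
mod_full <- lm(body_mass_g ~ flipper_length_mm + sex + species, data = penguins_cc)

# Nested F test of the null hypothesis that both species coefficients are zero
anova(mod_reduced, mod_full, test = "F")
```

Because species has three levels, the test compares the models on two degrees of freedom at once, which is why a single coefficient p-value is not enough to judge the variable as a whole.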
  4. Add an interaction term that seems appropriate based on the EDA from step 2. Interpret the p-value and coefficient estimate for the interaction term in the context of the dataset.

  5. Interpret the adjusted \(R^2\) value for the model with the interaction term. Compare the value to the adjusted \(R^2\) value obtained from a model that does not include the interaction term. What do you conclude from this comparison?

  • Note: You can see the \(R^2\) and adjusted \(R^2\) values in the summary(model_object) output that we used last week
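
The comparison in the last two steps could be set up along these lines. This is a sketch that again assumes `body_mass_g` as the outcome, `flipper_length_mm` as the primary predictor, and an interaction with species; the object names `penguins_cc`, `mod_main`, and `mod_int` are hypothetical.

```r
library(palmerpenguins)

# Complete cases only, so both models use the same observations
penguins_cc <- na.omit(penguins)

# Main-effects model (no interaction)
mod_main <- lm(body_mass_g ~ flipper_length_mm + sex + species, data = penguins_cc)

# Same predictors plus a flipper length by species interaction;
# a * b expands to a + b + a:b
mod_int <- lm(body_mass_g ~ flipper_length_mm * species + sex, data = penguins_cc)

# summary() stores both versions of R-squared for each model
summary(mod_main)$adj.r.squared
summary(mod_int)$adj.r.squared

# broom (installed with tidymodels) returns the same values in a one-row tibble
broom::glance(mod_int)[, c("r.squared", "adj.r.squared")]
```

The coefficient table from `summary(mod_int)` also contains the interaction estimates and p-values needed for the interpretation step.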

Key takeaways

  • Your results will depend on the variables you selected, but you probably saw that an interaction with species appears to be warranted, while an interaction with sex may not be. The scatter plot suggests this: the slope for at least one species appears to differ from the slopes for the others.

  • When an interaction term is meaningful, the adjusted \(R^2\) value should be higher for the model with the interaction term than for the model without it. Unlike \(R^2\), which never decreases when a term is added, adjusted \(R^2\) penalizes additional parameters, so an increase indicates that the interaction term explains enough additional variance in the outcome to justify the added complexity.
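
To make the comparison concrete, the two quantities can be written as follows. This is a sketch using the usual definitions for a model with an intercept; the symbols \(\mathrm{SST}\), \(n\), and \(p\) are notation assumed here, not taken from the videos.

\[
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}},
\qquad
R^2_{\text{adj}} = 1 - \frac{\mathrm{SSE} / (n - p - 1)}{\mathrm{SST} / (n - 1)}
\]

Here \(\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2\) is the total sum of squares, \(n\) is the number of observations, and \(p\) is the number of estimated coefficients excluding the intercept. Adding a term can only lower \(\mathrm{SSE}\), so \(R^2\) cannot go down, but it also raises \(p\), so adjusted \(R^2\) goes up only when the drop in \(\mathrm{SSE}\) is large enough to offset the lost degree of freedom.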