9.26: Wrapping up MLR

Learning objectives

  • describe the basics of the bias-variance tradeoff

  • describe Simpson’s paradox

  • explain when it is useful to standardize predictors in linear regression

Classwise videos

These videos cover three miscellaneous topics to wrap up our conversation about linear regression.

If you have questions as you watch the videos, feel free to send me an email or slack message! I will address common questions at the beginning of class.

The first video introduces the bias-variance tradeoff. This is more relevant to machine learning than inferential statistical modeling, so you will learn more about it next semester. However, I want to go ahead and introduce the concept.

The second video discusses Simpson’s paradox. This relates to the concept of confounding and emphasizes the importance of using multiple linear regression rather than simple linear regression. The video also covers tidyverse functions group_by() and count() which will be useful as you explore data.

Finally, see this tutorial on standardizing predictors. Focus on the z-score and centering, and the sections on when and when not to standardize. Note that this is much more relevant when prediction is the goal. Standardizing predictors changes the interpretation of the coefficient estimates; therefore, when inference/interpreting estimates the goal, we generally do not standardize.

Textbook

Bias-variance tradeoff: ISLR 2.2.2

Application Exercise

Simpson’s Paradox

  1. Air travelers would like their flights to be on time. Airlines collect data about on-time arrivals and report them to the Department of Transportation (DoT). Here is one month’s data for flights for two airlines from five western cities.
Airline On time Late
Alaska Airlines 3274 501
America West 6438 787
  • Calculate the percentage of flights that are on time for each airline and the percentage of flights that are late for each airline.

  • Based on the calculated percentages, which airline has the better on-time record?

In the table below, the data are broken down by city.

City AA - on time AA - late AW - on time AW - late
Los Angeles 497 62 694 117
Phoenix 221 12 4840 415
San Diego 212 20 383 65
San Francisco 503 102 320 129
Seattle 1841 305 201 61
Total 3274 501 6438 787
  • Compute the percentage on-time and percentage late values for each airline and city

  • Which airline has the best on-time record for each city?

    • Los Angeles:

    • Phoenix:

    • San Diego:

    • San Francisco:

    • Seattle:

  • Explain why this is an example of Simpson’s paradox

  1. Explain how the following image relates to Simpson’s paradox.
  2. Simpson’s paradox with COVID vaccination data

References

Simpson’s paradox airline activity: http://facweb.cs.depaul.edu/brewster/lsp121/Files/Activity%206-ans.pdf

Simpson’s paradox image: https://stats.stackexchange.com/questions/478463/examples-of-simpsons-paradox-being-resolved-by-choosing-the-aggregate-data

Key Takeaways

  • The key takeaway of this exercise is that aggregated data can mask the relationship between two variables that changes with the inclusion of a third variable. This is one reason that we do multiple linear regression instead of simple linear regression.

  • The exercise also helps to develop statistical literacy. When you see aggregated data/percentages in the news, consider if it might be misleading. Is there a third variable that would help to shed light on the relationship between the two variables?