Homework 3 Part 2

Due: Friday, October 17th, 11:59 PM

Using the provided Qmd template, complete the following exercises and submit the document with your answers on Gradescope, which you can access through Canvas. You must show your work for all problems, and you must provide a written answer for all problems. For example, you should write “There are XX observations and each observation represents a _____”; it is insufficient to just show the code.

This assignment will use a modified version of a College Completion dataset. You can read more about the dataset here (or here, if that link doesn’t work for you). The relevant portion of the codebook is provided below.

You want to understand factors that affect graduation rate.

| Variable | Description |
|---|---|
| control | Control of institution (Public, Private not-for-profit, Private for-profit) |
| basic | Carnegie Foundation for the Advancement of Teaching Basic Classification (2010 version) |
| student_count | Total number of undergraduates in 2010 |
| ft_pct | The percentage of full-time students |
| med_sat_value | The median estimated SAT score |
| endow_value | End-of-year endowment value per full-time equivalent student |
| grad_100_value | The graduation rate within 100% of normal time |
| grad_150_value | The graduation rate within 150% of normal time |
| retain_value | Share of freshman students retained for a second year |

Questions

Use the complete case dataset with factored categorical variables created in HW 3 part 1.
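
As a reminder of what that preparation might look like, here is a minimal sketch on a small made-up data frame. The object name `college_raw` and all of its values are hypothetical, and the two-level coding of `control` is an assumption; your own HW 3 part 1 code and the actual data take precedence.

```r
# Hypothetical toy data standing in for the real dataset (values are made up).
college_raw <- data.frame(
  control        = c("Public", "Private", "Public", "Private", NA),
  student_count  = c(12000, 1800, NA, 950, 4300),
  grad_150_value = c(55.2, 71.4, 48.0, 62.3, 39.9)
)

# Keep only rows with no missing values in any column (complete cases).
college <- college_raw[complete.cases(college_raw), ]

# Store the categorical variable as a factor; the first level is the reference group.
college$control <- factor(college$control, levels = c("Public", "Private"))

nrow(college)            # number of complete cases
levels(college$control)  # the first level listed is the reference level
```

The order of the factor levels matters later: `lm()` reports coefficients for each non-reference level relative to the first level.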

First, consider the model regressing graduation rate within 150% of normal time on student count and control (public vs private).
a. Fit the model.
b. Generate the residuals vs fitted and QQ diagnostic plots for the model. Comment on what you observe in the residuals vs fitted plot and the QQ plot.
   i. Based on the diagnostic plots, does the linearity assumption appear to be violated?
   ii. Based on the diagnostic plots, does the normality assumption appear to be violated?
   iii. Based on the diagnostic plots, does the homoscedasticity assumption appear to be violated?
   iv. Based on the scatter plot you generated in HW 3 part 1 with these variables, are these diagnostic plots surprising? Why or why not?
c. Generate two scatter plots of graduation rate vs student count that distinguish private from public institutions: use log(student count) on the x-axis for one plot and sqrt(student count) for the other. Based on what you observe in these plots, which transformation might be best to improve linearity?
d. Fit the model again using the transformation you selected in part c. Generate the diagnostic plots for this model. Comment on what you observe in the residuals vs fitted plot and QQ plot here compared to what you saw in part b.
e. What are the adjusted \(R^2\) values for the two models (part a and part d)?
f. Based on the results for parts a-e, which of these two models do you think is better? Display the summary table and 95% confidence intervals for that model. Then, write the final fitted model.
g. Interpret the coefficient estimates, p-values, and 95% confidence intervals in the context of the problem. You do not need to interpret the intercept.
h. What are some possible next steps to further improve this model? You do not need to fit a new model.
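
The mechanics of the steps above can be sketched in base R as follows. This is an illustration on simulated data, not the homework data: the data frame `sim`, the object names `fit_raw`, `fit_log`, and `adj_r2`, and the choice of a log transformation are all assumptions made for the example, and nothing here indicates which transformation suits the real dataset.

```r
# Simulated data for illustration only; these are NOT the homework data.
set.seed(1)
n <- 200
sim <- data.frame(
  student_count = ceiling(exp(rnorm(n, mean = 8, sd = 1))),
  control       = factor(sample(c("Public", "Private"), n, replace = TRUE),
                         levels = c("Public", "Private"))
)
sim$grad_150_value <- 10 + 5 * log(sim$student_count) +
  8 * (sim$control == "Private") + rnorm(n, sd = 6)

# One model with the predictor on its original scale, one with it transformed.
fit_raw <- lm(grad_150_value ~ student_count + control, data = sim)
fit_log <- lm(grad_150_value ~ log(student_count) + control, data = sim)

# Residuals vs fitted (which = 1) and normal QQ plot (which = 2) for each model.
par(mfrow = c(1, 2))
plot(fit_raw, which = 1:2)
plot(fit_log, which = 1:2)

# Adjusted R^2 for each model, taken from the model summaries.
adj_r2 <- c(raw = summary(fit_raw)$adj.r.squared,
            log = summary(fit_log)$adj.r.squared)
adj_r2

# Summary table and 95% confidence intervals for one of the models.
summary(fit_log)
confint(fit_log, level = 0.95)
```

Swapping `log(student_count)` for `sqrt(student_count)` in the formula gives the corresponding square-root version.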

Next, consider the model that assesses how the relationship between full-time student percentage and graduation rate within 150% of normal time depends on institution control (public vs private), controlling for retention rate and median SAT score.
a. Fit the model and generate the diagnostic plots. Comment on what you observe. Specifically, address the following:
   i. Based on the diagnostic plots, does the linearity assumption appear to be violated?
   ii. Based on the diagnostic plots, does the normality assumption appear to be violated?
   iii. Based on the diagnostic plots, does the homoscedasticity assumption appear to be violated?
b. Display the summary table. Write the fitted models for public and private institutions. Is the relationship between percentage of full-time students and graduation rate stronger for public or private institutions?
c. What is the adjusted \(R^2\) for this model? Write an interpretation for this metric.
d. Fit the same model but with a log transformation of the outcome. Generate the diagnostic plots. Are the diagnostics improved compared to part a?
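
For reference, a model in which the effect of one predictor depends on a group is usually written with an interaction term. The notation below is an assumption made for illustration: \(\text{private}\) is an indicator equal to 1 for private institutions and 0 for public ones (so public is taken as the reference level), \(\text{grad}\), \(\text{ft}\), \(\text{retain}\), and \(\text{sat}\) are shorthand for the codebook variables, and the \(\beta\) symbols are generic coefficients rather than estimates.

\[
\begin{aligned}
\text{grad} &= \beta_0 + \beta_1\,\text{ft} + \beta_2\,\text{private} + \beta_3\,(\text{ft} \times \text{private}) + \beta_4\,\text{retain} + \beta_5\,\text{sat} + \varepsilon \\
\text{Public } (\text{private} = 0):\quad E[\text{grad}] &= \beta_0 + \beta_1\,\text{ft} + \beta_4\,\text{retain} + \beta_5\,\text{sat} \\
\text{Private } (\text{private} = 1):\quad E[\text{grad}] &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{ft} + \beta_4\,\text{retain} + \beta_5\,\text{sat}
\end{aligned}
\]

Under this coding, the slope for full-time percentage is \(\beta_1\) for public institutions and \(\beta_1 + \beta_3\) for private institutions, so \(\beta_3\) is the difference between the two slopes. If your factor uses a different reference level, the roles of the two groups are reversed.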

Bonus

One way to address heteroscedasticity in linear models is to use robust standard errors, which account for non-constant variance. Use this page to learn about robust standard errors and how to use the sandwich and lmtest packages to incorporate them into a linear model in R. Choose the model from the first or second question above for which robust standard errors would be more appropriate. Refit the model using these packages and compare the results to what you originally observed.
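
A minimal sketch of how the two packages fit together is shown below, again on simulated data rather than the homework data. The variables `x` and `y`, the object names, and the choice of the `"HC3"` estimator are assumptions made for the example; the page linked above may recommend a different type.

```r
library(sandwich)  # vcovHC(): heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest(), coefci(): inference with a supplied covariance matrix

# Simulated data whose error spread grows with x (illustration only).
set.seed(2)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Usual standard errors, which assume constant error variance.
coeftest(fit)

# Robust standard errors; HC3 is one common choice, picked here as an assumption.
robust_vcov <- vcovHC(fit, type = "HC3")
robust_tab  <- coeftest(fit, vcov. = robust_vcov)
robust_tab

# Matching 95% confidence intervals based on the robust covariance matrix.
coefci(fit, vcov. = robust_vcov, level = 0.95)
```

The coefficient estimates are the same in both tables; only the standard errors, and therefore the test statistics, p-values, and confidence intervals, change.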