Homework 5

Due: Monday, November 24th 11:59 PM

Using the provided Qmd template, complete the following exercises and submit the document with your answers on Gradescope, which you can access through Canvas. You must show your work for all problems, and you must provide a written answer for all problems; it is insufficient to just show the code.

Data

This assignment uses data on global deaths from natural disasters from 1960 to 2014. You can find the original data here (except for population counts, which you can find here).

Notice the difference in structure between the original data and the version you load below. You can see how exactly I cleaned, reformatted, and combined these datasets by downloading this R script.

library(tidyverse)
library(glmmTMB)
## NOTE: Run install.packages("glmmTMB", type = "source") IN CONSOLE

disaster_deaths <- read.csv("https://raw.githubusercontent.com/anlane611/datasets/refs/heads/main/disaster_deaths_cleaned.csv")

Exercises

1. First, explore the data:
   a. Generate a scatter plot with Year on the x-axis, Deaths on the y-axis, and points colored by Region. Comment on what you observe in the plot.
   b. Calculate the count and proportion of rows with zero and with nonzero death counts.
2. Next, fit an appropriate model to regress death count on year and region, and allow the trend in death count over time to vary by region.
   a. Display the summary table and the exponentiated coefficient estimates, and interpret the results in a general sense. (Note that because you have an interaction term, you should add/subtract the appropriate coefficient estimates before exponentiating.)
   b. Compare the AIC values between this model and a negative binomial model, as shown in class. What can you say about the assumption of a Poisson model from this comparison?
3. Then, fit an appropriate model that incorporates population count to assess the death rate over time, again allowing this relationship to vary by region. Display the summary table and the exponentiated coefficient estimates, and interpret the results in a general sense as you did in 2a.
4. With such a large proportion of zeros, a better modeling option can be a zero-inflated Poisson model. This model fits two submodels simultaneously: a logistic regression model for whether a count is an excess zero, and a Poisson regression model for the remaining counts. Use the code below to fit the zero-inflated Poisson model. In the summary output, the “Conditional model” part refers to the Poisson count model, and the “Zero-inflation model” part refers to the logistic model.

Note that the Year variable causes numerical instability here: it is treated as numeric, and its values are large (1960 to 2014). So, we will first create a variable that counts up from the first year in the dataset (1960), so that 1960 becomes 1.
```
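# Re-index Year so that it counts up from 1 in the first year of the data (1960)
disaster_deaths <- disaster_deaths |>
  mutate(Year_1 = Year - min(Year) + 1)

# Zero-inflated Poisson model: same predictors in the count and zero-inflation parts
zip_model <- glmmTMB(
  Deaths ~ Region * Year_1,
  ziformula = ~ Region * Year_1,
  family = poisson,
  data = disaster_deaths
)

summary(zip_model)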
```
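For intuition about why a large share of zeros is a problem for a plain Poisson model, here is a minimal sketch on simulated data (not the homework data). The values of `n`, `p_zero`, and `lambda` are arbitrary assumptions chosen for illustration, and the simulation assumes the usual mixture setup in which a zero can come either from the excess-zero part or from the Poisson part.

```r
# Hypothetical simulation (not the homework data): excess zeros mixed with Poisson counts
set.seed(42)
n <- 5000
p_zero <- 0.4   # assumed probability of an excess zero
lambda <- 3     # assumed Poisson mean for the count part

# 1 = excess zero, 0 = draw the value from the Poisson part
excess_zero <- rbinom(n, size = 1, prob = p_zero)
y <- ifelse(excess_zero == 1, 0, rpois(n, lambda))

# Share of zeros actually observed in the simulated outcome
obs_zero <- mean(y == 0)
# Share of zeros that a plain Poisson with the same mean would imply
pois_zero <- exp(-mean(y))

c(observed = obs_zero, poisson_implied = pois_zero)
```

If the observed share of zeros is much larger than the share implied by a Poisson distribution with the same mean, that gap is what the zero-inflation part of the model is meant to absorb.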
   a. Change family to “nbinom2” in the zero-inflated model code to fit a zero-inflated negative binomial model. Compare the AIC values. What can you say about the assumption of a Poisson model from this comparison?
   b. Compare the results of the most appropriate model based on part a to the model that you fit in #2. (Note that you can extract coefficient estimates using fixef(zip_model)$cond.)
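
The earlier note about interaction terms (add the appropriate coefficients, then exponentiate) can be sketched on simulated data. Everything here is hypothetical and unrelated to the homework dataset: the variables `group`, `t`, and `y`, and the slopes used to generate the data, are assumptions made only to show the arithmetic.

```r
# Hypothetical simulated data (not the homework dataset): two groups, 30 time points each
set.seed(1)
sim <- expand.grid(t = 1:30, group = c("A", "B"))

# Assumed log-rate slopes: 0.02 per unit of t in group A, 0.05 in group B
true_slope <- ifelse(sim$group == "A", 0.02, 0.05)
sim$y <- rpois(nrow(sim), lambda = exp(1 + true_slope * sim$t))

# Poisson regression with a group-by-time interaction
fit <- glm(y ~ group * t, family = poisson, data = sim)
b <- coef(fit)

# Reference group A: exponentiate the main effect of t
rr_A <- exp(b[["t"]])
# Group B: add the interaction to the main effect first, then exponentiate
rr_B <- exp(b[["t"]] + b[["groupB:t"]])

c(A = rr_A, B = rr_B)
```

Exponentiating the interaction coefficient on its own gives the ratio of the two groups' rate ratios, not group B's rate ratio, which is why the sum is taken on the log scale first.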

Bonus

What other modeling assumption could be violated with this dataset, and how could it be addressed? You do not need to implement this change. (Hint: a bonus question from a previous homework will be helpful!)