Data analysis assignment 3

Deadline: Friday, November 17, 11:59 PM

Generalized Linear Model Tutorial

For this assignment, you will write a tutorial for one of three generalized linear models (GLMs) that we have covered. Writing a tutorial will help you solidify your conceptual understanding of the model and practice applying it in R. You will be assigned one of three types of GLMs: multinomial, ordinal, or Poisson.

Click here for your GLM assignment (you must use the one you are assigned)

Datasets

Click on the link below to download the dataset that corresponds to the model you have been assigned. Note that these are simulated datasets, which means I generated them using probability distribution functions in R. They are all clean, contain no missing data, and have three variables: Y, X1, and X2. While it is OK to use these variable names, I encourage you to create your own variable names that apply to your area of interest. You should note in your tutorial that the data are simulated. More detail about the data generation is included at the end of this page if you are interested.

Multinomial data Ordinal data Poisson data

Deliverable (4-6 pages)

Your tutorial should describe and demonstrate the GLM using the appropriate dataset. The intended audience is someone who is familiar with linear regression but unfamiliar with GLMs.

For this assignment, it is important to include code chunks in your tutorial. Your task is to explain the concepts and provide the code that your audience will need to replicate the example. Recall that you can show your code by setting the echo: true option under execute in your YAML header. Include only code that is directly relevant to the tutorial; leave out extraneous code such as install.packages() or debugging code. You should also show relevant “raw” output, such as the summary(model) output, and describe how to interpret it.

Your tutorial should contain the following sections and content:

Overview: Provide a few sentences describing the purpose of GLMs and the general structure of a GLM. As part of this, describe the purpose of a link function. Provide a few sentences to describe the purpose of your particular GLM. Include 2 examples of research questions that could be answered with your GLM.
Probability Distribution: Briefly describe the probability distribution that is assumed for the outcome. What is the support? What are the parameters and what values can they take?
The Model: Write out the general form of your GLM. What is the link function and why is it appropriate for that type of outcome? What are the model assumptions?
Data Example:
- Introduce the dataset. Provide a few summary statistics and/or plots. Include the fact that this is a simulated dataset.
- Fit the model. Include all relevant code, including library() for packages if needed.
- Explain how to interpret coefficient estimates for the predictors.
- Show and describe a plot that illustrates the results of the model.
- Describe how to assess the model and include the code to do so. Include how to assess any assumptions that are unique to that model (e.g., proportional odds, overdispersion). Note: For the ordinal model, assess the assumption using the method shown in the class exercise, not the hypothesis test shown in the videos.
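As a rough illustration of the Data Example steps for the Poisson case only, here is a minimal sketch using base R. It regenerates data with the same recipe as the Data Generation code at the end of this page instead of loading the downloadable file, and the seed is an assumption added only so the sketch is reproducible. It is one possible approach, not a template for your tutorial.

```r
# Minimal sketch (assumptions: seed value, generic variable names Y, X1, X2)
set.seed(2023)  # assumed seed; not the one used for the course datasets

# Simulate Poisson data with the recipe from the Data Generation section
n  <- 344
x1 <- rnorm(n = n, mean = 20, sd = 3)
x2 <- ifelse(runif(n = n, min = 0, max = 1) < 0.3, 1, 0)
y  <- rpois(n = n, lambda = exp(-3.5 + 0.2 * x1 + 0.5 * x2))
pois_dat <- data.frame(Y = y, X1 = x1, X2 = x2)

# Fit the Poisson GLM with a log link
fit <- glm(Y ~ X1 + X2, family = poisson(link = "log"), data = pois_dat)
summary(fit)  # "raw" output to walk through in the tutorial

# Exponentiated coefficients are rate ratios:
# the multiplicative change in the expected count per one-unit increase
rate_ratios <- exp(coef(fit))
rate_ratios

# Overdispersion check: Pearson chi-square divided by residual degrees of freedom.
# Values well above 1 suggest more variability than the Poisson model assumes.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

Because the outcome here is generated from a Poisson distribution, the dispersion ratio should land near 1; with real count data it is often larger.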

Example tutorial

Getting started with linear regression

Data Generation

See the code below if you want to learn more about how the datasets were generated. This is not required. If you would like to change the values of the coefficients or the mean/sd of the predictor(s) to better fit your application of interest, you are welcome to do that.

References to learn more:

Simulating multinomial logistic regression data (also used for ordinal)

Simulating data for count models

# simulating data for GLMs

#first, we specify the sample size and generate the predictors
n <- 344
x1 <- rnorm(n=n, mean=20, sd=3)
x2a <- runif(n=n, min=0, max=1)
x2 <- ifelse(x2a<0.3,1,0) #binary predictor: 1 with probability 0.3, otherwise 0

#for each model, we generate the outcome based on the specified model

#poisson
lambda.out <- exp(-3.5 + 0.2*x1 + 0.5*x2) #first generate the mean
y <- rpois(n=n, lambda=lambda.out) #use the mean to generate the poisson outcome

pois_dat <- data.frame(Y=y,X1=x1,X2=x2) #create the dataframe


#multinomial

#recall that a multinomial model fits a separate logistic model for each non-reference level. Here, we want an outcome with 3 levels, so we specify two linear predictors below (level 1 is the reference).
lp2 <- 3 + -0.2*x1 + -0.7*x2
lp3 <- -4 + 0.2*x1 + -0.3*x2

den <- (1+exp(lp2)+exp(lp3)) #ensures the probabilities sum to 1
#we generate a probability of each outcome level for each subject
p1 <- 1/den
p2 <- exp(lp2)/den
p3 <- exp(lp3)/den

p <- cbind(p1, p2, p3)
head(p) #shows the matrix structure of the probabilities

#the apply function takes each row of probabilities and generates an outcome based on those probabilities for each subject.
y.mult <- apply(p, MARGIN=1, function(x) sample(x=1:3, size=1, prob=x))
mult_dat <- data.frame(Y=y.mult,X1=x1,X2=x2)

#ordinal - generated the same way as the multinomial data above, but with different coefficients
lp2 <- 4 + -0.2*x1 + -0.6*x2
lp3 <- -4 + 0.2*x1 + -0.2*x2

den <- (1+exp(lp2)+exp(lp3))
p1 <- 1/den
p2 <- exp(lp2)/den
p3 <- exp(lp3)/den

p <- cbind(p1, p2, p3)
head(p)

y.ord <- apply(p, MARGIN=1, function(x) sample(x=1:3, size=1, prob=x))
ord_dat <- data.frame(Y=y.ord,X1=x1,X2=x2)
table(y.ord)
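The ordinal data above are generated from the same baseline-category (multinomial) recipe, so the proportional odds assumption is not built into them. If you would like to experiment with data that satisfy proportional odds by construction, one option is to simulate from a cumulative logit model instead. The sketch below is an alternative technique, not how the course datasets were made; the seed, the coefficients, and the cutpoints are all hypothetical values chosen for illustration.

```r
# Alternative sketch: ordinal outcome from a cumulative logit (proportional odds) model
set.seed(2023)  # assumed seed, for reproducibility only

n  <- 344
x1 <- rnorm(n = n, mean = 20, sd = 3)
x2 <- ifelse(runif(n = n, min = 0, max = 1) < 0.3, 1, 0)

# One linear predictor shared by every cumulative logit (hypothetical coefficients)
eta <- 0.2 * x1 - 0.6 * x2

# Ordered cutpoints (hypothetical); they must be increasing
theta <- c(3.5, 4.5)

# Cumulative probabilities: P(Y <= j) = plogis(theta_j - eta)
c1 <- plogis(theta[1] - eta)
c2 <- plogis(theta[2] - eta)

# Differences of cumulative probabilities give the probability of each level
p_po <- cbind(p1 = c1, p2 = c2 - c1, p3 = 1 - c2)

# Draw one outcome per subject from its row of probabilities
y.po <- apply(p_po, MARGIN = 1, function(x) sample(x = 1:3, size = 1, prob = x))
po_dat <- data.frame(Y = factor(y.po, levels = 1:3, ordered = TRUE), X1 = x1, X2 = x2)
table(po_dat$Y)
```

Because the same slopes apply to both cumulative logits and only the cutpoints differ, the effect of each predictor is identical across the splits of the outcome, which is exactly what proportional odds means.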

Stellar Tutorials

The teaching team selected 3 excellent tutorials (one for each type of GLM) that you can access to reference in the future should the need arise to use one of these models. You can access them via Duke box here: Duke box folder for tutorials