Data analysis assignment 2

Deadlines

First due date for peer review: Thursday, October 19, in class. To submit your assignment for peer review, log in to canvas.duke.edu with your Duke NetID and click “Assignments > Data analysis assignment 2 - peer review”. You can do this before class or during the first few minutes of class.

Final product due date: Friday, October 27, 11:59 PM

Logistic regression model documentation

An important part of being a data scientist is documenting your modeling process. You are often not the only person who will analyze a given dataset. A data scientist who analyzes data after you should be able to clearly understand (and be able to replicate) your modeling decisions.

For this assignment, you will be assigned one of two datasets. Your deliverable will be a model documentation report for your dataset. You will then swap reports with a classmate who has the other dataset, and each of you will evaluate the other's documentation for clarity. After the peer review, you will be able to revise your documentation before final submission.

Given this structure, your grade will be based on the following three components:

  1. Quality of your peer review for a classmate (Oct 19)
  2. Feedback from peer review of your initial document (Oct 19)
  3. Your final documentation report (due Oct 27)

The datasets

Click here for the dataset assignments (you must use the one you are assigned).

The reddit_finance dataset contains a subset of the official results of the 2020 Financial Independence Survey from Reddit.
Research Question: Which factors contribute to whether or not someone considers themselves to be financially independent?
Click here for the data dictionary and source information
You can access the data using the openintro library:
library(openintro)
data("reddit_finance")

The resume dataset contains experimental data from a study seeking to understand factors that influence whether or not a resume is selected for a callback.
Research Question: How do race and gender influence job application callback rates?
Click here for the data dictionary and source information
You can access the data using the openintro library:
library(openintro)
data("resume")

Deliverables:

Model documentation (3-5 pages)

Your model documentation should demonstrate how you analyzed the dataset and justify the decisions you made during the analysis. The intended audience is someone who is familiar with statistical modeling but new to the project/dataset. You can choose how to organize and format your report. For example, you may choose to use bullet points in some sections instead of paragraphs. You may also include code chunks in the documentation, but any code you present should serve a clear purpose.

Your documentation should cover the following elements:

  • Overview: Provide an overview of the dataset and the goals of the analysis. Provide basic information about the data (e.g., sample size).

  • Data cleaning: Which variables required cleaning? Are there any missing values? Did you make any assumptions during the data cleaning process?

  • Modeling:

    • Justify the choice to use logistic regression to answer the research question.

    • Variable selection: Which variables did you include in your model? Start with a priori variable selection, keeping in mind the principle of confounding. If you decide to exclude variables after the initial a priori selection, explain why and how you chose which ones to exclude. You do not need to do feature selection with cross-validation, but you can if you want to. Remember that we should not select variables based on p-values.

    • Summary output table: coefficient estimates, standard errors, p-values, confidence intervals

    • Model assessment: Which metric did you use to assess your model and why? Include at least one figure that relates to the model assessment.

  • Results: Interpret key results that relate to the research question. Include at least one figure to illustrate results. Note that this should be a descriptive figure related to the relationship of interest in the research question; this should not be a diagnostic plot.

  • Future work: What are the strengths and limitations of this analysis?
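
The modeling elements above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not a template for your report: the data are simulated, the column names (callback, group, experience) are hypothetical stand-ins rather than variables from either assigned dataset, the confidence intervals are Wald intervals, and AUC is just one of several metrics you could justify.

```r
# Simulated stand-in data; column names are hypothetical, not from the assigned datasets
set.seed(1)
n <- 500
dat <- data.frame(
  group = factor(sample(c("a", "b"), n, replace = TRUE)),
  experience = rpois(n, lambda = 5)
)
# True model used only to generate the toy binary outcome
p <- plogis(-1 + 0.5 * (dat$group == "b") + 0.1 * dat$experience)
dat$callback <- rbinom(n, size = 1, prob = p)

# Data cleaning check: count missing values in each column
missing_counts <- colSums(is.na(dat))

# Fit a logistic regression with predictors chosen a priori
fit <- glm(callback ~ group + experience, data = dat, family = binomial)

# Summary table: estimates, standard errors, p-values, 95% Wald intervals
coefs <- summary(fit)$coefficients
ci <- confint.default(fit, level = 0.95)
summary_table <- data.frame(
  term = rownames(coefs),
  estimate = coefs[, "Estimate"],
  std_error = coefs[, "Std. Error"],
  p_value = coefs[, "Pr(>|z|)"],
  conf_low = ci[, 1],
  conf_high = ci[, 2],
  row.names = NULL
)
print(summary_table)

# Model assessment: in-sample AUC via the rank-sum (Mann-Whitney) identity
pred <- fitted(fit)
n_pos <- sum(dat$callback == 1)
n_neg <- sum(dat$callback == 0)
auc <- (sum(rank(pred)[dat$callback == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
print(auc)
```

The estimates in the table are on the log-odds scale; exponentiating the estimate and interval columns gives odds ratios, which are often easier to interpret in the Results section. The AUC here is computed on the same data used to fit the model, so it is likely to be somewhat optimistic.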

Peer Review

In class on Oct 19, you will evaluate a classmate’s documentation. Detailed instructions will be provided on that day. Part of your grade for this assignment is based on the quality of the feedback you give your classmate and the feedback you receive from your classmate.

Log in to canvas.duke.edu to submit your assignment for peer review. You can do this before class or at the beginning of class.

Submit your review of your peer’s assignment by Friday, Oct 20, 11:59 PM.

Important!

You need to tell me ahead of time if you cannot be in class on Oct 19. Contact me by Oct 12 if you know you will not be in class that day so that I can make alternative arrangements for your peer review.

The peer review is meant to be anonymous, so make sure your name does not appear in the PDF that you submit.

Submission & Formatting Instructions/Tips

You will submit two files to Gradescope for the final submission on Oct 27: 1) a PDF that contains the documentation report, and 2) the qmd file you used to produce the report.

  • Render your Quarto document directly to PDF. Do not render to HTML and then save as a PDF, and do not produce the PDF with a separate word processor.
  • Any plots must be appropriately formatted (axis labels, and legends where appropriate).
  • The Quarto website includes lots of helpful information for generating your report. I recommend that you use the visual editor to make formatting easier. The visual editor has several features that resemble those of a typical word processor.
  • PDF Basics
  • Gallery for advanced Quarto formatting (not necessary, but could be helpful)