Cleaning and exploring data in R

A crucial first step of data analysis is exploring the dataset (and then cleaning, as needed). Today, we will practice exploring and cleaning data in R.

Learning objectives

By the end of this session, students should be able to:

  • distinguish between the different types of variables

  • use the tidyverse functions to explore and clean data in R

  • use ggplot2 to visualize data

Note: If you are comfortable with “base R,” you can use that for data cleaning and exploration this semester. However, I encourage you to learn the tidyverse functions!

Survey

First, complete this survey. We’ll use this as our dataset for today.

Types of variables

Numeric variables take numerical values and it makes sense to perform calculations on the values (e.g., addition, mean)

  • Discrete variables can not take decimal values (e.g., Number of required statistics courses in a major)

  • Continuous variables can take decimal values (e.g., height in cm)

Categorical variables are variables that have categories, where each category is called a level

  • Nominal variables do not have an order (e.g., eye color)

  • Ordinal variables do have an order (e.g., education categories)

Exercise

In your group, discuss the following:

  1. Classify each of the survey questions as discrete, continuous, nominal, or ordinal.
  2. What does it mean to explore data?
  3. What does it mean to clean data? Identify how the survey data may need to be cleaned just by looking at the questions.

First steps in R

Hopefully you have already installed R/RStudio on your computer. If so, you can copy and paste the code below into your own script. If you haven’t yet installed R/RStudio, you can run code directly from this page.



Let’s remove the timestamp variable


Let’s make our variable names more concise (but still descriptive!)


Cleaning and exploring variables

Siblings

Let’s start with the Siblings variable. What information do we need to know to clean the variable?


Always always always:

  • create a new variable instead of overwriting the original

  • perform a quality control check


Now that the variable is clean, let’s explore it more:



Sushi

Now consider the sushi variable.


It’s important to pay attention to implausible values. What would be an implausible value in this case? How should we handle it?


Now let’s generate a plot to visualize the sushi variable


Languages


Often, we need to combine categories if we have too few observations in multiple categories. How should we combine categories in this case?

Exercises

In your group, complete the following:

  1. For the application area of interest variable, how many students responded with “other”? What does this say about the survey design?
  2. Explore, clean, and visualize the remaining variables in the dataset. Note that you may have to look up some functions to help. For example, the “substr” function will be useful for the course excitement variable.