Cleaning and exploring data in R

A crucial first step of data analysis is exploring the dataset (and then cleaning, as needed). Today, we will practice exploring and cleaning data in R.

Learning objectives

By the end of this session, students should be able to:

  • distinguish between the different types of variables

  • use the tidyverse functions to explore and clean data in R

  • use ggplot2 to visualize data

Note: If you are comfortable with “base R,” you can use that for data cleaning and exploration this semester. However, I encourage you to learn the tidyverse functions!

Types of variables in our data

Numeric variables take numerical values and it makes sense to perform calculations on the values (e.g., addition, mean)

  • Discrete variables can not take decimal values (e.g., Number of required statistics courses in a major)

  • Continuous variables can take decimal values (e.g., height in cm)

Categorical variables are variables that have categories, where each category is called a level

  • Nominal variables do not have an order (e.g., eye color)

  • Ordinal variables do have an order (e.g., education categories)

Types of variables in R

  • Numeric/double: a number that can take decimal values

  • Integer: a number that takes integer values

  • Character: a string (denoted by quotes)

  • Logical: TRUE/FALSE

Survey

First, complete this survey. We’ll use this as our dataset for some practice exercises.

Exercise

In your group, discuss the following:

  1. Classify each of the survey questions as discrete, continuous, nominal, or ordinal.
  2. Can you identify how each variable will be stored in R before you read in the data?
  3. What does it mean to explore data?
  4. What does it mean to clean data? Identify how the survey data may need to be cleaned just by looking at the questions.

First steps in R

Hopefully you have already installed R/RStudio on your computer. If so, you can copy and paste the code below into your own script. If you haven’t yet installed R/RStudio, you can run code directly from this page.

Let’s remove the timestamp variable

Note: The select function will create a subset by columns, while the filter function will create a subset by rows

Let’s make our variable names more concise (but still descriptive!). Note that R is case-sensitive, so the variable name “Interest” is distinct from “interest”

Cleaning and exploring variables

Siblings (changing a particular value)

Let’s start with the Siblings variable. What information do we need to know to clean the variable?

Always always always:

  • create a new variable instead of overwriting the original

  • perform a quality control check

Here, we were able to find out that the value of “-1” should be 3. We often don’t have access to the original data source, so we would have to make this a missing value. It is always dangerous to make assumptions during the data cleaning process!

Languages (Creating categories)

Often, we need to combine categories if we have too few observations in multiple categories. Let’s say we want to create a new numeric variable that collapses the Languages variable into 3 categories using the following values:

0: 0 languages

1: 1-3 languages

2: 4 or more languages

First, we need to create a numeric version of the Languages variable since we have a value of “10 or more”

We can perform a quality control check by computing the min and max of the numeric variable for each level of the categorical variable. If the minimum and maximum fit within the desired range for each level, then we have created it correctly.

Sushi (outliers/implausible values)

Now consider the sushi variable. Based on our glimpse output, we have integer values. Let’s check the minimum and maximum values.

It’s important to pay attention to implausible values. What would be an implausible value in this case? How should we handle it?

Values below 0 are implausible. 999 is also likely implausible because the respondent would have had to eat sushi more than 33 times each day, on average. Note, however, that values above 30 are not necessarily implausible because the question just asked how many times you’ve eaten sushi, not on how many days you ate sushi. The value 14 could be an outlier, but it is not implausible. We do not remove values just because they are outliers.

For the quality control check, let’s use the summary function. This shows us that the minimum is 0, the maximum is 14, and there are two missing values. This is a base R function, so we need to use the $ symbol to designate the variable instead of using the pipe operator |>

Exercises

In your group, complete the following (one of these is an impossible task! Can you identify which one?):

  1. For the application area of interest variable, how many students responded with “other”? What does this say about the survey design?

20 students responded “Other.” Popular areas of interest including finance and tech were missing, which is poor survey design.

  1. What proportion of the class plays an instrument?

    I meant to show this code in class - my bad! The count function gives us the unique values and the number of respondents for each value. It denotes the number automatically as n, so we can layer a mutate function to calculate the proportion using n/sum(n) and call it prop. Here we see that 45% of the class plays an instrument.

  1. Convert the Age variable to a continuous variable so that we have more information about specific ages.

This is the impossible task! We cannot convert a categorical variable into a continuous variable. Here, we do not know the true ages, we only know the range.

  1. Create a stacked histogram of the Sushi variable for those who would like to live on the beach vs those who would like to live in a cabin in the woods. Change the color of the bars to a new color of your choice. Add a descriptive title and axis labels.
  1. How can we visualize the relationship between application area of interest and course excitement?

We could do a stacked bar chart:

  1. Create a new variable called “Excitement_num” that stores only the numeric value of the Excitement variable (e.g., 1 instead of “1 (I am dreading this)”). Instead of typing out each answer choice, use the “substr” and “as.numeric” functions. You can learn how the substr function works by typing ?substr in the console.