11  Data Cleaning

This chapter will cover the basics of cleaning your data including renaming variables, splitting text, replacing values, dropping columns, and dropping rows. These basic actions will be essential to preparing your data prior to developing insights.

11.1 Renaming Variables

Let’s begin by creating a dataset we can use to work through some examples. In our case, we’ll take the first few rows from the “iris” dataset and create a new dataframe called “df”.

df <- head(iris)
print(df)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Now, let’s change our column names (which contain different properties of iris species) into “snake case”, e.g. all words are lowercase and separated by underscores. We’ll do this through the use of the “colnames” function. In the following example, we are renaming each column individually by specifying what number column to adjust.

colnames(df)[1] <- "sepal_length"
colnames(df)[2] <- "sepal_width"
colnames(df)[3] <- "petal_length"
colnames(df)[4] <- "petal_width"
colnames(df)[5] <- "species"
sepal_length sepal_width petal_length petal_width species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Let’s change the column names again, but use “camel case” this time, e.g. the first word will be lowercase, and all subsequent words will have the first letter capitalized. Instead of using the column number though, this time we’ll use the actual name of the column we want to adjust.

colnames(df)[colnames(df) == "sepal_length"] <- "sepalLength"
colnames(df)[colnames(df) == "sepal_width"] <- "sepalWidth"
colnames(df)[colnames(df) == "petal_length"] <- "petalLength"
colnames(df)[colnames(df) == "petal_width"] <- "petalWidth"
sepalLength sepalWidth petalLength petalWidth species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Alternatively, you can use the “rename” function from the “dplyr” package.

library(dplyr)
df <- rename(df, "plantSpecies" = "species")
sepalLength sepalWidth petalLength petalWidth plantSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

11.2 Splitting Text

If you’ve worked in a spreadsheet application before, you’re likely familiar with the “text-to-columns” tool. This tool allows you to split one column of data into multiple columns based on a delimiter. This same functionality is also achievable in R through functions such as the “separate” function from the “tidyr” library.

To test this function out, let’s first attach the “tidyr” package and then create a test data frame for us to use.

library(tidyr)
df <- data.frame(person = c("John_Doe", "Jane_Doe"))
person
John_Doe
Jane_Doe

We now have a data frame with one column that contains a first name and a last name combined by an underscore. Let’s now split the two names into their own separate columns.

df <- df %>% separate(person, c("first_name", "last_name"), "_")
first_name last_name
John Doe
Jane Doe

Let’s break down what just happened. We first declared that “df” was going to be equal to the output of the function that followed by typing “df <-”. Next we told the separate function that it would be altering the existing dataframe called “df” by typing “df %>%”.

We then gave the separate function three arguments. The first argument was the column we were going to be editing, “person”. The second argument was the names of our two new columns, “first_name” and “last_name”. Finally, the third argument was our desired delimiter, “_“.

11.3 Replace Values

We’ll next go over how you can replace specific values in a dataset. Let’s begin by creating a dataset to work with. The following example will create a dataframe which contains student names and their respective grades on a test.

students <- c("John", "Jane", "Joe", "Janet")
grades <- c(83, 97, 74, 27)
df <- data.frame(student = students, grade = grades)
student grade
John 83
Jane 97
Joe 74
Janet 27

Now that our dataset is assembled, let’s decide that we’re going to institute a minimum grade of 60. To do this we’re going to need to replace any grade lower than 60 with 60. The following example demonstrates one way you could accomplish that.

df[which(df$"grade" < 60), "grade"] <- 60
student grade
John 83
Jane 97
Joe 74
Janet 60

11.4 Drop Columns

Let’s use the “mtcars” dataset to demonstrate how to drop columns

df <- head(mtcars)
print(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Next, we can either drop columns by specifying the columns we want to keep or by specifying the ones we want to drop. The following example will get rid of the “carb” column by specifying that we want to keep every other column.

df <- subset(df, select = c(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear))
mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3

Alternatively, let’s try getting rid of the “gear” column directly. We can do this by putting a “-” in front of the “c” function.

df <- subset(df, select = -c(gear))
mpg cyl disp hp drat wt qsec vs am
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0

One other way you could drop columns if you wanted to use index numbers rather than column names is demonstrated below.

df <- df[,-c(1,3:7)]
cyl vs am
Mazda RX4 6 0 1
Mazda RX4 Wag 6 0 1
Datsun 710 4 1 1
Hornet 4 Drive 6 1 0
Hornet Sportabout 8 0 0
Valiant 6 1 0

As you can see, we used the square brackets to select a subset of our dataframe and then pasted our values after the comma to declare that we were choosing columns rather than rows. After that we used the “-” symbol to say that we were choosing columns to drop rather than columns to keep. Finally, we chose to drop columns 1 as well as columns 3 through 7.

11.5 Drop Rows

We are also able to drop rows with the same method we just used to drop columns with the difference being that we would place our values in front of the comma rather than after the comma. For example, if we wanted to drop the first two rows (otherwise known as observations) from our previous dataframe, we could do the following.

df <- df[-c(1:2),]
cyl vs am
Datsun 710 4 1 1
Hornet 4 Drive 6 1 0
Hornet Sportabout 8 0 0
Valiant 6 1 0

11.6 Resources