11 Data Cleaning

This chapter will cover the basics of cleaning your data including renaming variables, splitting text, replacing values, dropping columns, and dropping rows. These basic actions will be essential to preparing your data prior to developing insights.

11.1 Renaming Variables

Let’s begin by creating a dataset we can use to work through some examples. In our case, we’ll take the first few rows from the “iris” dataset and create a new dataframe called “df”.

df <- head(iris)
print(df)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

Now, let’s change our column names (which contain different properties of iris species) into “snake case”, e.g. all words are lowercase and separated by underscores. We’ll do this through the use of the “colnames” function. In the following example, we are renaming each column individually by specifying what number column to adjust.

colnames(df)[1] <- "sepal_length"
colnames(df)[2] <- "sepal_width"
colnames(df)[3] <- "petal_length"
colnames(df)[4] <- "petal_width"
colnames(df)[5] <- "species"

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

Let’s change the column names again, but use “camel case” this time, e.g. the first word will be lowercase, and all subsequent words will have the first letter capitalized. Instead of using the column number though, this time we’ll use the actual name of the column we want to adjust.

colnames(df)[colnames(df) == "sepal_length"] <- "sepalLength"
colnames(df)[colnames(df) == "sepal_width"] <- "sepalWidth"
colnames(df)[colnames(df) == "petal_length"] <- "petalLength"
colnames(df)[colnames(df) == "petal_width"] <- "petalWidth"

sepalLength	sepalWidth	petalLength	petalWidth	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

Alternatively, you can use the “rename” function from the “dplyr” package.

library(dplyr)
df <- rename(df, "plantSpecies" = "species")

sepalLength	sepalWidth	petalLength	petalWidth	plantSpecies
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

11.2 Splitting Text

If you’ve worked in a spreadsheet application before, you’re likely familiar with the “text-to-columns” tool. This tool allows you to split one column of data into multiple columns based on a delimiter. This same functionality is also achievable in R through functions such as the “separate” function from the “tidyr” library.

To test this function out, let’s first attach the “tidyr” package and then create a test data frame for us to use.

library(tidyr)
df <- data.frame(person = c("John_Doe", "Jane_Doe"))

person
John_Doe
Jane_Doe

We now have a data frame with one column that contains a first name and a last name combined by an underscore. Let’s now split the two names into their own separate columns.

df <- df %>% separate(person, c("first_name", "last_name"), "_")

first_name	last_name
John	Doe
Jane	Doe

Let’s break down what just happened. We first declared that “df” was going to be equal to the output of the function that followed by typing “df <-”. Next we told the separate function that it would be altering the existing dataframe called “df” by typing “df %>%”.

We then gave the separate function three arguments. The first argument was the column we were going to be editing, “person”. The second argument was the names of our two new columns, “first_name” and “last_name”. Finally, the third argument was our desired delimiter, “_“.

11.3 Replace Values

We’ll next go over how you can replace specific values in a dataset. Let’s begin by creating a dataset to work with. The following example will create a dataframe which contains student names and their respective grades on a test.

students <- c("John", "Jane", "Joe", "Janet")
grades <- c(83, 97, 74, 27)
df <- data.frame(student = students, grade = grades)

student	grade
John	83
Jane	97
Joe	74
Janet	27

Now that our dataset is assembled, let’s decide that we’re going to institute a minimum grade of 60. To do this we’re going to need to replace any grade lower than 60 with 60. The following example demonstrates one way you could accomplish that.

df[which(df$"grade" < 60), "grade"] <- 60

student	grade
John	83
Jane	97
Joe	74
Janet	60

11.4 Drop Columns

Let’s use the “mtcars” dataset to demonstrate how to drop columns

df <- head(mtcars)
print(df)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Next, we can either drop columns by specifying the columns we want to keep or by specifying the ones we want to drop. The following example will get rid of the “carb” column by specifying that we want to keep every other column.

df <- subset(df, select = c(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3

Alternatively, let’s try getting rid of the “gear” column directly. We can do this by putting a “-” in front of the “c” function.

df <- subset(df, select = -c(gear))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0

One other way you could drop columns if you wanted to use index numbers rather than column names is demonstrated below.

df <- df[,-c(1,3:7)]

	cyl	vs	am
Mazda RX4	6	0	1
Mazda RX4 Wag	6	0	1
Datsun 710	4	1	1
Hornet 4 Drive	6	1	0
Hornet Sportabout	8	0	0
Valiant	6	1	0

As you can see, we used the square brackets to select a subset of our dataframe and then pasted our values after the comma to declare that we were choosing columns rather than rows. After that we used the “-” symbol to say that we were choosing columns to drop rather than columns to keep. Finally, we chose to drop columns 1 as well as columns 3 through 7.

11.5 Drop Rows

We are also able to drop rows with the same method we just used to drop columns with the difference being that we would place our values in front of the comma rather than after the comma. For example, if we wanted to drop the first two rows (otherwise known as observations) from our previous dataframe, we could do the following.

df <- df[-c(1:2),]

	cyl	vs	am
Datsun 710	4	1	1
Hornet 4 Drive	6	1	0
Hornet Sportabout	8	0	0
Valiant	6	1	0

11.6 Resources

“Separate” function documentation: https://tidyr.tidyverse.org/reference/separate.html