<- head(iris)
df print(df)
11 Data Cleaning
This chapter will cover the basics of cleaning your data including renaming variables, splitting text, replacing values, dropping columns, and dropping rows. These basic actions will be essential to preparing your data prior to developing insights.
11.1 Renaming Variables
Let’s begin by creating a dataset we can use to work through some examples. In our case, we’ll take the first few rows from the “iris” dataset and create a new dataframe called “df”.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Now, let’s change our column names (which contain different properties of iris species) into “snake case”, e.g. all words are lowercase and separated by underscores. We’ll do this through the use of the “colnames” function. In the following example, we are renaming each column individually by specifying what number column to adjust.
colnames(df)[1] <- "sepal_length"
colnames(df)[2] <- "sepal_width"
colnames(df)[3] <- "petal_length"
colnames(df)[4] <- "petal_width"
colnames(df)[5] <- "species"
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Let’s change the column names again, but use “camel case” this time, e.g. the first word will be lowercase, and all subsequent words will have the first letter capitalized. Instead of using the column number though, this time we’ll use the actual name of the column we want to adjust.
colnames(df)[colnames(df) == "sepal_length"] <- "sepalLength"
colnames(df)[colnames(df) == "sepal_width"] <- "sepalWidth"
colnames(df)[colnames(df) == "petal_length"] <- "petalLength"
colnames(df)[colnames(df) == "petal_width"] <- "petalWidth"
sepalLength | sepalWidth | petalLength | petalWidth | species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
Alternatively, you can use the “rename” function from the “dplyr” package.
library(dplyr)
<- rename(df, "plantSpecies" = "species") df
sepalLength | sepalWidth | petalLength | petalWidth | plantSpecies |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
11.2 Splitting Text
If you’ve worked in a spreadsheet application before, you’re likely familiar with the “text-to-columns” tool. This tool allows you to split one column of data into multiple columns based on a delimiter. This same functionality is also achievable in R through functions such as the “separate” function from the “tidyr” library.
To test this function out, let’s first attach the “tidyr” package and then create a test data frame for us to use.
library(tidyr)
<- data.frame(person = c("John_Doe", "Jane_Doe")) df
person |
---|
John_Doe |
Jane_Doe |
We now have a data frame with one column that contains a first name and a last name combined by an underscore. Let’s now split the two names into their own separate columns.
<- df %>% separate(person, c("first_name", "last_name"), "_") df
first_name | last_name |
---|---|
John | Doe |
Jane | Doe |
Let’s break down what just happened. We first declared that “df” was going to be equal to the output of the function that followed by typing “df <-”. Next we told the separate function that it would be altering the existing dataframe called “df” by typing “df %>%”.
We then gave the separate function three arguments. The first argument was the column we were going to be editing, “person”. The second argument was the names of our two new columns, “first_name” and “last_name”. Finally, the third argument was our desired delimiter, “_“.
11.3 Replace Values
We’ll next go over how you can replace specific values in a dataset. Let’s begin by creating a dataset to work with. The following example will create a dataframe which contains student names and their respective grades on a test.
<- c("John", "Jane", "Joe", "Janet")
students <- c(83, 97, 74, 27)
grades <- data.frame(student = students, grade = grades) df
student | grade |
---|---|
John | 83 |
Jane | 97 |
Joe | 74 |
Janet | 27 |
Now that our dataset is assembled, let’s decide that we’re going to institute a minimum grade of 60. To do this we’re going to need to replace any grade lower than 60 with 60. The following example demonstrates one way you could accomplish that.
which(df$"grade" < 60), "grade"] <- 60 df[
student | grade |
---|---|
John | 83 |
Jane | 97 |
Joe | 74 |
Janet | 60 |
11.4 Drop Columns
Let’s use the “mtcars” dataset to demonstrate how to drop columns
<- head(mtcars)
df print(df)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Next, we can either drop columns by specifying the columns we want to keep or by specifying the ones we want to drop. The following example will get rid of the “carb” column by specifying that we want to keep every other column.
<- subset(df, select = c(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear)) df
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | |
---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 |
Alternatively, let’s try getting rid of the “gear” column directly. We can do this by putting a “-” in front of the “c” function.
<- subset(df, select = -c(gear)) df
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | |
---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 |
One other way you could drop columns if you wanted to use index numbers rather than column names is demonstrated below.
<- df[,-c(1,3:7)] df
cyl | vs | am | |
---|---|---|---|
Mazda RX4 | 6 | 0 | 1 |
Mazda RX4 Wag | 6 | 0 | 1 |
Datsun 710 | 4 | 1 | 1 |
Hornet 4 Drive | 6 | 1 | 0 |
Hornet Sportabout | 8 | 0 | 0 |
Valiant | 6 | 1 | 0 |
As you can see, we used the square brackets to select a subset of our dataframe and then pasted our values after the comma to declare that we were choosing columns rather than rows. After that we used the “-” symbol to say that we were choosing columns to drop rather than columns to keep. Finally, we chose to drop columns 1 as well as columns 3 through 7.
11.5 Drop Rows
We are also able to drop rows with the same method we just used to drop columns with the difference being that we would place our values in front of the comma rather than after the comma. For example, if we wanted to drop the first two rows (otherwise known as observations) from our previous dataframe, we could do the following.
<- df[-c(1:2),] df
cyl | vs | am | |
---|---|---|---|
Datsun 710 | 4 | 1 | 1 |
Hornet 4 Drive | 6 | 1 | 0 |
Hornet Sportabout | 8 | 0 | 0 |
Valiant | 6 | 1 | 0 |
11.6 Resources
- “Separate” function documentation: https://tidyr.tidyverse.org/reference/separate.html