blanks <- c("John", "Jane", "")
nas <- c(NA, "Jane", "Joe")
12 Handling Missing Data
When analysing data, you will often find that some of your data are missing. This chapter covers best practices for handling these situations as well as the technical details of how to remedy the data.
In R, missing data is usually represented by either “NA” or an empty string (“”). Sometimes you can simply ignore these observations; other times you will need to “impute” the missing data, which means choosing a sensible value to use in its place. The three imputation methods covered in this chapter are constant value imputation, central tendency imputation, and multiple imputation.
12.1 Handling NA/Blank Values
This section covers common methods for identifying and isolating missing data. Let’s start by creating a vector with one “” value and a vector with one “NA” value.
print(blanks)
[1] "John" "Jane" ""
print(nas)
[1] NA "Jane" "Joe"
We can use the “is.na” function to identify “NA” values. The following example demonstrates how the function works. It returns a logical vector with “TRUE” or “FALSE” for each observation, indicating whether that observation is an “NA” value.
is.na(nas)
[1] TRUE FALSE FALSE
We can then take this one step further and use the function to filter for “NA” values.
only_nas <- nas[is.na(nas)]
print(only_nas)
[1] NA
This works great; however, it’s more likely that you would want to see the values which aren’t equal to “NA”. This can be accomplished by using the “NOT” operator “!”.
no_nas <- nas[!is.na(nas)]
print(no_nas)
[1] "Jane" "Joe"
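Beyond filtering, it is often useful to know how many values are missing and where they are. The following is a minimal sketch using base R functions on the same “nas” vector; the variable names (n_missing, missing_at, has_missing) are only illustrative. It also shows why “is.na” is needed: comparing against “NA” with “==” does not produce “TRUE” or “FALSE”.

```r
nas <- c(NA, "Jane", "Joe")

# Comparing with == does not detect NA: every result is itself NA
print(nas == NA)

n_missing <- sum(is.na(nas))      # number of missing values
missing_at <- which(is.na(nas))   # positions of the missing values
has_missing <- anyNA(nas)         # TRUE if at least one value is missing

print(n_missing)
print(missing_at)
print(has_missing)
```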
If your missing data is just an empty string (“”) rather than an “NA” value, you can use simple comparison operators to accomplish the same thing.
blanks == ""
[1] FALSE FALSE TRUE
only_blanks <- blanks[blanks == ""]
print(only_blanks)
[1] ""
no_blanks <- blanks[blanks != ""]
print(no_blanks)
[1] "John" "Jane"
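A vector can contain both kinds of missing value at once. One possible approach, sketched below with a hypothetical “mixed” vector, is to convert empty strings to “NA” first so that a single “is.na” check covers both cases. The “%in%” operator is used instead of “==” because it returns “FALSE” rather than “NA” for elements that are already missing.

```r
# Hypothetical vector containing both an empty string and an NA
mixed <- c("John", "", NA, "Joe")

# Convert empty strings to NA; %in% never returns NA, so existing
# missing values do not interfere with the subsetting
cleaned <- mixed
cleaned[cleaned %in% ""] <- NA
print(is.na(cleaned))

# One filter now removes both kinds of missing value
no_missing <- cleaned[!is.na(cleaned)]
print(no_missing)
```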
When working with dataframes rather than just vectors, you can also use the “na.omit” function to remove every row that contains at least one “NA” value.
students <- c("John", "Jane", "Joe")
scores <- c(100, 80, NA)
df <- data.frame(student = students, score = scores)
print(df)
student score
1 John 100
2 Jane 80
3 Joe NA
df <- na.omit(df)
print(df)
student score
1 John 100
2 Jane 80
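Before dropping rows, it can help to see how much data is missing and which rows would be removed. The sketch below uses the base R functions “colSums” and “complete.cases” on the same example data; the variable names (na_per_column, complete_rows, df_complete) are only illustrative.

```r
students <- c("John", "Jane", "Joe")
scores <- c(100, 80, NA)
df <- data.frame(student = students, score = scores)

# Count the missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Logical vector: TRUE for rows that have no missing values
complete_rows <- complete.cases(df)
print(complete_rows)

# Keep only the complete rows (same rows that na.omit would keep)
df_complete <- df[complete_rows, ]
print(df_complete)
```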
12.2 Constant Value Imputation
Many datasets you encounter will be missing data. The temptation may be to disregard these observations immediately; however, it’s important to consider what the missing data represents in the context of your dataset and of what your analysis is hoping to achieve. For example, say you are a teacher trying to determine the average test score of your students. You have a dataset which lists your students’ names along with their respective test scores. However, you find that one of your students has an “NA” value in place of a test score.
students <- c("John", "Jane", "Joe")
scores <- c(100, 80, NA)
df <- data.frame(student = students, score = scores)
print(df)
student score
1 John 100
2 Jane 80
3 Joe NA
Depending on the context, it may make sense for you to ignore this observation prior to calculating the average score. It could also make sense for you to assign a value of “0” to this student’s test score.
Let’s demonstrate how you would replace “NA” values with a constant value of “0”.
df[is.na(df)] <- 0
print(df)
student score
1 John 100
2 Jane 80
3 Joe 0
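The choice between ignoring the observation and imputing a constant changes the result of the analysis. As a small illustration using the same scores, the sketch below compares the average when the missing score is ignored with the average when it is replaced by 0; the variable names are only illustrative.

```r
scores <- c(100, 80, NA)

# Option 1: ignore the missing score when averaging
mean_ignoring <- mean(scores, na.rm = TRUE)
print(mean_ignoring)

# Option 2: impute a constant value of 0, then average
scores_zero <- scores
scores_zero[is.na(scores_zero)] <- 0
mean_with_zero <- mean(scores_zero)
print(mean_with_zero)
```

Here the first option averages two scores (100 and 80) while the second averages three (100, 80, and 0), so which one is appropriate depends on what the missing score actually means.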
12.3 Central Tendency Imputation
Two of the most common measures of central tendency are “mean” and “median”. Suppose you have a dataset that tracks the time employees spend performing a certain task. After review, you realize that several employees have not historically tracked their time. Instead of just ignoring these entries, you decide to try imputing these values.
employees <- c("John", "Jane", "Joe", "Janet")
hours_spent <- c(12, 14, NA, 9)
df <- data.frame(employee = employees, hours_spent = hours_spent)
print(df)
employee hours_spent
1 John 12
2 Jane 14
3 Joe NA
4 Janet 9
The following example demonstrates how you can replace missing values with an average of the rest of the employees’ time spent.
mean_value <- mean(df$hours_spent[!is.na(df$hours_spent)])
print(mean_value)
[1] 11.66667
df$hours_spent[is.na(df$hours_spent)] <- mean_value
print(df)
employee hours_spent
1 John 12.00000
2 Jane 14.00000
3 Joe 11.66667
4 Janet 9.00000
Alternatively, we can reset our dataframe and replace “NA” values with the median value by doing the following.
# RESET DATAFRAME
df$hours_spent <- hours_spent

# SET MISSING VALUES TO MEDIAN
median_value <- median(df$hours_spent[!is.na(df$hours_spent)])
print(median_value)
[1] 12
df$hours_spent[is.na(df$hours_spent)] <- median_value
print(df)
employee hours_spent
1 John 12
2 Jane 14
3 Joe 12
4 Janet 9
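Since the mean and median versions differ only in the summary function, the two steps can be wrapped in a small helper. The following is a minimal sketch; “impute_central” is a hypothetical function name, not part of base R, and it relies on the “na.rm = TRUE” argument that both “mean” and “median” accept.

```r
# Hypothetical helper: replace NA values in a numeric vector with a
# measure of central tendency computed from the non-missing values
impute_central <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

hours_spent <- c(12, 14, NA, 9)

mean_imputed <- impute_central(hours_spent)
print(mean_imputed)

median_imputed <- impute_central(hours_spent, fun = median)
print(median_imputed)
```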
12.4 Multiple Imputation
The two previous examples are types of “single value imputation”, as both took one value and applied it to every missing value in the dataset. A more flexible approach is to fit a model that predicts each missing value from the other variables. Strictly speaking, multiple imputation goes a step further: it generates several plausible completed datasets from such a model, analyses each one, and pools the results. In the following example we demonstrate the model-based building block by using a simple linear regression model to fill in the missing values once.
We’ll begin by creating a dataframe with both an “x” and a “y” variable.
y <- c(10, 8, NA, 9, 4, NA)
x <- c(8, 6, 9, 7, 2, 12)
df <- data.frame(y = y, x = x)
print(df)
y x
1 10 8
2 8 6
3 NA 9
4 9 7
5 4 2
6 NA 12
Next, let’s use the “lm” function to create a linear model and then print out a summary of that model.
model <- lm(y ~ x)
summary(model)
Warning in summary.lm(model): essentially perfect fit: summary may be unreliable
Call:
lm(formula = y ~ x)
Residuals:
1 2 4 5
0 0 0 0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2 0 Inf <2e-16 ***
x 1 0 Inf <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0 on 2 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 2 DF, p-value: < 2.2e-16
The summary shows a perfect fit, which is why R prints a warning: the example data were constructed so that y is exactly x + 2, giving an intercept of 2 and a slope of 1. Real data will rarely fit this cleanly. Let’s now use the model to predict our missing values.
imputed <- predict(model, newdata = list(x = df$x[is.na(df$y)]))
df$y[is.na(df$y)] <- imputed
print(df)
y x
1 10 8
2 8 6
3 11 9
4 9 7
5 4 2
6 14 12
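Multiple imputation in its usual sense repeats the imputation several times with random variation and then pools the results, so that the uncertainty about the missing values is reflected in the analysis. The following is a simplified base R sketch of that idea, not a full implementation: the data, the seed, and the choice of five imputations are all assumptions made for illustration, and the noise is drawn only from the residual standard error of a single fitted model. A proper procedure would also account for uncertainty in the model coefficients and would pool variances as well as estimates; dedicated packages such as mice are designed for that.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Hypothetical data: y is roughly x + 2 with noise; two y values are missing
x <- c(8, 6, 9, 7, 2, 12, 5, 10)
y <- c(10.3, 7.6, NA, 9.4, 3.8, NA, 7.1, 11.9)
df <- data.frame(y = y, x = x)

# Fit the imputation model on the complete rows (lm drops NA rows by default)
model <- lm(y ~ x, data = df)
missing <- is.na(df$y)
sigma_hat <- summary(model)$sigma  # residual standard error

m <- 5  # number of imputed datasets (arbitrary choice)
estimates <- numeric(m)

for (i in seq_len(m)) {
  completed <- df
  # Prediction plus random noise, so each completed dataset differs
  pred <- predict(model, newdata = completed[missing, , drop = FALSE])
  completed$y[missing] <- pred + rnorm(sum(missing), mean = 0, sd = sigma_hat)
  # The analysis of interest here is simply the mean of y
  estimates[i] <- mean(completed$y)
}

# Pool the per-dataset estimates by averaging them
pooled_mean <- mean(estimates)
print(estimates)
print(pooled_mean)
```

The spread of the values in “estimates” gives a rough sense of how much the conclusion depends on the imputed values, which is information a single imputation cannot provide.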
12.5 Resources
- “Missing-data Imputation” from Columbia: http://www.stat.columbia.edu/~gelman/arm/missing.pdf