12  Handling Missing Data

You may encounter situations while analysing data that some of your data are missing. This chapter will cover best practices in regards to handling these situations as well as the technical details on how to remedy the data.

Missing data will often be represented by either “NA” or “” in R. Sometimes you will be able to manage by just ignoring this data; however, other times you will need to “impute” the missing data. This just means you end up coming up with a value that makes sense to use in place of the missing data. The three imputation methods we are going to cover in this chapter are constant value imputation, central tendency imputation, and multiple imputation.

12.1 Handling NA/Blank Values

This section will cover common methods and formulas for identifying and isolating missing data. Let’s start by creating a a vector with one “” value and a vector with one “NA” value.

blanks <- c("John", "Jane", "")
nas <- c(NA, "Jane", "Joe")
print(blanks)
[1] "John" "Jane" ""    
print(nas)
[1] NA     "Jane" "Joe" 

We can use the “is.na” function to identify data with “NA” values. The following example demonstrates how the function works. The output ends up being a “TRUE” or “FALSE” to designate whether each observation is an “NA” value.

is.na(nas)
[1]  TRUE FALSE FALSE

We can then take this one step further and use the function to filter for “NA” values.

only_nas <- nas[is.na(nas)]
print(only_nas)
[1] NA

This works great; however, it’s more likely that you would want to see the values which aren’t equal to “NA”. This can be accomplished by using the “NOT” operator “!”.

no_nas <- nas[!is.na(nas)]
print(no_nas)
[1] "Jane" "Joe" 

If your missing data is just an empty string (““) rather than an”NA” value, you can use simple comparison operators to accomplish the same thing.

blanks == ""
[1] FALSE FALSE  TRUE
only_blanks <- blanks[blanks == ""]
print(only_blanks)
[1] ""
no_blanks <- blanks[blanks != ""]
print(no_blanks)
[1] "John" "Jane"

When working with dataframes rather than just vectors, you can also use the “na.omit” function to remove complete rows with “NA” values.

students <- c("John", "Jane", "Joe")
scores <- c(100, 80, NA)
df <- data.frame(student = students, score = scores)
print(df)
  student score
1    John   100
2    Jane    80
3     Joe    NA
df <- na.omit(df)
print(df)
  student score
1    John   100
2    Jane    80

12.2 Constant Value Imputation

Many datasets you encounter will likely be missing data. The temptation may be to immediately disregard these observations; however, it’s important to consider what missing data represents in the context of your dataset as well as the context of what your analysis is hoping to achieve. For example, say you are a teacher and you are trying to determine the average test scores of your students. You have a dataset which lists your students names along with their respective test scores. However, you find that one of your students has an “NA” value in place of a test score.

students <- c("John", "Jane", "Joe")
scores <- c(100, 80, NA)
df <- data.frame(student = students, score = scores)

print(df)
  student score
1    John   100
2    Jane    80
3     Joe    NA

Depending on the context, it may make sense for you to ignore this observation prior to calculating the average score. It could also make sense for you to assign a value of “0” to this student’s test score.

Let’s demonstrate how you would replace “NA” values with a constant value of “0”.

df[is.na(df)] <- 0
print(df)
  student score
1    John   100
2    Jane    80
3     Joe     0

12.3 Central Tendency Imputation

Two of the most common measures of central tendency are “mean” and “median”. Suppose you have a dataset that tracks the time employees spend performing a certain task. After review, you realize that several employees have not historically tracked their time. Instead of just ignoring these entries, you decide to try imputing these values.

employees <- c("John", "Jane", "Joe", "Janet")
hours_spent <- c(12, 14, NA, 9)
df <- data.frame(employee = employees, hours_spent = hours_spent)

print(df)
  employee hours_spent
1     John          12
2     Jane          14
3      Joe          NA
4    Janet           9

The following example demonstrates how you can replace missing values with an average of the rest of the employees’ time spent.

mean_value <- mean(df$hours_spent[!is.na(df$hours_spent)])
print(mean_value)
[1] 11.66667
df$hours_spent[is.na(df$hours_spent)] <- mean_value
print(df)
  employee hours_spent
1     John    12.00000
2     Jane    14.00000
3      Joe    11.66667
4    Janet     9.00000

Alternatively, we can reset our dataframe and replace “NA” values with the median value by doing the following.

# RESET DATAFRAME
df$hours_spent <- hours_spent

# SET MISSING VALUES TO MEDIAN
median_value <- median(df$hours_spent[!is.na(df$hours_spent)])
print(median_value)
[1] 12
df$hours_spent[is.na(df$hours_spent)] <- median_value
print(df)
  employee hours_spent
1     John          12
2     Jane          14
3      Joe          12
4    Janet           9

12.4 Multiple Imputation

The two previous examples are types of “single value imputation” as both examples took one value and applied it to every missing value in the dataset. At a very basic level, multiple imputation requires users to come up with some sort of model to fill in missing values. In the following example we are going to demonstrate how you might use a simple linear regression model to perform multiple imputation.

Note

Linear regression is covered more in-depth later in this book. Don’t worry if this example feels completely unfamiliar at this point.

We’ll begin by creating a dataframe with both an “x” and a “y” variable.

y <- c(10, 8, NA, 9, 4, NA)
x <- c(8, 6, 9, 7, 2, 12)
df <- data.frame(y = y, x = x)

print(df)
   y  x
1 10  8
2  8  6
3 NA  9
4  9  7
5  4  2
6 NA 12

Next, let’s use the “lm” function to create a linear model and then print out a summary of that model.

model <- lm(y ~ x)
summary(model)
Warning in summary.lm(model): essentially perfect fit: summary may be unreliable

Call:
lm(formula = y ~ x)

Residuals:
1 2 4 5 
0 0 0 0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2          0     Inf   <2e-16 ***
x                  1          0     Inf   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0 on 2 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic:   Inf on 1 and 2 DF,  p-value: < 2.2e-16

From the model summary, we can see that we have a model with a high level of statistical significance. Let’s now use the model coefficients to impute our missing values.

imputed <- predict(model, newdata = list(x = df$x[is.na(df$y)]))
df$y[is.na(df$y)] <- imputed
print(df)
   y  x
1 10  8
2  8  6
3 11  9
4  9  7
5  4  2
6 14 12

12.5 Resources