plot(mtcars$mpg)
13 Outliers
Outliers are observations that fall outside the expected scope of the dataset. It’s important to identify outliers in your data and determine the necessary treatment for them before moving into the next analysis phase.
For example, it might be necessary to impute values, remove a row, perform sensitivity analysis, or choose analysis methods that are robust in the presence of outliers.
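As a rough illustration of what "robust in the presence of outliers" can mean, the sketch below compares how far the mean and the median move when a single extreme value is present. The vector and the variable names are assumptions chosen for this illustration.

```r
# Illustrative vector (an assumption for this sketch): one extreme value, 99.
with_outlier <- c(1, 4, 7, 9, 2, 6, 3, 99, 4, 2, 7, 8)
without_outlier <- with_outlier[with_outlier != 99]

# The mean shifts substantially when the extreme value is present ...
mean_shift <- mean(with_outlier) - mean(without_outlier)

# ... while the median moves far less.
median_shift <- median(with_outlier) - median(without_outlier)

print(c(mean_shift = mean_shift, median_shift = median_shift))
```

A summary statistic that barely moves, like the median here, is one simple example of a method that tolerates outliers without requiring you to remove them.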
13.1 Finding Outliers Visually
One common first step when looking for outliers is to visualize the dataset so that extreme values can be spotted quickly. This section briefly covers several common visualizations used to identify outliers; each of these plots is explored in more depth later in the book.
13.1.1 Scatter Plot
This is probably the first plot you’ll reach for when trying to visualize outliers in your data. The scatter plot is a great tool to quickly visualize your data at a high level and see if anything major stands out.
Here’s how a scatter plot with an extreme outlier might look.
data <- c(1,4,7,9,2,6,3,99,4,2,7,8)
plot(data)
13.1.2 Box Plot
Another way to quickly visualize outliers is to use the “boxplot” function. This plot will allow you to evaluate outliers in a more systematic way.
boxplot(mtcars$mpg)
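The solid black line represents the median of your dataset. The bottom and top of the “box” represent the first and third quartiles. By default, the “whiskers” extend to the most extreme values that lie within 1.5 times the interquartile range of the box; any values beyond the whiskers are drawn as individual points and are your candidate outliers. When no such points exist, as in this plot, the whiskers end at the minimum and maximum.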
Here’s an example of a box plot with an extreme outlier.
boxplot(data)
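The points that a box plot draws individually can also be retrieved as numbers. The sketch below uses `boxplot.stats()`, the base R function that computes the statistics behind a box plot, on the same example vector. The `coef = 1.5` argument is written out only to make the default multiplier visible.

```r
# Same example vector as in the scatter plot section.
data <- c(1, 4, 7, 9, 2, 6, 3, 99, 4, 2, 7, 8)

# boxplot.stats() returns the five-number summary used to draw the plot
# ($stats) and the values lying beyond the whiskers ($out).
stats <- boxplot.stats(data, coef = 1.5)

flagged <- stats$out
print(flagged)
```

Raising `coef` makes the rule more lenient, so fewer points are flagged; lowering it makes the rule stricter.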
13.1.3 Histogram
Histograms will allow you to see how often values occur within certain buckets.
hist(mtcars$mpg)
Here’s a histogram with data that contains an outlier.
hist(data)
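To see the bucket counts themselves rather than the picture, `hist()` can be called with `plot = FALSE`. The sketch below assumes fixed buckets of width 10, passed through the `breaks` argument. The width is an arbitrary choice for illustration.

```r
# Same example vector as above.
data <- c(1, 4, 7, 9, 2, 6, 3, 99, 4, 2, 7, 8)

# Compute the histogram without drawing it; buckets are (0,10], (10,20], ...
h <- hist(data, breaks = seq(0, 100, by = 10), plot = FALSE)

# One count per bucket: a lone count far from the others hints at an outlier.
bucket_counts <- h$counts
print(bucket_counts)
```

A long run of empty buckets between the bulk of the data and an isolated count is the numeric counterpart of the gap you see in the plot.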
13.1.4 Density Plot
Density plots can be thought of as a smoothed version of a histogram. (You can tune the degree of smoothing, e.g. via the adjust argument to the density() function.)
plot(density(mtcars$mpg))
Here’s an example of a density plot with data that contains an outlier.
plot(density(data))
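As a small sketch of how the `adjust` argument behaves, the code below estimates the density of `mtcars$mpg` twice and compares the bandwidths that `density()` reports. The values 0.5 and 2 are arbitrary choices for illustration.

```r
# adjust multiplies the automatically chosen bandwidth:
# values below 1 give a wigglier curve, values above 1 a smoother one.
rough  <- density(mtcars$mpg, adjust = 0.5)
smooth <- density(mtcars$mpg, adjust = 2)

# The bandwidth actually used is stored in the bw element of the result.
bandwidth_ratio <- smooth$bw / rough$bw
print(bandwidth_ratio)
```

Passing either object to `plot()` draws the corresponding curve, so you can compare how strongly an isolated bump stands out at each level of smoothing.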
13.2 Finding Outliers Statistically
While examining your data visually may be a convenient and sufficient way to detect outliers in your data, sometimes you may require a more rigorous approach to outlier detection.
13.2.1 Standard Deviation
One simple way to check the extremity of your observation is to calculate how many standard deviations it falls from the mean.
Let’s start by calculating the standard deviation of our dataset by using the “sd” function.
sd <- sd(data)
print(sd)
[1] 27.31078
Next, let’s calculate the mean of our dataset.
mean <- mean(data)
print(mean)
[1] 12.66667
Finally, for each record in our vector, let’s calculate how many standard deviations it falls from the mean.
extremity <- abs(data - mean) / sd
print(extremity)
[1] 0.4271817 0.3173350 0.2074883 0.1342571 0.3905661 0.2441038 0.3539506
[8] 3.1611447 0.3173350 0.3905661 0.2074883 0.1708727
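The three steps above can be combined into one expression, and `which()` can then report the positions of the extreme records. This is a minimal sketch; the variable names `z` and `flagged_positions` and the cutoff of 3 are assumptions chosen for illustration.

```r
# Same example vector as above.
data <- c(1, 4, 7, 9, 2, 6, 3, 99, 4, 2, 7, 8)

# Absolute distance from the mean, measured in standard deviations.
z <- abs(data - mean(data)) / sd(data)

# Positions of the records that lie more than 3 standard deviations away.
flagged_positions <- which(z > 3)
print(flagged_positions)
```

Using distinct names such as `z` also avoids reusing the names of the built-in `mean()` and `sd()` functions for variables.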
13.3 Removing Outliers
After identifying your outliers, you have several options for removing them.
Your first option would be to manually remove a specific outlier.
manually_cleaned <- data[data != 99]
print(manually_cleaned)
[1] 1 4 7 9 2 6 3 4 2 7 8
A more general option is to rely on your previously performed calculations and remove any observation that falls three or more standard deviations from the mean.
statistically_cleaned <- data[extremity < 3]
print(statistically_cleaned)
[1] 1 4 7 9 2 6 3 4 2 7 8
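If you expect to repeat this filtering on other vectors, the logic can be wrapped in a small helper. The function below is a hypothetical sketch: the name `remove_outliers` and its `threshold` parameter are not part of base R, and dropping missing values is a design assumption made here.

```r
# Hypothetical helper (not part of base R): keep only the values that lie
# fewer than `threshold` standard deviations from the mean.
remove_outliers <- function(x, threshold = 3) {
  z <- abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  # Missing values have no distance from the mean, so they are dropped too.
  x[!is.na(z) & z < threshold]
}

data <- c(1, 4, 7, 9, 2, 6, 3, 99, 4, 2, 7, 8)
cleaned <- remove_outliers(data)
print(cleaned)
```

Exposing the threshold as a parameter makes it easy to check how sensitive your results are to the cutoff, for example by comparing `remove_outliers(data, 2)` with `remove_outliers(data, 3)`.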
13.4 Resources
“Statistics - Standard Deviation” by W3 Schools: https://www.w3schools.com/statistics/statistics_standard_deviation.php
“Identifying outliers with the 1.5xIQR rule”: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule