15  Summary Statistics

Summary statistics (also known as descriptive statistics) are usually the starting point for developing insights about a dataset. You will likely hear the phrase “Exploratory Data Analysis” (sometimes abbreviated “EDA”) throughout your career. It refers to the stage where you try to get a high-level understanding of the distributions and relationships within your dataset.

15.1 Quantitative Data

When dealing with continuous data, one of the quickest ways to get a high-level view of your data is the “summary” function. It returns the extreme (minimum and maximum) values, the median, the mean, and the 1st and 3rd quartiles.

summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
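
The “summary” function also accepts a whole data frame, in which case it reports the same six statistics for each numeric column. The following is a minimal sketch; the choice of columns (mpg, hp, and wt) is only an illustration and is not taken from the text above.

```r
# Summarise several numeric columns at once.
# The column selection is illustrative; any numeric columns would work.
cols_summary <- summary(mtcars[, c("mpg", "hp", "wt")])
cols_summary
```

This can be a convenient first look before drilling into a single column.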

Alternatively, you can use the following eight functions to retrieve specific information about your data.

# Returns the average
mean(mtcars$mpg)
[1] 20.09062
# Returns the median
median(mtcars$mpg)
[1] 19.2
# Returns the sample standard deviation
sd(mtcars$mpg)
[1] 6.026948
# Returns the sample variance
var(mtcars$mpg)
[1] 36.3241
# Returns the minimum value
min(mtcars$mpg)
[1] 10.4
# Returns the maximum value
max(mtcars$mpg)
[1] 33.9
# Returns the minimum and maximum values
range(mtcars$mpg)
[1] 10.4 33.9
# Returns quantiles (by default the minimum, the quartiles, and the maximum)
quantile(mtcars$mpg)
    0%    25%    50%    75%   100% 
10.400 15.425 19.200 22.800 33.900 
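
Two details tend to come up when using these functions on real data: missing values and non-default quantiles. The sketch below assumes a small hypothetical vector `x` containing an `NA` (mtcars itself has no missing values) and uses the `na.rm` and `probs` arguments. Treat it as an illustration rather than a required workflow.

```r
# Hypothetical vector with one missing value (not part of mtcars)
x <- c(2, 4, 6, NA, 8)

mean_with_na <- mean(x)                     # NA: the missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)    # 5: the NA is dropped first

# Request specific quantiles instead of the default quartiles
mpg_deciles <- quantile(mtcars$mpg, probs = c(0.1, 0.9))
mpg_deciles
```

The `na.rm = TRUE` argument is also accepted by median, sd, var, min, max, range, and quantile.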

15.2 Qualitative Data

If you’re working with data that is categorical and encoded as a factor, you can view all categories by using the “levels” function.

levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 

To count the number of occurrences of each level, you can use the “table” function.

table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
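
Counts are often easier to compare as proportions. One possible approach, sketched here with base R's `prop.table` function, is to divide each count by the total. The variable names are illustrative.

```r
# Count each species, then convert the counts to proportions
species_counts <- table(iris$Species)
species_props <- prop.table(species_counts)
species_props
```

Because each species appears 50 times out of 150 rows, every proportion here is one third.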

If you need to keep digging for insights, you can summarize your data by category using the “group_by” function covered in the previous chapter.
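
As a rough sketch of that idea, the example below assumes the dplyr package is installed and that “group_by” refers to dplyr's function of that name. The choice of Sepal.Length and the output column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Compute per-species summary statistics for one numeric column
species_means <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length = mean(Sepal.Length),
    sd_sepal_length = sd(Sepal.Length)
  )
species_means
```

The result has one row per species, which pairs the categorical view from “table” with the quantitative functions from the previous section.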

15.3 Resources