14 Organizing Data

This chapter will focus on sorting, filtering, and grouping your datasets.

14.1 Sort, Order, and Rank

Three functions you may use to organize your data are “sort”, “order”, and “rank”. The following examples will go through each one and show you how to use them.

Let’s start by creating a vector to work with.

completed_tasks <- c(5, 9, 3, 2, 7)
print(completed_tasks)
[1] 5 9 3 2 7

Next we’ll sort our data by using the “sort” function. This function returns a sorted copy of your data in ascending order; the original vector is left unchanged.

sort(completed_tasks)
[1] 2 3 5 7 9

Alternatively, you can set the “decreasing” parameter to “TRUE” to sort your data in descending order.

sort(completed_tasks, decreasing = TRUE)
[1] 9 7 5 3 2

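The “order” function returns the positions (indices) of your items in the order that would sort the vector. In the output below, the first value is 4 because the smallest item, 2, sits in position 4 of the original vector. This function also has a “decreasing” parameter which can be set to “TRUE”.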

order(completed_tasks)
[1] 4 3 1 5 2
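
A common use of “order” is to index back into the vector, or into another vector of the same length, to rearrange it. The sketch below assumes a second vector, employee_names, that lines up with completed_tasks; the names are hypothetical and are not part of the original example.

```r
completed_tasks <- c(5, 9, 3, 2, 7)
# Hypothetical names, one per entry in completed_tasks
employee_names <- c("Ana", "Ben", "Cy", "Dee", "Eli")

# Positions of the items from smallest to largest
task_order <- order(completed_tasks)

# Indexing with those positions gives the same result as sort()
sorted_tasks <- completed_tasks[task_order]
print(sorted_tasks)

# The same positions rearrange the names to match,
# from fewest to most completed tasks: Dee, Cy, Ana, Eli, Ben
sorted_names <- employee_names[task_order]
print(sorted_names)
```

This is why “order” is often more useful than “sort” when several vectors, or the columns of a data frame, need to stay aligned.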

Finally, the “rank” function returns the rank of each item in your vector, where 1 is the smallest value. In the output below, the first value is 3 because 5 is the third-smallest item.

rank(completed_tasks)
[1] 3 5 2 1 4
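
The example above has no repeated values, so it may help to see what “rank” does with ties. The following is a small sketch using a hypothetical vector, tied_tasks, in which two items are equal. By default tied items share the average of the ranks they cover; the “ties.method” parameter changes that behavior.

```r
# Hypothetical vector with a tie: two entries are equal to 5
tied_tasks <- c(5, 9, 5, 2)

# Default: tied values share the average of their ranks (2 and 3)
average_ranks <- rank(tied_tasks)
print(average_ranks)

# ties.method = "min" gives tied values the lowest of their ranks
min_ranks <- rank(tied_tasks, ties.method = "min")
print(min_ranks)
```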

14.2 Filtering

You may have noticed in previous chapters that we’ve used comparison operators to filter our data. Let’s review by filtering out completed tasks greater than or equal to 7.

completed_tasks[completed_tasks < 7]
[1] 5 3 2
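
To see why this works, it can help to look at the comparison on its own. The sketch below stores the result of the comparison in a variable, keep, and then combines two conditions; the variable names keep and mid_range are only for illustration.

```r
completed_tasks <- c(5, 9, 3, 2, 7)

# The comparison itself produces one TRUE/FALSE value per item
keep <- completed_tasks < 7
print(keep)

# Indexing with that logical vector keeps only the TRUE positions
print(completed_tasks[keep])

# Conditions can be combined with & (and) or | (or)
mid_range <- completed_tasks[completed_tasks > 2 & completed_tasks < 7]
print(mid_range)
```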

Alternatively, you can use the “filter” function from the “dplyr” package. Let’s use this function with the “iris” dataset to keep only the rows for the virginica species.

head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

library(dplyr)
virginica <- filter(iris, Species == "virginica")
head(virginica)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
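
The “filter” function also accepts more than one condition. As a minimal sketch, the example below keeps virginica rows with a sepal length above 7 and compares the result with a base R equivalent; the cutoff of 7 and the variable names are arbitrary choices for illustration.

```r
library(dplyr)

# Comma-separated conditions must all be TRUE for a row to be kept
long_virginica <- filter(iris, Species == "virginica", Sepal.Length > 7)

# Base R equivalent using a logical index, for comparison
long_virginica_base <- iris[iris$Species == "virginica" & iris$Sepal.Length > 7, ]

# Both approaches keep the same number of rows
print(nrow(long_virginica) == nrow(long_virginica_base))
```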

14.3 Grouping

One final tool you can use as you organize your data is the “group_by” function from the “dplyr” package.

If we wanted to group the iris dataset by species we might do something similar to the following example.

library(dplyr)
grouped_species <- iris %>% group_by(Species)

Now if we print out our resulting dataset you’ll notice that the “group_by” operation we just performed doesn’t change the rows or columns by itself; it only records which group each row belongs to.

head(grouped_species)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
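
Although the rows look the same, the grouping is stored alongside the data and can be inspected. The sketch below uses a separate variable, by_species, so that it does not interfere with the running example, and relies on the dplyr helpers “group_vars”, “n_groups”, and “ungroup”.

```r
library(dplyr)

by_species <- iris %>% group_by(Species)

# Same number of rows and columns as the original data
print(dim(by_species))

# The grouping is kept as extra information on the dataset
print(group_vars(by_species))   # names of the grouping columns
print(n_groups(by_species))     # number of distinct groups

# ungroup() removes the grouping again
plain <- ungroup(by_species)
print(group_vars(plain))
```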

In order to change the structure of our dataset we’ll need to specify how our groups should be treated by combining the “group_by” function with another dplyr “verb” such as “summarise”.

grouped_species <- grouped_species %>% summarise(
    sepal_length = mean(Sepal.Length),
    sepal_width = mean(Sepal.Width),
    petal_length = mean(Petal.Length),
    petal_width = mean(Petal.Width)
)
head(grouped_species)
Species sepal_length sepal_width petal_length petal_width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

Now each of the three species in the iris dataset has its average sepal length, sepal width, petal length, and petal width displayed, one row per species.
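
Summaries are not limited to averages. As a further sketch, the example below counts the rows in each group with “n” and then sorts the summary with “arrange”, another dplyr verb; the column names flowers and longest_petal are made up for this illustration.

```r
library(dplyr)

species_summary <- iris %>%
    group_by(Species) %>%
    summarise(
        flowers = n(),                      # number of rows in each group
        longest_petal = max(Petal.Length)   # largest value in each group
    ) %>%
    arrange(desc(longest_petal))            # largest value first

print(species_summary)
```

Because “summarise” returns one row per group, the result here has three rows, with virginica first.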

You can find more information about the “group_by” function and other dplyr “verbs” in the resources section below.

14.4 Resources