This chapter will focus on sorting, filtering, and grouping your datasets.
Three functions you may use to organize your data are “sort”, “order”, and “rank”. The following examples will go through each one and show you how to use them.
Let’s start by creating a vector to work with.
completed_tasks <- c(5, 9, 3, 2, 7)
print(completed_tasks)
[1] 5 9 3 2 7
Next we’ll sort our data by using the “sort” function. This function returns a sorted copy of your data in ascending order and leaves the original vector unchanged.
sort(completed_tasks)
[1] 2 3 5 7 9
Alternatively, you can set the “decreasing” parameter to “TRUE” to sort your data in descending order.
sort(completed_tasks, decreasing = TRUE)
[1] 9 7 5 3 2
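The “order” function returns the indices that would put your vector in sorted order: the first value is the position of the smallest item, the second is the position of the next smallest, and so on. This function also has a “decreasing” parameter which can be set to “TRUE”.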
order(completed_tasks)
[1] 4 3 1 5 2
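To illustrate how these indices can be used, the short sketch below indexes the vector with the result of “order”, which should reproduce the output of “sort”. The variable names positions and sorted_by_order are introduced here for illustration only.

```r
completed_tasks <- c(5, 9, 3, 2, 7)

# order() gives the positions of the items from smallest to largest
positions <- order(completed_tasks)

# Indexing the vector with those positions rearranges it into sorted order
sorted_by_order <- completed_tasks[positions]
print(sorted_by_order)

# Check that the result matches what sort() returns
identical(sorted_by_order, sort(completed_tasks))
```

The same idea is useful when you want to reorder one vector, or the rows of a data frame, according to the values in another.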
Finally, the “rank” function returns the rank of each item in your vector, where 1 is the smallest value. The ranks are listed in the same order as the original items.
rank(completed_tasks)
[1] 3 5 2 1 4
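One detail worth knowing is how “rank” handles tied values. The sketch below uses a small made-up vector, tied_tasks (not part of this chapter's data), to show the default behavior, where tied items share the average of their ranks, alongside the “ties.method” parameter set to "min".

```r
# Hypothetical vector with a tie: the value 10 appears twice
tied_tasks <- c(10, 20, 10)

# Default: tied items receive the average of the ranks they occupy
default_ranks <- rank(tied_tasks)
print(default_ranks)  # 1.5 3.0 1.5

# ties.method = "min": tied items all receive the lowest of those ranks
min_ranks <- rank(tied_tasks, ties.method = "min")
print(min_ranks)      # 1 3 1
```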
You may have noticed in previous chapters that we’ve used comparison operators to filter our data. Let’s review by filtering out completed tasks greater than or equal to 7, which keeps only the values below 7.
completed_tasks[completed_tasks < 7]
[1] 5 3 2
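It can help to see the intermediate step in this kind of filtering. In the sketch below, the comparison is stored in a variable first (is_below_seven, a name chosen here for illustration), which shows that it produces one TRUE or FALSE per item before any subsetting happens.

```r
completed_tasks <- c(5, 9, 3, 2, 7)

# The comparison is evaluated item by item and yields a logical vector
is_below_seven <- completed_tasks < 7
print(is_below_seven)  # TRUE FALSE TRUE TRUE FALSE

# Indexing with the logical vector keeps only the items marked TRUE
kept_tasks <- completed_tasks[is_below_seven]
print(kept_tasks)      # 5 3 2

# which() reports the positions of the items that passed the test
print(which(is_below_seven))  # 1 3 4
```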
Alternatively, you can use the “filter” function from the “dplyr” library. Let’s use this function with the “iris” dataset to filter out any species other than virginica.
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
library(dplyr)
virginica <- filter(iris, Species == "virginica")
head(virginica)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.3 | 3.3 | 6.0 | 2.5 | virginica |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
7.1 | 3.0 | 5.9 | 2.1 | virginica |
6.3 | 2.9 | 5.6 | 1.8 | virginica |
6.5 | 3.0 | 5.8 | 2.2 | virginica |
7.6 | 3.0 | 6.6 | 2.1 | virginica |
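For comparison, the same rows can be selected without dplyr by applying the comparison-operator approach from earlier in this chapter to a data frame. This is a minimal base R sketch; the names is_virginica and virginica_base are introduced here for illustration.

```r
# iris ships with R, so no extra packages are needed here

# One TRUE or FALSE per row, depending on the species
is_virginica <- iris$Species == "virginica"

# Keep the rows marked TRUE and all columns (note the trailing comma)
virginica_base <- iris[is_virginica, ]

# Count the rows that were kept
print(nrow(virginica_base))  # 50
head(virginica_base)
```

Note that base R indexing keeps the original row labels, so the rows printed here are labeled 101 onward rather than starting at 1.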
One final resource for you to leverage as you organize your data is the “group_by” function from the “dplyr” library.
If we wanted to group the iris dataset by species we might do something similar to the following example.
library(dplyr)
grouped_species <- iris %>% group_by(Species)
Now if we print out our resulting dataset you’ll notice that the “group_by” operation we just performed doesn’t change how the data looks by itself.
head(grouped_species)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
In order to change the structure of our dataset we’ll need to specify how our groups should be treated by combining the “group_by” function with another dplyr “verb” such as “summarise”.
grouped_species <- grouped_species %>% summarise(
  sepal_length = mean(Sepal.Length),
  sepal_width = mean(Sepal.Width),
  petal_length = mean(Petal.Length),
  petal_width = mean(Petal.Width)
)
head(grouped_species)
Species | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
Now each of the three species in the iris dataset has its average sepal length, sepal width, petal length, and petal width displayed.
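As a quick cross-check of one of these averages, the sketch below computes the mean sepal length per species with base R's “tapply” function instead of dplyr. The variable name mean_sepal_length is an illustrative choice.

```r
# Split Sepal.Length by Species and take the mean of each group
mean_sepal_length <- tapply(iris$Sepal.Length, iris$Species, mean)

# Rounded to three decimals to match the table above
print(round(mean_sepal_length, 3))
# setosa 5.006, versicolor 5.936, virginica 6.588
```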
You can find more information about the “group_by” function and other dplyr “verbs” in the resources section below.