Chapter 8 Data Wrangling with dplyr
Packages Required:
* dplyr - a package for working with spatial data
8.1 A Short Introduction to dplyr
In the last two chapters, we introduced several R functions that can be used to work with data. These included functions such as unique
, names
, str
, summary
, aggregate
, and others. These are “base” R functions, and knowing a handful of common functions will serve you well.
Another set of functions for data wrangling are contained in the package, dplyr. This package is part of the tidyverse. This chapter will be a very, very short introduction to dplyr and its coding structure (which looks different than the coding structure we’ve used up to this point).
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
The dplyr package has its own website, and a full tutorial is provided in Chapter 5 of the ebook, R for Data Science by Garrett Grolemund and Hadley Wickham.
Here’s a snip from the dplyr website:
"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
- mutate() adds new variables that are functions of existing variables
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”). If you are new to dplyr, the best place to start is the data transformation chapter in R for data science."
In other words, the package contains functions with names that sound like verbs (e.g., mutate
, select
, etc.), and together these functions can help with most of your data wrangling needs.
If you haven’t installed dplyr do so now:
We are huge fans of well-written vignettes and helpfiles, and one of the reasons why dplyr has caught on is because the authors have taken time to help users understand how to use their package. The examples in this chapter are from the dplyr website.
Before we get into examples of dplyr in action, let us introduce you to a built-in dataset called starwars.
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
## 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
## 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
## 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
## 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
## 5 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]>
## 6 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]>
The head
function returns a tibble with 6 rows and 14 columns. What is a tibble? This is an object from yet another tidyverse package called tibble. Tibbles are also described in the book, R for Data Science, where we read:
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.
When a tibble is printed, you can see the column names, but you can also quickly see the datatypes of each column. As you can see, the full dataset contains information about characters in the Star Wars series.
Here’s an example of how to use the dplyr functions (from the dplyr website):
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
## 3 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]>
## 4 IG-88 200 140 none metal red 15 none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine <NA> Droid <chr [2]> <chr [0]> <chr [0]>
## 6 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
Now, what’s with the code itself? The code says “take the built in dataset (tibble) called starwars. Then use the filter
function to filter the dataset such that the column species == ‘Droid’”.
The symbol %>% is called a “pipe”, and can be read “then”. You typically start with the dataframe or tibble, and then use the dplyr verbs to wrangle your data. In this way, dplyr allows the code to be written as a series of chained commands, ending up with code that is more readable in some cases than base R code. That is what is meant by “a grammar of data manipulation”.
We can achieve this same code in base R commands, of course, with the subset
function.
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
## 3 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]>
## 4 IG-88 200 140 none metal red 15 none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine <NA> Droid <chr [2]> <chr [0]> <chr [0]>
## 6 BB8 NA NA none none black NA none masculine <NA> Droid <chr [1]> <chr [0]> <chr [0]>
But the ability to chain the steps together with a pipe operator provides a readable story of how the data were wrangled. The pipe is technically called the forward pipe operator, and comes from another tidyverse package called magrittr. From this website, we see that "The magrittr package offers a set of operators which make your code more readable by:
- structuring sequences of data operations left-to-right (as opposed to from the inside and out),
- avoiding nested function calls,
- minimizing the need for local variables and function definitions, and
- making it easy to add steps anywhere in the sequence of operations.
The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator."
An introduction to pipes can be found in the R for Data Science pipes chapter. Let’s see this chaining in action, again borrowing an example from the dplyr website:
## # A tibble: 87 x 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 26.0
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 Darth Vader 202 136 33.3
## 5 Leia Organa 150 49 21.8
## 6 Owen Lars 178 120 37.9
## 7 Beru Whitesun lars 165 75 27.5
## 8 R5-D4 97 32 34.0
## 9 Biggs Darklighter 183 84 25.1
## 10 Obi-Wan Kenobi 182 77 23.2
## # ... with 77 more rows
Here, we start with the built-in tibble called starwars. Then we use the mutate
function to mutate (change) the dataset by adding new variables and preserving existing ones. Specifically, we create a column called bmi, and calculate it as mass / ((height / 100) ^ 2)). Then we select the columns name through mass, and the column bmi as well.
Make sure to read the helpfiles for each function!
If you choose the pipe operator, you’ll want to check each step along the pipe to make sure the code returns what you think it should be returning. For example, to test the first step of the code, run it until you hit the second pipe, like this:
## # A tibble: 87 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships bmi
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> <dbl>
## 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2~ <chr [2]> 26.0
## 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0~ <chr [0]> 26.9
## 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0~ <chr [0]> 34.7
## 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0~ <chr [1]> 33.3
## 5 Leia Organa 150 49 brown light brown 19 female feminine Alderaan Human <chr [5]> <chr [1~ <chr [0]> 21.8
## 6 Owen Lars 178 120 brown, grey light blue 52 male masculine Tatooine Human <chr [3]> <chr [0~ <chr [0]> 37.9
## 7 Beru Whitesun la~ 165 75 brown light blue 47 female feminine Tatooine Human <chr [3]> <chr [0~ <chr [0]> 27.5
## 8 R5-D4 97 32 <NA> white, red red NA none masculine Tatooine Droid <chr [1]> <chr [0~ <chr [0]> 34.0
## 9 Biggs Darklighter 183 84 black light brown 24 male masculine Tatooine Human <chr [1]> <chr [0~ <chr [1]> 25.1
## 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human <chr [6]> <chr [1~ <chr [5]> 23.2
## # ... with 77 more rows
Now the dataframe has 15 columns (the new column bmi has been added).
Once we’re sure that the mutate
function returned what we wanted, we can move to the next bit of code in the pipeline:
## # A tibble: 87 x 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 26.0
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 Darth Vader 202 136 33.3
## 5 Leia Organa 150 49 21.8
## 6 Owen Lars 178 120 37.9
## 7 Beru Whitesun lars 165 75 27.5
## 8 R5-D4 97 32 34.0
## 9 Biggs Darklighter 183 84 25.1
## 10 Obi-Wan Kenobi 182 77 23.2
## # ... with 77 more rows
This addition now uses the selects specific columns with the select
function, in particular the columns name through mass, and then the bmi column.
Note that the functions automatically return the first 10 lines of the dataset as a tibble. Of could of course be done with base R functions, but the code is so clear and compact that it is a compelling choice of coding style.
Let’s look at another example from the dplyr website. The arrange
function provides a convenient way for arranging your data – moving columns around or sorting. For example, in the code below, you start with the starwars dataframe, and then arrange the dataset by descending mass:
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
## 1 Jabba Desiliji~ 175 1358 <NA> green-tan, br~ orange 600 hermaphrod~ masculi~ Nal Hutta Hutt <chr [~ <chr [0~ <chr [0]>
## 2 Grievous 216 159 none brown, white green, yell~ NA male masculi~ Kalee Kaleesh <chr [~ <chr [1~ <chr [1]>
## 3 IG-88 200 140 none metal red 15 none masculi~ <NA> Droid <chr [~ <chr [0~ <chr [0]>
## 4 Darth Vader 202 136 none white yellow 41.9 male masculi~ Tatooine Human <chr [~ <chr [0~ <chr [1]>
## 5 Tarfful 234 136 brown brown blue NA male masculi~ Kashyyyk Wookiee <chr [~ <chr [0~ <chr [0]>
## 6 Owen Lars 178 120 brown, grey light blue 52 male masculi~ Tatooine Human <chr [~ <chr [0~ <chr [0]>
## 7 Bossk 190 113 none green red 53 male masculi~ Trandosha Trandos~ <chr [~ <chr [0~ <chr [0]>
## 8 Chewbacca 228 112 brown unknown blue 200 male masculi~ Kashyyyk Wookiee <chr [~ <chr [1~ <chr [2]>
## 9 Jek Tono Porki~ 180 110 brown fair blue NA male masculi~ Bestine IV Human <chr [~ <chr [0~ <chr [1]>
## 10 Dexter Jettster 198 102 none brown yellow NA male masculi~ Ojom Besalisk <chr [~ <chr [0~ <chr [0]>
## # ... with 77 more rows
This, too, can be done with base R functions (we introduced order
in the last chapter), but again the pipe operator is what makes the steps more “story-like”.
The summarize
function is particularly useful and flexible. The code below says “Take the starwars tibble, then group by the column species. Then (for each group member), summarise the data as follows: create a column called n that will store sample size (computed as n()), create a column called mass (computed as the mean of mass, removing NA values). Then filter out records where n >1 and mass >50.”
starwars %>%
group_by(species) %>%
summarise(
n = n(),
mass = mean(mass, na.rm = TRUE)
) %>%
filter(
n > 1,
mass > 50
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 3
## species n mass
## <chr> <int> <dbl>
## 1 Droid 6 69.8
## 2 Gungan 3 74
## 3 Human 35 82.8
## 4 Kaminoan 2 88
## 5 Mirialan 2 53.1
## 6 Twi'lek 2 55
## 7 Wookiee 2 124
## 8 Zabrak 2 80
That’s a short introduction to dplyr. We certainly can’t beat the excellent tutorials that the package authors have provided to users, so we’ll call this a brief introduction and move on. If you code with other people, it is a good idea to have at least some idea of the pipe structure so that you can read other people’s code. Many packages have employed the pipe structure too, so we thought a short introduction to dplyr would be useful. A cheatsheet of the dplyr functions can be downloaded here.
What’s next? Now that we have finished our data wrangling chapters, we can work with our clean dataset and do some bonefide analyses.