Chapter 8 Data Wrangling with dplyr

Packages Required:

* dplyr - a package for working with spatial data

8.1 A Short Introduction to dplyr

In the last two chapters, we introduced several R functions that can be used to work with data. These included functions such as unique, names, str, summary, aggregate, and others. These are “base” R functions, and knowing a handful of common functions will serve you well.

Another set of functions for data wrangling are contained in the package, dplyr. This package is part of the tidyverse. This chapter will be a very, very short introduction to dplyr and its coding structure (which looks different than the coding structure we’ve used up to this point).

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The dplyr package has its own website, and a full tutorial is provided in Chapter 5 of the ebook, R for Data Science by Garrett Grolemund and Hadley Wickham.

Here’s a snip from the dplyr website:

"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”). If you are new to dplyr, the best place to start is the data transformation chapter in R for data science."

In other words, the package contains functions with names that sound like verbs (e.g., mutate, select, etc.), and together these functions can help with most of your data wrangling needs.

If you haven’t installed dplyr do so now:

install.packages("dplyr")
library(dplyr)

We are huge fans of well-written vignettes and helpfiles, and one of the reasons why dplyr has caught on is because the authors have taken time to help users understand how to use their package. The examples in this chapter are from the dplyr website.

Before we get into examples of dplyr in action, let us introduce you to a built-in dataset called starwars.

View(starwars)

head(starwars)

## # A tibble: 6 x 14
##   name           height  mass hair_color  skin_color  eye_color birth_year sex    gender    homeworld species films     vehicles  starships
##   <chr>           <int> <dbl> <chr>       <chr>       <chr>          <dbl> <chr>  <chr>     <chr>     <chr>   <list>    <list>    <list>   
## 1 Luke Skywalker    172    77 blond       fair        blue            19   male   masculine Tatooine  Human   <chr [5]> <chr [2]> <chr [2]>
## 2 C-3PO             167    75 <NA>        gold        yellow         112   none   masculine Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
## 3 R2-D2              96    32 <NA>        white, blue red             33   none   masculine Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
## 4 Darth Vader       202   136 none        white       yellow          41.9 male   masculine Tatooine  Human   <chr [4]> <chr [0]> <chr [1]>
## 5 Leia Organa       150    49 brown       light       brown           19   female feminine  Alderaan  Human   <chr [5]> <chr [1]> <chr [0]>
## 6 Owen Lars         178   120 brown, grey light       blue            52   male   masculine Tatooine  Human   <chr [3]> <chr [0]> <chr [0]>

The head function returns a tibble with 6 rows and 14 columns. What is a tibble? This is an object from yet another tidyverse package called tibble. Tibbles are also described in the book, R for Data Science, where we read:

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.

When a tibble is printed, you can see the column names, but you can also quickly see the datatypes of each column. As you can see, the full dataset contains information about characters in the Star Wars series.

Here’s an example of how to use the dplyr functions (from the dplyr website):

starwars %>%
  filter(species == "Droid")

## # A tibble: 6 x 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender    homeworld species films     vehicles  starships
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>     <chr>     <chr>   <list>    <list>    <list>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculine Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculine Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculine Tatooine  Droid   <chr [1]> <chr [0]> <chr [0]>
## 4 IG-88     200   140 none       metal       red               15 none  masculine <NA>      Droid   <chr [1]> <chr [0]> <chr [0]>
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine  <NA>      Droid   <chr [2]> <chr [0]> <chr [0]>
## 6 BB8        NA    NA none       none        black             NA none  masculine <NA>      Droid   <chr [1]> <chr [0]> <chr [0]>

Now, what’s with the code itself? The code says “take the built in dataset (tibble) called starwars. Then use the filter function to filter the dataset such that the column species == ‘Droid’”.

The symbol %>% is called a “pipe”, and can be read “then”. You typically start with the dataframe or tibble, and then use the dplyr verbs to wrangle your data. In this way, dplyr allows the code to be written as a series of chained commands, ending up with code that is more readable in some cases than base R code. That is what is meant by “a grammar of data manipulation”.

We can achieve this same code in base R commands, of course, with the subset function.

subset(starwars, species == "Droid")

## # A tibble: 6 x 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender    homeworld species films     vehicles  starships
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>     <chr>     <chr>   <list>    <list>    <list>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculine Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculine Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculine Tatooine  Droid   <chr [1]> <chr [0]> <chr [0]>
## 4 IG-88     200   140 none       metal       red               15 none  masculine <NA>      Droid   <chr [1]> <chr [0]> <chr [0]>
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine  <NA>      Droid   <chr [2]> <chr [0]> <chr [0]>
## 6 BB8        NA    NA none       none        black             NA none  masculine <NA>      Droid   <chr [1]> <chr [0]> <chr [0]>

But the ability to chain the steps together with a pipe operator provides a readable story of how the data were wrangled. The pipe is technically called the forward pipe operator, and comes from another tidyverse package called magrittr. From this website, we see that "The magrittr package offers a set of operators which make your code more readable by:

structuring sequences of data operations left-to-right (as opposed to from the inside and out),
avoiding nested function calls,
minimizing the need for local variables and function definitions, and
making it easy to add steps anywhere in the sequence of operations.

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator."

An introduction to pipes can be found in the R for Data Science pipes chapter. Let’s see this chaining in action, again borrowing an example from the dplyr website:

starwars %>%
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

## # A tibble: 87 x 4
##    name               height  mass   bmi
##    <chr>               <int> <dbl> <dbl>
##  1 Luke Skywalker        172    77  26.0
##  2 C-3PO                 167    75  26.9
##  3 R2-D2                  96    32  34.7
##  4 Darth Vader           202   136  33.3
##  5 Leia Organa           150    49  21.8
##  6 Owen Lars             178   120  37.9
##  7 Beru Whitesun lars    165    75  27.5
##  8 R5-D4                  97    32  34.0
##  9 Biggs Darklighter     183    84  25.1
## 10 Obi-Wan Kenobi        182    77  23.2
## # ... with 77 more rows

Here, we start with the built-in tibble called starwars. Then we use the mutate function to mutate (change) the dataset by adding new variables and preserving existing ones. Specifically, we create a column called bmi, and calculate it as mass / ((height / 100) ^ 2)). Then we select the columns name through mass, and the column bmi as well.

Make sure to read the helpfiles for each function!

help(mutate)
help(select)

If you choose the pipe operator, you’ll want to check each step along the pipe to make sure the code returns what you think it should be returning. For example, to test the first step of the code, run it until you hit the second pipe, like this:

starwars %>%
  mutate(name, bmi = mass / ((height / 100)  ^ 2))

## # A tibble: 87 x 15
##    name              height  mass hair_color    skin_color  eye_color birth_year sex    gender    homeworld species films     vehicles starships   bmi
##    <chr>              <int> <dbl> <chr>         <chr>       <chr>          <dbl> <chr>  <chr>     <chr>     <chr>   <list>    <list>   <list>    <dbl>
##  1 Luke Skywalker       172    77 blond         fair        blue            19   male   masculine Tatooine  Human   <chr [5]> <chr [2~ <chr [2]>  26.0
##  2 C-3PO                167    75 <NA>          gold        yellow         112   none   masculine Tatooine  Droid   <chr [6]> <chr [0~ <chr [0]>  26.9
##  3 R2-D2                 96    32 <NA>          white, blue red             33   none   masculine Naboo     Droid   <chr [7]> <chr [0~ <chr [0]>  34.7
##  4 Darth Vader          202   136 none          white       yellow          41.9 male   masculine Tatooine  Human   <chr [4]> <chr [0~ <chr [1]>  33.3
##  5 Leia Organa          150    49 brown         light       brown           19   female feminine  Alderaan  Human   <chr [5]> <chr [1~ <chr [0]>  21.8
##  6 Owen Lars            178   120 brown, grey   light       blue            52   male   masculine Tatooine  Human   <chr [3]> <chr [0~ <chr [0]>  37.9
##  7 Beru Whitesun la~    165    75 brown         light       blue            47   female feminine  Tatooine  Human   <chr [3]> <chr [0~ <chr [0]>  27.5
##  8 R5-D4                 97    32 <NA>          white, red  red             NA   none   masculine Tatooine  Droid   <chr [1]> <chr [0~ <chr [0]>  34.0
##  9 Biggs Darklighter    183    84 black         light       brown           24   male   masculine Tatooine  Human   <chr [1]> <chr [0~ <chr [1]>  25.1
## 10 Obi-Wan Kenobi       182    77 auburn, white fair        blue-gray       57   male   masculine Stewjon   Human   <chr [6]> <chr [1~ <chr [5]>  23.2
## # ... with 77 more rows

Now the dataframe has 15 columns (the new column bmi has been added).

Once we’re sure that the mutate function returned what we wanted, we can move to the next bit of code in the pipeline:

starwars %>%
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

## # A tibble: 87 x 4
##    name               height  mass   bmi
##    <chr>               <int> <dbl> <dbl>
##  1 Luke Skywalker        172    77  26.0
##  2 C-3PO                 167    75  26.9
##  3 R2-D2                  96    32  34.7
##  4 Darth Vader           202   136  33.3
##  5 Leia Organa           150    49  21.8
##  6 Owen Lars             178   120  37.9
##  7 Beru Whitesun lars    165    75  27.5
##  8 R5-D4                  97    32  34.0
##  9 Biggs Darklighter     183    84  25.1
## 10 Obi-Wan Kenobi        182    77  23.2
## # ... with 77 more rows

This addition now uses the selects specific columns with the select function, in particular the columns name through mass, and then the bmi column.

Note that the functions automatically return the first 10 lines of the dataset as a tibble. Of could of course be done with base R functions, but the code is so clear and compact that it is a compelling choice of coding style.

Let’s look at another example from the dplyr website. The arrange function provides a convenient way for arranging your data – moving columns around or sorting. For example, in the code below, you start with the starwars dataframe, and then arrange the dataset by descending mass:

starwars %>%
  arrange(desc(mass))

## # A tibble: 87 x 14
##    name            height  mass hair_color  skin_color     eye_color    birth_year sex         gender   homeworld  species  films   vehicles starships
##    <chr>            <int> <dbl> <chr>       <chr>          <chr>             <dbl> <chr>       <chr>    <chr>      <chr>    <list>  <list>   <list>   
##  1 Jabba Desiliji~    175  1358 <NA>        green-tan, br~ orange            600   hermaphrod~ masculi~ Nal Hutta  Hutt     <chr [~ <chr [0~ <chr [0]>
##  2 Grievous           216   159 none        brown, white   green, yell~       NA   male        masculi~ Kalee      Kaleesh  <chr [~ <chr [1~ <chr [1]>
##  3 IG-88              200   140 none        metal          red                15   none        masculi~ <NA>       Droid    <chr [~ <chr [0~ <chr [0]>
##  4 Darth Vader        202   136 none        white          yellow             41.9 male        masculi~ Tatooine   Human    <chr [~ <chr [0~ <chr [1]>
##  5 Tarfful            234   136 brown       brown          blue               NA   male        masculi~ Kashyyyk   Wookiee  <chr [~ <chr [0~ <chr [0]>
##  6 Owen Lars          178   120 brown, grey light          blue               52   male        masculi~ Tatooine   Human    <chr [~ <chr [0~ <chr [0]>
##  7 Bossk              190   113 none        green          red                53   male        masculi~ Trandosha  Trandos~ <chr [~ <chr [0~ <chr [0]>
##  8 Chewbacca          228   112 brown       unknown        blue              200   male        masculi~ Kashyyyk   Wookiee  <chr [~ <chr [1~ <chr [2]>
##  9 Jek Tono Porki~    180   110 brown       fair           blue               NA   male        masculi~ Bestine IV Human    <chr [~ <chr [0~ <chr [1]>
## 10 Dexter Jettster    198   102 none        brown          yellow             NA   male        masculi~ Ojom       Besalisk <chr [~ <chr [0~ <chr [0]>
## # ... with 77 more rows

This, too, can be done with base R functions (we introduced order in the last chapter), but again the pipe operator is what makes the steps more “story-like”.

The summarize function is particularly useful and flexible. The code below says “Take the starwars tibble, then group by the column species. Then (for each group member), summarise the data as follows: create a column called n that will store sample size (computed as n()), create a column called mass (computed as the mean of mass, removing NA values). Then filter out records where n >1 and mass >50.”

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(
    n > 1,
    mass > 50
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 8 x 3
##   species      n  mass
##   <chr>    <int> <dbl>
## 1 Droid        6  69.8
## 2 Gungan       3  74  
## 3 Human       35  82.8
## 4 Kaminoan     2  88  
## 5 Mirialan     2  53.1
## 6 Twi'lek      2  55  
## 7 Wookiee      2 124  
## 8 Zabrak       2  80

That’s a short introduction to dplyr. We certainly can’t beat the excellent tutorials that the package authors have provided to users, so we’ll call this a brief introduction and move on. If you code with other people, it is a good idea to have at least some idea of the pipe structure so that you can read other people’s code. Many packages have employed the pipe structure too, so we thought a short introduction to dplyr would be useful. A cheatsheet of the dplyr functions can be downloaded here.

What’s next? Now that we have finished our data wrangling chapters, we can work with our clean dataset and do some bonefide analyses.