Chapter 5 Descriptive statistics and data manipulation

Now that we are familiar with some R objects and know how to import data, it is time to write some code. In this chapter, we are going to compute descriptive statistics for a single dataset, but also for a list of datasets. However, I will not give a list of functions to compute descriptive statistics; if you need a specific function you can find easily in the Help pane in Rstudio or using any modern internet search engine. What I will do is show you a workflow that allows you to compute the descripitive statisics you need fast. R has a lot of built-in functions for descriptive statistics; however, if you want to compute statistics by, say, gender, some more complex manipulations are needed. At least this was true in the past. Nowadays, thanks to the packages from the tidyverse, it is very easy and fast to compute descriptive statistics by any stratifying variable(s). The package we are going to use for this is called dplyr. dplyr contains a lot of functions that make manipulating data and computing descriptive statistics very easy. To make things easier for now, we are going to use example data included with dplyr. So no need to import an external dataset; this does not change anything to the example that we are going to study here; the source of the data does not matter for this. Using dplyr is possible only if the data you are working with is already in a useful shape. When data is more messy, you will need to first manipulate it to bring it a tidy format. For this, we will use tidyr, which is very useful package to reshape data and to do advanced cleaning of your data. All these tidyverse functions are also called verbs. However, before getting to know these verbs, let’s do an analysis using standard, or base R functions. This will be the benchmark against which we are going to measure a {tidyverse} workflow.

5.1 A data exploration exercice using base R

Let’s first load the starwars data set, included in the {dplyr} package:

library(dplyr)
data(starwars)

Let’s first take a look at the data:

head(starwars)

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
## 4 Dart…    202   136 none       white      yellow          41.9 male  
## 5 Leia…    150    49 brown      light      brown           19   female
## 6 Owen…    178   120 brown, gr… light      blue            52   male  
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

This data contains information on Star Wars characters. The first question you have to answer is to find the average height of the characters:

mean(starwars$height)

## [1] NA

Let’s also take a look at the standard deviation:

sd(starwars$height)

## [1] NA

It might be more informative to compute these two statistics by species, so for this, we are going to use aggregate():

aggregate(starwars$height, 
          by = list(Species = starwars$species), 
          mean)

##           Species        x
## 1          Aleena  79.0000
## 2        Besalisk 198.0000
## 3          Cerean 198.0000
## 4        Chagrian 196.0000
## 5        Clawdite 168.0000
## 6           Droid       NA
## 7             Dug 112.0000
## 8            Ewok  88.0000
## 9       Geonosian 183.0000
## 10         Gungan 208.6667
## 11          Human       NA
## 12           Hutt 175.0000
## 13       Iktotchi 188.0000
## 14        Kaleesh 216.0000
## 15       Kaminoan 221.0000
## 16        Kel Dor 188.0000
## 17       Mirialan 168.0000
## 18   Mon Calamari 180.0000
## 19           Muun 191.0000
## 20       Nautolan 196.0000
## 21      Neimodian 191.0000
## 22         Pau'an 206.0000
## 23       Quermian 264.0000
## 24         Rodian 173.0000
## 25        Skakoan 193.0000
## 26      Sullustan 160.0000
## 27     Tholothian 184.0000
## 28        Togruta 178.0000
## 29          Toong 163.0000
## 30      Toydarian 137.0000
## 31     Trandoshan 190.0000
## 32        Twi'lek 179.0000
## 33     Vulptereen  94.0000
## 34        Wookiee 231.0000
## 35          Xexto 122.0000
## 36 Yoda's species  66.0000
## 37         Zabrak 173.0000

Even if you are not familiar with aggregate(), I believe the above lines are quite self-explanatory. You need to provide aggregate() with 3 things; the variable you want to summarize (or only the data frame, if you want to summarize all variables), a list of grouping variables and then the function that will be applied to each subset. You can easily add another grouping variable:

aggregate(starwars$height, 
          by = list(Species = starwars$species,
                    Homeworld = starwars$homeworld), 
          mean)

##         Species      Homeworld        x
## 1         Human       Alderaan 176.3333
## 2        Aleena    Aleen Minor  79.0000
## 3         Human         Bespin 175.0000
## 4         Human     Bestine IV 180.0000
## 5     Neimodian Cato Neimoidia 191.0000
## 6        Cerean          Cerea 198.0000
## 7      Chagrian       Champala 196.0000
## 8         Human      Chandrila 150.0000
## 9         Human   Concord Dawn 183.0000
## 10        Human       Corellia 175.0000
## 11        Human      Coruscant 168.5000
## 12   Tholothian      Coruscant 184.0000
## 13       Zabrak       Dathomir 175.0000
## 14      Kel Dor          Dorin 188.0000
## 15         Ewok          Endor  88.0000
## 16        Human         Eriadu 180.0000
## 17    Geonosian       Geonosis 183.0000
## 18     Nautolan    Glee Anselm 196.0000
## 19        Human     Haruun Kal 188.0000
## 20     Iktotchi        Iktotch 188.0000
## 21       Zabrak       Iridonia 171.0000
## 22      Kaleesh          Kalee 216.0000
## 23        Human         Kamino 183.0000
## 24     Kaminoan         Kamino 221.0000
## 25      Wookiee       Kashyyyk 231.0000
## 26          Dug      Malastare 112.0000
## 27     Mirialan         Mirial 168.0000
## 28 Mon Calamari       Mon Cala 180.0000
## 29         Muun     Muunilinst 191.0000
## 30        Droid          Naboo  96.0000
## 31       Gungan          Naboo 208.6667
## 32        Human          Naboo 168.4000
## 33         Hutt      Nal Hutta 175.0000
## 34     Besalisk           Ojom 198.0000
## 35     Quermian        Quermia 264.0000
## 36       Rodian          Rodia 173.0000
## 37      Twi'lek         Ryloth 179.0000
## 38        Human        Serenno 193.0000
## 39      Togruta          Shili 178.0000
## 40      Skakoan          Skako 193.0000
## 41        Human        Socorro 177.0000
## 42        Human        Stewjon 182.0000
## 43    Sullustan        Sullust 160.0000
## 44        Droid       Tatooine 132.0000
## 45        Human       Tatooine 179.2500
## 46    Toydarian       Toydaria 137.0000
## 47   Trandoshan      Trandosha 190.0000
## 48        Xexto        Troiken 122.0000
## 49        Toong           Tund 163.0000
## 50       Pau'an         Utapau 206.0000
## 51   Vulptereen        Vulpter  94.0000
## 52     Clawdite          Zolan 168.0000

or use another function:

aggregate(starwars$height, 
          by = list(Species = starwars$species), 
          sd)

##           Species         x
## 1          Aleena        NA
## 2        Besalisk        NA
## 3          Cerean        NA
## 4        Chagrian        NA
## 5        Clawdite        NA
## 6           Droid        NA
## 7             Dug        NA
## 8            Ewok        NA
## 9       Geonosian        NA
## 10         Gungan 14.189198
## 11          Human        NA
## 12           Hutt        NA
## 13       Iktotchi        NA
## 14        Kaleesh        NA
## 15       Kaminoan 11.313708
## 16        Kel Dor        NA
## 17       Mirialan  2.828427
## 18   Mon Calamari        NA
## 19           Muun        NA
## 20       Nautolan        NA
## 21      Neimodian        NA
## 22         Pau'an        NA
## 23       Quermian        NA
## 24         Rodian        NA
## 25        Skakoan        NA
## 26      Sullustan        NA
## 27     Tholothian        NA
## 28        Togruta        NA
## 29          Toong        NA
## 30      Toydarian        NA
## 31     Trandoshan        NA
## 32        Twi'lek  1.414214
## 33     Vulptereen        NA
## 34        Wookiee  4.242641
## 35          Xexto        NA
## 36 Yoda's species        NA
## 37         Zabrak  2.828427

aggregate() returns a data.frame object:

class(aggregate(starwars$height, by = list(Species = starwars$species), mean))

## [1] "data.frame"

tapply() is another base R alternative:

tapply(starwars$height, list(starwars$species), mean)

##         Aleena       Besalisk         Cerean       Chagrian       Clawdite 
##        79.0000       198.0000       198.0000       196.0000       168.0000 
##          Droid            Dug           Ewok      Geonosian         Gungan 
##             NA       112.0000        88.0000       183.0000       208.6667 
##          Human           Hutt       Iktotchi        Kaleesh       Kaminoan 
##             NA       175.0000       188.0000       216.0000       221.0000 
##        Kel Dor       Mirialan   Mon Calamari           Muun       Nautolan 
##       188.0000       168.0000       180.0000       191.0000       196.0000 
##      Neimodian         Pau'an       Quermian         Rodian        Skakoan 
##       191.0000       206.0000       264.0000       173.0000       193.0000 
##      Sullustan     Tholothian        Togruta          Toong      Toydarian 
##       160.0000       184.0000       178.0000       163.0000       137.0000 
##     Trandoshan        Twi'lek     Vulptereen        Wookiee          Xexto 
##       190.0000       179.0000        94.0000       231.0000       122.0000 
## Yoda's species         Zabrak 
##        66.0000       173.0000

which returns an array object, which is similar to a vector.

However, tapply() does not work if you want the mean by species for all the variables in the data frame:

tapply(starwars, list(starwars$species), mean)

Error in tapply(starwars, list(starwars$species), mean) : 
  arguments must have same length

In both cases, you can only specify one function. So if you need the average and the standard deviation you have to do it in two steps.

Let’s continue now, by only computing the average height by species, but for males:

starwars_males <- subset(starwars, gender == "male")

aggregate(starwars_males$height, 
          by = list(Species = starwars_males$species), 
          mean)

##           Species        x
## 1          Aleena  79.0000
## 2        Besalisk 198.0000
## 3          Cerean 198.0000
## 4        Chagrian 196.0000
## 5             Dug 112.0000
## 6            Ewok  88.0000
## 7       Geonosian 183.0000
## 8          Gungan 208.6667
## 9           Human       NA
## 10       Iktotchi 188.0000
## 11        Kaleesh 216.0000
## 12       Kaminoan 229.0000
## 13        Kel Dor 188.0000
## 14   Mon Calamari 180.0000
## 15           Muun 191.0000
## 16       Nautolan 196.0000
## 17      Neimodian 191.0000
## 18         Pau'an 206.0000
## 19       Quermian 264.0000
## 20         Rodian 173.0000
## 21        Skakoan 193.0000
## 22      Sullustan 160.0000
## 23          Toong 163.0000
## 24      Toydarian 137.0000
## 25     Trandoshan 190.0000
## 26        Twi'lek 180.0000
## 27     Vulptereen  94.0000
## 28        Wookiee 231.0000
## 29          Xexto 122.0000
## 30 Yoda's species  66.0000
## 31         Zabrak 173.0000

I first use subset() to create a subset of the data in which I only kept males. Then, I rerun the analysis from before again. subset() can also be used to select columns. So if you want the average of the height and mass for males, you could do something like this:

starwars_males_height_mass <- subset(starwars, gender == "male", select = c(height, mass, species))

aggregate(starwars_males_height_mass, 
          by = list(Species = starwars_males_height_mass$species), 
          mean)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##           Species   height mass species
## 1          Aleena  79.0000   15      NA
## 2        Besalisk 198.0000  102      NA
## 3          Cerean 198.0000   82      NA
## 4        Chagrian 196.0000   NA      NA
## 5             Dug 112.0000   40      NA
## 6            Ewok  88.0000   20      NA
## 7       Geonosian 183.0000   80      NA
## 8          Gungan 208.6667   NA      NA
## 9           Human       NA   NA      NA
## 10       Iktotchi 188.0000   NA      NA
## 11        Kaleesh 216.0000  159      NA
## 12       Kaminoan 229.0000   88      NA
## 13        Kel Dor 188.0000   80      NA
## 14   Mon Calamari 180.0000   83      NA
## 15           Muun 191.0000   NA      NA
## 16       Nautolan 196.0000   87      NA
## 17      Neimodian 191.0000   90      NA
## 18         Pau'an 206.0000   80      NA
## 19       Quermian 264.0000   NA      NA
## 20         Rodian 173.0000   74      NA
## 21        Skakoan 193.0000   48      NA
## 22      Sullustan 160.0000   68      NA
## 23          Toong 163.0000   65      NA
## 24      Toydarian 137.0000   NA      NA
## 25     Trandoshan 190.0000  113      NA
## 26        Twi'lek 180.0000   NA      NA
## 27     Vulptereen  94.0000   45      NA
## 28        Wookiee 231.0000  124      NA
## 29          Xexto 122.0000   NA      NA
## 30 Yoda's species  66.0000   17      NA
## 31         Zabrak 173.0000   NA      NA

This is starting to get a bit verbose, but it is quite easy to follow and very powerful. It certainly beats having to write loops to achieve the same thing.

Let’s now consider this new dataset:

survey_data_base

##   id var1 var2 var3
## 1  1  1.0  0.2  0.3
## 2  2  1.4  1.9  4.1
## 3  3  0.1  2.8  8.9
## 4  4  1.7  1.9  7.6

I will explain later where this comes from. Depending on what you want to do with this data, it is not in the right shape. So let’s reshape it, using the aptly-called reshape() command:

survey_data_long <- reshape(survey_data_base, 
        varying = list(2:4), v.names = "variable", direction = "long")

We can now easily compute the average of variable for each id:

aggregate(survey_data_long$variable,
          by = list(Id = survey_data_long$id),
          mean)

##   Id        x
## 1  1 0.500000
## 2  2 2.466667
## 3  3 3.933333
## 4  4 3.733333

There is also the possiblity to merge two datasets with merge(). I won’t go into that however. As you can see, R comes with very powerful functions right out of the box, ready to use. When I was studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do to all this using loops (not reshaping, thankfully), which was not so easy. Now that we have seen how base R works, let’s redo the analysis using {tidyverse} verbs. But before deep diving into the {tidyverse}, let’s take a moment to discuss about our lord and saviour, %>%.

5.2 Smoking is bad for you, but pipes are your friend

The title of this section might sound weird at first, but by the end of it, you’ll get this (terrible) pun.

You probably know the following painting by René Magritte, La trahison des images:

It turns out there’s an R package from the tidyverse that is called magrittr. What does this package do? It brings pipes to R. Pipes are a concept from the Unix operating system; if you’re using a GNU+Linux distribution or macOS, you’re basically using a modern unix (that’s an oversimplification, but I’m an economist by training, and outrageously oversimplifying things is what we do, deal with it).

The idea of pipes is to take the output of a command, and feed it as the input of another command. The magrittr package brings pipes to R, by using the weird looking %>%. Try the following:

library(magrittr)

16 %>% sqrt

## [1] 4

This looks quite weird, but you probably understand what happened; 16 got fed as the first argument of the function sqrt(). You can chain multiple functions:

16 %>% sqrt %>% `+`(18)

## [1] 22

The output of 16 (16) got fed to sqrt(), and the output of sqrt(16) (4) got fed to +(18) (22). Without %>% you’d write the line just above like this:

sqrt(16) + 18

## [1] 22

It might not be very clear right now why this is useful, but the %>% is probably one of the most useful infix operators, because when using packages from the tidyverse, you will naturally want to chain a lot of functions together. Without the %>% it would become messy very fast.

%>% is not the only pipe operator in magrittr. There’s %T%, %<>% and %$%. All have their uses, but are basically shortcuts to some common tasks with %>% plus another function. Which means that you can live without them, and because of this, I will not discuss them.

5.3 The `{tidyverse}`’s enfant prodige: `{dplyr}`

The best way to get started with the tidyverse packages is to get to know {dplyr}. {dplyr} prodives a lot of very useful functions that makes it very easy to get discriptive statistics or add new columns to your data.

5.3.1 A first taste of data manipulation with `{dplyr}`

This section will walk you through a typical analysis using {dplyr} funcitons. Just go with it; I will give more details in the next sections.

First, let’s load dplyr and the included starwars dataset. Let’s also take a look at the first 5 lines of the dataset:

library(dplyr)

data(starwars)

head(starwars)

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
## 4 Dart…    202   136 none       white      yellow          41.9 male  
## 5 Leia…    150    49 brown      light      brown           19   female
## 6 Owen…    178   120 brown, gr… light      blue            52   male  
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

data(starwars) loads the example dataset called starwars that is included in the package dplyr. As I said earlier, this is just an example; you could have loaded an external dataset, from a .csv file for instance. This does not matter for what comes next.

R includes a lot of functions for descriptive statistics, such as mean(), sd(), cov(), and many more. What dplyr brings to the table (among other niceties) is the possibility to apply these functions to the dataset easily. For example, imagine you want the average height of everyone in the dataset. Using the basic R functions, you could write this:

mean(starwars$height)

## [1] NA

starwars$height means that the user wants to access the column called height from the dataset starwars. Remember that the $ symbol is how you access elements of a named list. This is the same for columns of datasets as you can see. This is then given as an argument to the function mean(). But what if the user wants the average height by species? Before dplyr, a solution to this simple problem would have required more than a single command. Now this is as easy as:

starwars %>%
  group_by(species) %>%
  summarise(mean(height))

## # A tibble: 38 x 2
##    species   `mean(height)`
##    <chr>              <dbl>
##  1 Aleena               79 
##  2 Besalisk            198 
##  3 Cerean              198 
##  4 Chagrian            196 
##  5 Clawdite            168 
##  6 Droid                NA 
##  7 Dug                 112 
##  8 Ewok                 88 
##  9 Geonosian           183 
## 10 Gungan              209.
## # ... with 28 more rows

The usefulness of the %>% (pipe operator) becomes apparent now. Without it, one would write instead:

summarise(group_by(starwars, species), mean(height))

## # A tibble: 38 x 2
##    species   `mean(height)`
##    <chr>              <dbl>
##  1 Aleena               79 
##  2 Besalisk            198 
##  3 Cerean              198 
##  4 Chagrian            196 
##  5 Clawdite            168 
##  6 Droid                NA 
##  7 Dug                 112 
##  8 Ewok                 88 
##  9 Geonosian           183 
## 10 Gungan              209.
## # ... with 28 more rows

as you can clearly see, it is much more difficult to read. Imagine now that I want the average height by species, but only for males. Again, this is very easy using %>%:

starwars %>%
  filter(gender == "male") %>%
  group_by(species) %>%
  summarise(mean(height))

## # A tibble: 32 x 2
##    species   `mean(height)`
##    <chr>              <dbl>
##  1 Aleena               79 
##  2 Besalisk            198 
##  3 Cerean              198 
##  4 Chagrian            196 
##  5 Dug                 112 
##  6 Ewok                 88 
##  7 Geonosian           183 
##  8 Gungan              209.
##  9 Human                NA 
## 10 Iktotchi            188 
## # ... with 22 more rows

Again, the %>% makes the above lines of code very easy to read. Without it, one would need to write:

summarise(group_by(filter(starwars, gender == "male"), species), mean(height))

## # A tibble: 32 x 2
##    species   `mean(height)`
##    <chr>              <dbl>
##  1 Aleena               79 
##  2 Besalisk            198 
##  3 Cerean              198 
##  4 Chagrian            196 
##  5 Dug                 112 
##  6 Ewok                 88 
##  7 Geonosian           183 
##  8 Gungan              209.
##  9 Human                NA 
## 10 Iktotchi            188 
## # ... with 22 more rows

I think you agree with me that this is not very readable. One way to make it more readable would be to save intermediary variables:

filtered_data <- filter(starwars, gender == "male")

grouped_data <- group_by(filter(starwars, gender == "male"), species)

summarise(grouped_data, mean(height))

## # A tibble: 32 x 2
##    species   `mean(height)`
##    <chr>              <dbl>
##  1 Aleena               79 
##  2 Besalisk            198 
##  3 Cerean              198 
##  4 Chagrian            196 
##  5 Dug                 112 
##  6 Ewok                 88 
##  7 Geonosian           183 
##  8 Gungan              209.
##  9 Human                NA 
## 10 Iktotchi            188 
## # ... with 22 more rows

But this can get very tedious. Once you’re used to %>%, you won’t go back to not use it.

Before continuing and to make things clearer; filter(), group_by() and summarise() are functions that are included in dplyr. %>% is actually a function from magrittr, but this package gets loaded on the fly when you load dplyr, so you do not need to worry about it. mean() is a function native to R.

The result of all these operations that use dplyr functions are actually other datasets, or tibbles. This means that you can save them in variable, and then work with these as any other datasets.

mean_height <- starwars %>%
  group_by(species) %>%
  summarise(mean(height))

class(mean_height)

## [1] "tbl_df"     "tbl"        "data.frame"

head(mean_height)

## # A tibble: 6 x 2
##   species  `mean(height)`
##   <chr>             <dbl>
## 1 Aleena               79
## 2 Besalisk            198
## 3 Cerean              198
## 4 Chagrian            196
## 5 Clawdite            168
## 6 Droid                NA

You could then write this data to disk using rio::export() for instance. If you need more than the mean of the height, you can keep adding as many functions as needed:

summary_table <- starwars %>%
  group_by(species) %>%
  summarise(ave_height = mean(height), var_height = var(height), n_obs = n())

print(summary_table)

## # A tibble: 38 x 4
##    species   ave_height var_height n_obs
##    <chr>          <dbl>      <dbl> <int>
##  1 Aleena           79         NA      1
##  2 Besalisk        198         NA      1
##  3 Cerean          198         NA      1
##  4 Chagrian        196         NA      1
##  5 Clawdite        168         NA      1
##  6 Droid            NA         NA      5
##  7 Dug             112         NA      1
##  8 Ewok             88         NA      1
##  9 Geonosian       183         NA      1
## 10 Gungan          209.       201.     3
## # ... with 28 more rows

I’ve added more functions, namely var(), to get the variance of height, and n(), which is a function from dplyr, not base R, to get the number of observations. This is quite useful, because we see that for a lot of species we only have one single individual! Let’s focus on the species for which we have more than 1 individual. Since we save all the previous operations (which produce a tibble) in a variable, we can keep going from there:

summary_table2 <- summary_table %>%
  filter(n_obs > 1)

print(summary_table2)

## # A tibble: 9 x 4
##   species  ave_height var_height n_obs
##   <chr>         <dbl>      <dbl> <int>
## 1 Droid           NA         NA      5
## 2 Gungan         209.       201.     3
## 3 Human           NA         NA     35
## 4 Kaminoan       221        128      2
## 5 Mirialan       168          8      2
## 6 Twi'lek        179          2      2
## 7 Wookiee        231         18      2
## 8 Zabrak         173          8      2
## 9 <NA>            NA         NA      5

There’s a lot of NAs; this is because by default, mean() and var() return NA if even one single observation is NA. This is good, because it forces you to look at the data to see what is going on. If you would get a number, even if there were NAs you could very easily miss these missing values. It is better for functions to fail early and often than the opposite. mean() and var() have a na.rm option that the user can set to TRUE to get the result by ignoring the NAs:

starwars %>%
  group_by(species) %>%
  summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
  filter(n_obs > 1)

## # A tibble: 9 x 4
##   species  ave_height var_height n_obs
##   <chr>         <dbl>      <dbl> <int>
## 1 Droid          140       2705.     5
## 2 Gungan         209.       201.     3
## 3 Human          177.       157.    35
## 4 Kaminoan       221        128      2
## 5 Mirialan       168          8      2
## 6 Twi'lek        179          2      2
## 7 Wookiee        231         18      2
## 8 Zabrak         173          8      2
## 9 <NA>           160       1826      5

In the code above, I have combined the two previous steps to get the result I’m interested in. There’s a line in the final output that says NA for the species. Let’s go back to the raw data and find these lines:

starwars %>%
  filter(is.na(species))

## # A tibble: 5 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Ric …    183    NA brown      fair       blue              NA male  
## 2 Quar…    183    NA black      dark       brown             62 male  
## 3 R4-P…     96    NA none       silver, r… red, blue         NA female
## 4 Sly …    178    48 none       pale       white             NA female
## 5 Capt…     NA    NA unknown    unknown    unknown           NA female
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

To test for NA, one uses the function is.na() not something like species == "NA" or anything like that. !is.na() does the opposite:

starwars %>%
  filter(!is.na(species))

## # A tibble: 82 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # ... with 72 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

The ! function negates a predicate function (a predicate function is a function that returns TRUE or FALSE). We can then rerun our analysis from before:

starwars %>%
  filter(!is.na(species)) %>%
  group_by(species) %>%
  summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
  filter(n_obs > 1)

## # A tibble: 8 x 4
##   species  ave_height var_height n_obs
##   <chr>         <dbl>      <dbl> <int>
## 1 Droid          140       2705.     5
## 2 Gungan         209.       201.     3
## 3 Human          177.       157.    35
## 4 Kaminoan       221        128      2
## 5 Mirialan       168          8      2
## 6 Twi'lek        179          2      2
## 7 Wookiee        231         18      2
## 8 Zabrak         173          8      2

And why not compute the same table, but first add another stratifying variable?

starwars %>%
  filter(!is.na(species)) %>%
  group_by(species, gender) %>%
  summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
  filter(n_obs > 1)

## # A tibble: 8 x 5
## # Groups:   species [6]
##   species  gender ave_height var_height n_obs
##   <chr>    <chr>       <dbl>      <dbl> <int>
## 1 Droid    none         200        NA       2
## 2 Droid    <NA>         120      1657       3
## 3 Gungan   male         209.      201.      3
## 4 Human    female       160.       48.8     9
## 5 Human    male         182.       67.1    26
## 6 Mirialan female       168         8       2
## 7 Wookiee  male         231        18       2
## 8 Zabrak   male         173         8       2

Ok, that’s it for a first taste. We have already discovered some very useful {dplyr} functions, filter(), group_by() and summarise summarise().

Now, we are going to learn more about these functions in more detail.

5.3.2 Filter the rows of a dataset with `filter()`

We’re going to use the Gasoline dataset from the plm package, so install that first:

install.packages("plm")

Then load the required data:

data(Gasoline, package = "plm")

and load dplyr:

library(dplyr)

This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load the data like this, it is a standard data.frame. dplyr functions can be used on standard data.frame objects, but also on tibbles. tibbles are just like data frame, but with a better print method (and other niceties). I’ll discuss the {tibble} package later, but for now, let’s convert the data to a tibble and change its name:

gasoline <- as_tibble(Gasoline)

filter() is pretty straightforward. What if you would like to subset the data to focus on the year 1969? Simple:

filter(gasoline, year == 1969)

## # A tibble: 18 x 6
##    country   year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>    <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA   1969     4.05    -6.15 -0.559    -8.79
##  2 BELGIUM   1969     3.85    -5.86 -0.355    -8.52
##  3 CANADA    1969     4.86    -5.56 -1.04     -8.10
##  4 DENMARK   1969     4.17    -5.72 -0.407    -8.47
##  5 FRANCE    1969     3.77    -5.84 -0.315    -8.37
##  6 GERMANY   1969     3.90    -5.83 -0.589    -8.44
##  7 GREECE    1969     4.89    -6.59 -0.180   -10.7 
##  8 IRELAND   1969     4.21    -6.38 -0.272    -8.95
##  9 ITALY     1969     3.74    -6.28 -0.248    -8.67
## 10 JAPAN     1969     4.52    -6.16 -0.417    -9.61
## 11 NETHERLA  1969     3.99    -5.88 -0.417    -8.63
## 12 NORWAY    1969     4.09    -5.74 -0.338    -8.69
## 13 SPAIN     1969     3.99    -5.60  0.669    -9.72
## 14 SWEDEN    1969     3.99    -7.77 -2.73     -8.20
## 15 SWITZERL  1969     4.21    -5.91 -0.918    -8.47
## 16 TURKEY    1969     5.72    -7.39 -0.298   -12.5 
## 17 U.K.      1969     3.95    -6.03 -0.383    -8.47
## 18 U.S.A.    1969     4.84    -5.41 -1.22     -7.79

Let’s use %>%, since we’re familiar with it now:

gasoline %>% filter(year == 1969)

## # A tibble: 18 x 6
##    country   year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>    <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA   1969     4.05    -6.15 -0.559    -8.79
##  2 BELGIUM   1969     3.85    -5.86 -0.355    -8.52
##  3 CANADA    1969     4.86    -5.56 -1.04     -8.10
##  4 DENMARK   1969     4.17    -5.72 -0.407    -8.47
##  5 FRANCE    1969     3.77    -5.84 -0.315    -8.37
##  6 GERMANY   1969     3.90    -5.83 -0.589    -8.44
##  7 GREECE    1969     4.89    -6.59 -0.180   -10.7 
##  8 IRELAND   1969     4.21    -6.38 -0.272    -8.95
##  9 ITALY     1969     3.74    -6.28 -0.248    -8.67
## 10 JAPAN     1969     4.52    -6.16 -0.417    -9.61
## 11 NETHERLA  1969     3.99    -5.88 -0.417    -8.63
## 12 NORWAY    1969     4.09    -5.74 -0.338    -8.69
## 13 SPAIN     1969     3.99    -5.60  0.669    -9.72
## 14 SWEDEN    1969     3.99    -7.77 -2.73     -8.20
## 15 SWITZERL  1969     4.21    -5.91 -0.918    -8.47
## 16 TURKEY    1969     5.72    -7.39 -0.298   -12.5 
## 17 U.K.      1969     3.95    -6.03 -0.383    -8.47
## 18 U.S.A.    1969     4.84    -5.41 -1.22     -7.79

You can also filter more than just one year, by using the %in% operator:

gasoline %>% filter(year %in% seq(1969, 1973))

## # A tibble: 90 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
##  2 AUSTRIA  1970     4.08    -6.08 -0.597    -8.73
##  3 AUSTRIA  1971     4.11    -6.04 -0.654    -8.64
##  4 AUSTRIA  1972     4.13    -5.98 -0.596    -8.54
##  5 AUSTRIA  1973     4.20    -5.90 -0.594    -8.49
##  6 BELGIUM  1969     3.85    -5.86 -0.355    -8.52
##  7 BELGIUM  1970     3.87    -5.80 -0.378    -8.45
##  8 BELGIUM  1971     3.87    -5.76 -0.399    -8.41
##  9 BELGIUM  1972     3.91    -5.71 -0.311    -8.36
## 10 BELGIUM  1973     3.90    -5.64 -0.373    -8.31
## # ... with 80 more rows

It is also possible use between(), a helper function:

gasoline %>% filter(between(year, 1969, 1973))

## # A tibble: 90 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
##  2 AUSTRIA  1970     4.08    -6.08 -0.597    -8.73
##  3 AUSTRIA  1971     4.11    -6.04 -0.654    -8.64
##  4 AUSTRIA  1972     4.13    -5.98 -0.596    -8.54
##  5 AUSTRIA  1973     4.20    -5.90 -0.594    -8.49
##  6 BELGIUM  1969     3.85    -5.86 -0.355    -8.52
##  7 BELGIUM  1970     3.87    -5.80 -0.378    -8.45
##  8 BELGIUM  1971     3.87    -5.76 -0.399    -8.41
##  9 BELGIUM  1972     3.91    -5.71 -0.311    -8.36
## 10 BELGIUM  1973     3.90    -5.64 -0.373    -8.31
## # ... with 80 more rows

To select non-consecutive years:

gasoline %>% filter(year %in% c(1969, 1973, 1977))

## # A tibble: 54 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
##  2 AUSTRIA  1973     4.20    -5.90 -0.594    -8.49
##  3 AUSTRIA  1977     3.93    -5.83 -0.422    -8.25
##  4 BELGIUM  1969     3.85    -5.86 -0.355    -8.52
##  5 BELGIUM  1973     3.90    -5.64 -0.373    -8.31
##  6 BELGIUM  1977     3.85    -5.56 -0.432    -8.14
##  7 CANADA   1969     4.86    -5.56 -1.04     -8.10
##  8 CANADA   1973     4.90    -5.41 -1.13     -7.94
##  9 CANADA   1977     4.81    -5.34 -1.07     -7.77
## 10 DENMARK  1969     4.17    -5.72 -0.407    -8.47
## # ... with 44 more rows

%in% tests if an object is part of a set.

5.3.3 Select columns with `select()`

While filter() allows you to keep or discard rows of data, select() allows you to keep or discard entire columns. To keep columns:

gasoline %>% select(country, year, lrpmg)

## # A tibble: 342 x 3
##    country  year  lrpmg
##  * <fct>   <int>  <dbl>
##  1 AUSTRIA  1960 -0.335
##  2 AUSTRIA  1961 -0.351
##  3 AUSTRIA  1962 -0.380
##  4 AUSTRIA  1963 -0.414
##  5 AUSTRIA  1964 -0.445
##  6 AUSTRIA  1965 -0.497
##  7 AUSTRIA  1966 -0.467
##  8 AUSTRIA  1967 -0.506
##  9 AUSTRIA  1968 -0.522
## 10 AUSTRIA  1969 -0.559
## # ... with 332 more rows

To discard them:

gasoline %>% select(-country, -year, -lrpmg)

## # A tibble: 342 x 3
##    lgaspcar lincomep lcarpcap
##  *    <dbl>    <dbl>    <dbl>
##  1     4.17    -6.47    -9.77
##  2     4.10    -6.43    -9.61
##  3     4.07    -6.41    -9.46
##  4     4.06    -6.37    -9.34
##  5     4.04    -6.32    -9.24
##  6     4.03    -6.29    -9.12
##  7     4.05    -6.25    -9.02
##  8     4.05    -6.23    -8.93
##  9     4.05    -6.21    -8.85
## 10     4.05    -6.15    -8.79
## # ... with 332 more rows

To rename them:

gasoline %>% select(country, date = year, lrpmg)

## # A tibble: 342 x 3
##    country  date  lrpmg
##  * <fct>   <int>  <dbl>
##  1 AUSTRIA  1960 -0.335
##  2 AUSTRIA  1961 -0.351
##  3 AUSTRIA  1962 -0.380
##  4 AUSTRIA  1963 -0.414
##  5 AUSTRIA  1964 -0.445
##  6 AUSTRIA  1965 -0.497
##  7 AUSTRIA  1966 -0.467
##  8 AUSTRIA  1967 -0.506
##  9 AUSTRIA  1968 -0.522
## 10 AUSTRIA  1969 -0.559
## # ... with 332 more rows

There’s also rename():

gasoline %>% rename(date = year)

## # A tibble: 342 x 6
##    country  date lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

rename() does not do any kind of selection, but just renames.

You can also use select() to re-order columns:

gasoline %>% select(year, country, lrpmg, everything())

## # A tibble: 342 x 6
##     year country  lrpmg lgaspcar lincomep lcarpcap
##  * <int> <fct>    <dbl>    <dbl>    <dbl>    <dbl>
##  1  1960 AUSTRIA -0.335     4.17    -6.47    -9.77
##  2  1961 AUSTRIA -0.351     4.10    -6.43    -9.61
##  3  1962 AUSTRIA -0.380     4.07    -6.41    -9.46
##  4  1963 AUSTRIA -0.414     4.06    -6.37    -9.34
##  5  1964 AUSTRIA -0.445     4.04    -6.32    -9.24
##  6  1965 AUSTRIA -0.497     4.03    -6.29    -9.12
##  7  1966 AUSTRIA -0.467     4.05    -6.25    -9.02
##  8  1967 AUSTRIA -0.506     4.05    -6.23    -8.93
##  9  1968 AUSTRIA -0.522     4.05    -6.21    -8.85
## 10  1969 AUSTRIA -0.559     4.05    -6.15    -8.79
## # ... with 332 more rows

everything() is a helper function, and there’s also starts_with(), and ends_with(). For example, what if we are only interested in columns whose name start with “l”?

gasoline %>% select(starts_with("l"))

## # A tibble: 342 x 4
##    lgaspcar lincomep  lrpmg lcarpcap
##  *    <dbl>    <dbl>  <dbl>    <dbl>
##  1     4.17    -6.47 -0.335    -9.77
##  2     4.10    -6.43 -0.351    -9.61
##  3     4.07    -6.41 -0.380    -9.46
##  4     4.06    -6.37 -0.414    -9.34
##  5     4.04    -6.32 -0.445    -9.24
##  6     4.03    -6.29 -0.497    -9.12
##  7     4.05    -6.25 -0.467    -9.02
##  8     4.05    -6.23 -0.506    -8.93
##  9     4.05    -6.21 -0.522    -8.85
## 10     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

ends_with() works in a similar fashion. There is also contains():

gasoline %>% select(country, year, contains("car"))

## # A tibble: 342 x 4
##    country  year lgaspcar lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -9.77
##  2 AUSTRIA  1961     4.10    -9.61
##  3 AUSTRIA  1962     4.07    -9.46
##  4 AUSTRIA  1963     4.06    -9.34
##  5 AUSTRIA  1964     4.04    -9.24
##  6 AUSTRIA  1965     4.03    -9.12
##  7 AUSTRIA  1966     4.05    -9.02
##  8 AUSTRIA  1967     4.05    -8.93
##  9 AUSTRIA  1968     4.05    -8.85
## 10 AUSTRIA  1969     4.05    -8.79
## # ... with 332 more rows

Another verb, similar to select(), is pull(). Let’s compare the two:

gasoline %>% select(lrpmg)

## # A tibble: 342 x 1
##     lrpmg
##  *  <dbl>
##  1 -0.335
##  2 -0.351
##  3 -0.380
##  4 -0.414
##  5 -0.445
##  6 -0.497
##  7 -0.467
##  8 -0.506
##  9 -0.522
## 10 -0.559
## # ... with 332 more rows

gasoline %>% pull(lrpmg)

##   [1] -0.33454761 -0.35132761 -0.37951769 -0.41425139 -0.44533536
##   [6] -0.49706066 -0.46683773 -0.50588340 -0.52241255 -0.55911051
##  [11] -0.59656122 -0.65445914 -0.59633184 -0.59444681 -0.46602693
##  [16] -0.45414221 -0.50008372 -0.42191563 -0.46960312 -0.16570961
##  [21] -0.17173098 -0.22229138 -0.25046225 -0.27591057 -0.34493695
##  [26] -0.23639770 -0.26699499 -0.31116076 -0.35480852 -0.37794044
##  [31] -0.39922992 -0.31064584 -0.37309192 -0.36223563 -0.36430848
##  [36] -0.37896584 -0.43164133 -0.59094964 -0.97210650 -0.97229024
##  [41] -0.97860756 -1.01904791 -1.00285696 -1.01712549 -1.01694436
##  [46] -1.02359713 -1.01984524 -1.03686389 -1.06733308 -1.05803676
##  [51] -1.09966703 -1.13316142 -1.12379997 -1.18568427 -1.06179659
##  [56] -1.07084448 -1.07495073 -0.19570260 -0.25361844 -0.21875400
##  [61] -0.24800936 -0.30654923 -0.32701542 -0.39618846 -0.44257369
##  [66] -0.35204752 -0.40687922 -0.44046082 -0.45473954 -0.49918863
##  [71] -0.43257185 -0.42517720 -0.39395431 -0.35361534 -0.35690917
##  [76] -0.29068135 -0.01959833 -0.02386000 -0.06892022 -0.13792900
##  [81] -0.19784646 -0.23365325 -0.26427164 -0.29405795 -0.32316179
##  [86] -0.31519087 -0.33384616 -0.37945667 -0.40781642 -0.47503429
##  [91] -0.21698191 -0.25838174 -0.24651309 -0.22550681 -0.38075942
##  [96] -0.18591078 -0.23095384 -0.34384171 -0.37464672 -0.39965256
## [101] -0.43987825 -0.54000197 -0.54998139 -0.43824222 -0.58923137
## [106] -0.63329520 -0.67176311 -0.71797458 -0.72587521 -0.56982876
## [111] -0.56482380 -0.62481298 -0.59761210 -0.62817279 -0.08354740
## [116] -0.10421997 -0.13320751 -0.15653576 -0.18051772 -0.07793999
## [121] -0.11491900 -0.13775849 -0.15375883 -0.17986997 -0.20252426
## [126] -0.06761078 -0.11973059 -0.05191029  0.31625351  0.20631574
## [131]  0.19319312  0.23502961  0.16896037 -0.07648118 -0.12040874
## [136] -0.14160039 -0.15232915 -0.24428212 -0.16899366 -0.21071901
## [141] -0.17383533 -0.21339314 -0.27162842 -0.32069023 -0.36041067
## [146] -0.42393131 -0.64567297 -0.55343875 -0.64126416 -0.66134256
## [151] -0.56011483 -0.66277808  0.16507708 -0.08559038 -0.18351291
## [156] -0.26541405 -0.42609643 -0.32712637 -0.24887418 -0.19160048
## [161] -0.20616656 -0.24756681 -0.23271512 -0.14822267 -0.21508857
## [166] -0.32508487 -0.22290860 -0.03270913  0.10292798  0.16418805
## [171]  0.03482212 -0.14532271 -0.14874940 -0.18731459 -0.19996473
## [176] -0.20386433 -0.23786571 -0.27411537 -0.33167240 -0.35126918
## [181] -0.41685019 -0.46203546 -0.43941354 -0.52100094 -0.46270739
## [186] -0.19090636 -0.15948473 -0.20726559 -0.21904447 -0.28707638
## [191] -0.20148480 -0.21599265 -0.25968008 -0.29718661 -0.36929389
## [196] -0.34197503 -0.34809007 -0.31232019 -0.44450431 -0.41694955
## [201] -0.39954544 -0.43393029 -0.31903240 -0.42728193 -0.35253685
## [206] -0.43426178 -0.42908393 -0.46474195 -0.55791459 -0.13968957
## [211] -0.15790514 -0.19908809 -0.23263318 -0.26374731 -0.31593124
## [216] -0.25011726 -0.26555763 -0.30036775 -0.33823045 -0.39072560
## [221] -0.30127223 -0.26023925 -0.33880765 -0.15100924 -0.32726757
## [226] -0.35308752 -0.38255762 -0.30765935  1.12531070  1.10956235
## [231]  1.05700394  0.97683534  0.91532254  0.81666055  0.75671751
## [236]  0.74130811  0.70386453  0.66948950  0.61217208  0.60699563
## [241]  0.53716844  0.43377166  0.52492096  0.62955545  0.68385409
## [246]  0.52627167  0.62141374 -2.52041588 -2.57148340 -2.53448158
## [251] -2.60511224 -2.65801626 -2.64476790 -2.63901460 -2.65609762
## [256] -2.67918662 -2.73190414 -2.73359211 -2.77884554 -2.77467537
## [261] -2.84142900 -2.79840677 -2.76731461 -2.82294480 -2.82005896
## [266] -2.89649671 -0.82321833 -0.86558473 -0.82218510 -0.86012004
## [271] -0.86767682 -0.90528668 -0.85956665 -0.90656671 -0.87232520
## [276] -0.91812162 -0.96344188 -1.03746081 -0.94015345 -0.86722756
## [281] -0.88692306 -0.88475790 -0.90736205 -0.91147285 -1.03208811
## [286] -0.25340821 -0.34252375 -0.40820484 -0.22499174 -0.25219448
## [291] -0.29347614 -0.35640491 -0.33515022 -0.36507386 -0.29845417
## [296] -0.39882648 -0.30461880 -0.54637424 -0.69162023 -0.33965308
## [301] -0.53794675 -0.75141027 -0.95552413 -0.35290961 -0.39108581
## [306] -0.45185308 -0.42287690 -0.46335147 -0.49577430 -0.42654915
## [311] -0.47068145 -0.44118786 -0.46245080 -0.38332457 -0.41899030
## [316] -0.46135978 -0.52777246 -0.56529718 -0.56641296 -0.20867428
## [321] -0.27354010 -0.50886285 -0.78652911 -1.12111489 -1.14624034
## [326] -1.16187449 -1.17991524 -1.20026222 -1.19428750 -1.19026054
## [331] -1.18991215 -1.20730059 -1.22314272 -1.25176347 -1.28131560
## [336] -1.33116930 -1.29066967 -1.23146686 -1.20037697 -1.15468197
## [341] -1.17590974 -1.21206183

pull(), unlike select(), does not return a tibble, but only the column you want.

5.3.4 Group the observations of your dataset with `group_by()`

group_by() is a very useful verb; as the name implies, it allows you to create groups and then, for example, compute descriptive statistics by groups. For example, let’s group our data by country:

gasoline %>% group_by(country)

## # A tibble: 342 x 6
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

It looks like nothing much happened, but if you look at the second line of the output you can read the following:

## # Groups:   country [18]

this means that the data is grouped, and every computation you will do now will take these groups into account. It is also possible to group by more than one variable:

gasoline %>% group_by(country, year)

## # A tibble: 342 x 6
## # Groups:   country, year [342]
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

and so on. You can then also ungroup:

gasoline %>% group_by(country, year) %>% ungroup()

## # A tibble: 342 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

Once your data is grouped, the operations that will follow will be executed inside each group.

5.3.5 Get summary statistics with `summarise()`

Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For example, one might want to compute the average gasoline consumption in each country, for the whole period:

gasoline %>%
  group_by(country) %>%
  summarise(mean(lgaspcar))

## # A tibble: 18 x 2
##    country  `mean(lgaspcar)`
##    <fct>               <dbl>
##  1 AUSTRIA              4.06
##  2 BELGIUM              3.92
##  3 CANADA               4.86
##  4 DENMARK              4.19
##  5 FRANCE               3.82
##  6 GERMANY              3.89
##  7 GREECE               4.88
##  8 IRELAND              4.23
##  9 ITALY                3.73
## 10 JAPAN                4.70
## 11 NETHERLA             4.08
## 12 NORWAY               4.11
## 13 SPAIN                4.06
## 14 SWEDEN               4.01
## 15 SWITZERL             4.24
## 16 TURKEY               5.77
## 17 U.K.                 3.98
## 18 U.S.A.               4.82

mean() was given as an argument to summarise(), which is a dplyr verb. What we get is another tibble, that contains the variable we used to group, as well as the average per country. We can also rename this column:

gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar))

## # A tibble: 18 x 2
##    country  mean_gaspcar
##    <fct>           <dbl>
##  1 AUSTRIA          4.06
##  2 BELGIUM          3.92
##  3 CANADA           4.86
##  4 DENMARK          4.19
##  5 FRANCE           3.82
##  6 GERMANY          3.89
##  7 GREECE           4.88
##  8 IRELAND          4.23
##  9 ITALY            3.73
## 10 JAPAN            4.70
## 11 NETHERLA         4.08
## 12 NORWAY           4.11
## 13 SPAIN            4.06
## 14 SWEDEN           4.01
## 15 SWITZERL         4.24
## 16 TURKEY           5.77
## 17 U.K.             3.98
## 18 U.S.A.           4.82

and because the output is a tibble, we can continue to use dplyr verbs on it:

gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar)) %>%
  filter(country == "france")

## # A tibble: 0 x 2
## # ... with 2 variables: country <fct>, mean_gaspcar <dbl>

summarise() is a very useful verb. For example, we can compute several descriptive statistics at once:

gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))

## # A tibble: 18 x 5
##    country  mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
##    <fct>           <dbl>      <dbl>       <dbl>       <dbl>
##  1 AUSTRIA          4.06     0.0693        4.20        3.92
##  2 BELGIUM          3.92     0.103         4.16        3.82
##  3 CANADA           4.86     0.0262        4.90        4.81
##  4 DENMARK          4.19     0.158         4.50        4.00
##  5 FRANCE           3.82     0.0499        3.91        3.75
##  6 GERMANY          3.89     0.0239        3.93        3.85
##  7 GREECE           4.88     0.255         5.38        4.48
##  8 IRELAND          4.23     0.0437        4.33        4.16
##  9 ITALY            3.73     0.220         4.05        3.38
## 10 JAPAN            4.70     0.684         6.00        3.95
## 11 NETHERLA         4.08     0.286         4.65        3.71
## 12 NORWAY           4.11     0.123         4.44        3.96
## 13 SPAIN            4.06     0.317         4.75        3.62
## 14 SWEDEN           4.01     0.0364        4.07        3.91
## 15 SWITZERL         4.24     0.102         4.44        4.05
## 16 TURKEY           5.77     0.329         6.16        5.14
## 17 U.K.             3.98     0.0479        4.10        3.91
## 18 U.S.A.           4.82     0.0219        4.86        4.79

Because the output is a tibble, you can save it in a variable of course:

desc_gasoline <- gasoline %>%
  group_by(country) %>%
  summarise(mean_gaspcar = mean(lgaspcar),
            sd_gaspcar = sd(lgaspcar),
            max_gaspcar = max(lgaspcar),
            min_gaspcar = min(lgaspcar))

And then you can answer questions such as, which country has the maximum average gasoline consumption?:

desc_gasoline %>%
  filter(max(mean_gaspcar) == mean_gaspcar)

## # A tibble: 1 x 5
##   country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
##   <fct>          <dbl>      <dbl>       <dbl>       <dbl>
## 1 TURKEY          5.77      0.329        6.16        5.14

Turns out it’s Turkey. What about the minimum consumption?

desc_gasoline %>%
  filter(min(mean_gaspcar) == mean_gaspcar)

## # A tibble: 1 x 5
##   country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
##   <fct>          <dbl>      <dbl>       <dbl>       <dbl>
## 1 ITALY           3.73      0.220        4.05        3.38

Because the output of dplyr verbs is a tibble, it is possible to continue working with it. This is one shortcoming of using the base summary() function. The object returned by that function is not very easy to manipulate.

5.3.6 Adding columns with `mutate()` and `transmute()`

mutate() adds a column to the tibble, which can contain any transformation of any other variable:

gasoline %>%
  group_by(country) %>%
  mutate(n())

## # A tibble: 342 x 7
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap `n()`
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl> <int>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77    19
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61    19
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46    19
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34    19
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24    19
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12    19
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02    19
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93    19
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85    19
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79    19
## # ... with 332 more rows

Using mutate() I’ve added a column that counts how many times the country appears in the tibble, using n(), another dplyr function. There’s also count() and tally(), which we are going to see further down. It is also possible to rename the column on the fly:

gasoline %>%
  group_by(country) %>%
  mutate(count = n())

## # A tibble: 342 x 7
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap count
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl> <int>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77    19
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61    19
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46    19
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34    19
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24    19
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12    19
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02    19
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93    19
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85    19
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79    19
## # ... with 332 more rows

It is possible to do any arbitrary operation:

gasoline %>%
  group_by(country) %>%
  mutate(spam = exp(lgaspcar + lincomep))

## # A tibble: 342 x 7
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap   spam
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77 0.100 
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61 0.0978
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46 0.0969
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34 0.0991
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24 0.102 
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12 0.104 
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02 0.110 
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93 0.113 
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85 0.115 
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79 0.122 
## # ... with 332 more rows

transmute() is the same as mutate(), but only returns the created variable:

gasoline %>%
  group_by(country) %>%
  transmute(spam = exp(lgaspcar + lincomep))

## # A tibble: 342 x 2
## # Groups:   country [18]
##    country   spam
##    <fct>    <dbl>
##  1 AUSTRIA 0.100 
##  2 AUSTRIA 0.0978
##  3 AUSTRIA 0.0969
##  4 AUSTRIA 0.0991
##  5 AUSTRIA 0.102 
##  6 AUSTRIA 0.104 
##  7 AUSTRIA 0.110 
##  8 AUSTRIA 0.113 
##  9 AUSTRIA 0.115 
## 10 AUSTRIA 0.122 
## # ... with 332 more rows

5.3.7 Joining `tibble`s with `full_join()`, `left_join()`, `right_join()` and all the others

I will end this section on dplyr with the very useful verbs: the *_join() verbs. Let’s first start by loading another dataset from the plm package. SumHes and let’s convert it to tibble and rename it:

data(SumHes, package = "plm")

pwt <- SumHes %>%
  as_tibble() %>%
  mutate(country = tolower(country))

Let’s take a quick look at the data:

glimpse(pwt)

## Observations: 3,250
## Variables: 7
## $ year    <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, ...
## $ country <chr> "algeria", "algeria", "algeria", "algeria", "algeria",...
## $ opec    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ com     <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ pop     <int> 10800, 11016, 11236, 11460, 11690, 11923, 12267, 12622...
## $ gdp     <int> 1723, 1599, 1275, 1517, 1589, 1584, 1548, 1600, 1758, ...
## $ sr      <dbl> 19.9, 21.1, 15.0, 13.9, 10.6, 11.0, 8.3, 11.3, 15.1, 1...

We can merge both gasoline and pwt by country and year, as these two variables are common to both datasets. There are more countries and years in the pwt dataset, so when merging both, and depending on which function you use, you will either have NA’s for the variables where there is no match, or rows that will be dropped. Let’s start with full_join:

gas_pwt_full <- gasoline %>%
  full_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

Let’s see which countries and years are included:

gas_pwt_full %>%
  count(country, year)

## # A tibble: 3,592 x 3
##    country  year     n
##    <chr>   <int> <int>
##  1 algeria  1960     1
##  2 algeria  1961     1
##  3 algeria  1962     1
##  4 algeria  1963     1
##  5 algeria  1964     1
##  6 algeria  1965     1
##  7 algeria  1966     1
##  8 algeria  1967     1
##  9 algeria  1968     1
## 10 algeria  1969     1
## # ... with 3,582 more rows

As you see, every country and year was included, but what happened for, say, the U.S.S.R? This country is in pwt but not in gasoline at all:

gas_pwt_full %>%
  filter(country == "u.s.s.r.")

## # A tibble: 26 x 11
##    country  year lgaspcar lincomep lrpmg lcarpcap opec  com      pop   gdp
##    <chr>   <int>    <dbl>    <dbl> <dbl>    <dbl> <fct> <fct>  <int> <int>
##  1 u.s.s.…  1960       NA       NA    NA       NA no    yes   214400  2397
##  2 u.s.s.…  1961       NA       NA    NA       NA no    yes   217896  2542
##  3 u.s.s.…  1962       NA       NA    NA       NA no    yes   221449  2656
##  4 u.s.s.…  1963       NA       NA    NA       NA no    yes   225060  2681
##  5 u.s.s.…  1964       NA       NA    NA       NA no    yes   227571  2854
##  6 u.s.s.…  1965       NA       NA    NA       NA no    yes   230109  3049
##  7 u.s.s.…  1966       NA       NA    NA       NA no    yes   232676  3247
##  8 u.s.s.…  1967       NA       NA    NA       NA no    yes   235272  3454
##  9 u.s.s.…  1968       NA       NA    NA       NA no    yes   237896  3730
## 10 u.s.s.…  1969       NA       NA    NA       NA no    yes   240550  3808
## # ... with 16 more rows, and 1 more variable: sr <dbl>

As you probably guessed, the variables from gasoline that are not included in pwt are filled with NAs. One could remove all these lines and only keep countries for which these variables are not NA everywhere with filter(), but there is a simpler solution:

gas_pwt_inner <- gasoline %>%
  inner_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

Let’s use the tabyl() from the janitor packages which is a very nice alternative to the table() function from base R:

library(janitor)

gas_pwt_inner %>%
  tabyl(country)

## [1] country n       percent
## <0 rows> (or 0-length row.names)

Only countries with values in both datasets were returned. It’s almost every country from gasoline, apart from Germany (called “germany west” in pwt and “germany” in gasoline. I left it as is to provide an example of a country not in pwt). Let’s also look at the variables:

glimpse(gas_pwt_inner)

## Observations: 0
## Variables: 11
## $ country  <chr> 
## $ year     <int> 
## $ lgaspcar <dbl> 
## $ lincomep <dbl> 
## $ lrpmg    <dbl> 
## $ lcarpcap <dbl> 
## $ opec     <fct> 
## $ com      <fct> 
## $ pop      <int> 
## $ gdp      <int> 
## $ sr       <dbl>

The variables from both datasets are in the joined data.

Contrast this to semi_join():

gas_pwt_semi <- gasoline %>%
  semi_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

glimpse(gas_pwt_semi)

## Observations: 0
## Variables: 6
## $ country  <fct> 
## $ year     <int> 
## $ lgaspcar <dbl> 
## $ lincomep <dbl> 
## $ lrpmg    <dbl> 
## $ lcarpcap <dbl>

gas_pwt_semi %>%
  tabyl(country)

##   country n percent
##   AUSTRIA 0     NaN
##   BELGIUM 0     NaN
##    CANADA 0     NaN
##   DENMARK 0     NaN
##    FRANCE 0     NaN
##   GERMANY 0     NaN
##    GREECE 0     NaN
##   IRELAND 0     NaN
##     ITALY 0     NaN
##     JAPAN 0     NaN
##  NETHERLA 0     NaN
##    NORWAY 0     NaN
##     SPAIN 0     NaN
##    SWEDEN 0     NaN
##  SWITZERL 0     NaN
##    TURKEY 0     NaN
##      U.K. 0     NaN
##    U.S.A. 0     NaN

Only columns of gasoline are returned, and only rows of gasoline that were matched with rows from pwt. semi_join() is not a commutative operation:

pwt_gas_semi <- pwt %>%
  semi_join(gasoline, by = c("country", "year"))

## Warning: Column `country` joining character vector and factor, coercing
## into character vector

glimpse(pwt_gas_semi)

## Observations: 0
## Variables: 7
## $ year    <int> 
## $ country <chr> 
## $ opec    <fct> 
## $ com     <fct> 
## $ pop     <int> 
## $ gdp     <int> 
## $ sr      <dbl>

gas_pwt_semi %>%
  tabyl(country)

##   country n percent
##   AUSTRIA 0     NaN
##   BELGIUM 0     NaN
##    CANADA 0     NaN
##   DENMARK 0     NaN
##    FRANCE 0     NaN
##   GERMANY 0     NaN
##    GREECE 0     NaN
##   IRELAND 0     NaN
##     ITALY 0     NaN
##     JAPAN 0     NaN
##  NETHERLA 0     NaN
##    NORWAY 0     NaN
##     SPAIN 0     NaN
##    SWEDEN 0     NaN
##  SWITZERL 0     NaN
##    TURKEY 0     NaN
##      U.K. 0     NaN
##    U.S.A. 0     NaN

The rows are the same, but not the columns.

left_join() and right_join() return all the rows from either the dataset that is on the “left” (the first argument of the fonction) or on the “right” (the second argument of the function) but all columns from both datasets. So depending on which countries you’re interested in, you’re going to use either one of these functions:

gas_pwt_left <- gasoline %>%
  left_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

gas_pwt_left %>%
  tabyl(country)

##   country  n    percent
##   AUSTRIA 19 0.05555556
##   BELGIUM 19 0.05555556
##    CANADA 19 0.05555556
##   DENMARK 19 0.05555556
##    FRANCE 19 0.05555556
##   GERMANY 19 0.05555556
##    GREECE 19 0.05555556
##   IRELAND 19 0.05555556
##     ITALY 19 0.05555556
##     JAPAN 19 0.05555556
##  NETHERLA 19 0.05555556
##    NORWAY 19 0.05555556
##     SPAIN 19 0.05555556
##    SWEDEN 19 0.05555556
##  SWITZERL 19 0.05555556
##    TURKEY 19 0.05555556
##      U.K. 19 0.05555556
##    U.S.A. 19 0.05555556

gas_pwt_right <- gasoline %>%
  right_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

gas_pwt_right %>%
  tabyl(country)

##          country  n percent
##          algeria 26   0.008
##           angola 26   0.008
##        argentina 26   0.008
##        australia 26   0.008
##          austria 26   0.008
##       bangladesh 26   0.008
##         barbados 26   0.008
##          belgium 26   0.008
##            benin 26   0.008
##          bolivia 26   0.008
##         botswana 26   0.008
##           brazil 26   0.008
##     burkina faso 26   0.008
##          burundi 26   0.008
##         cameroon 26   0.008
##           canada 26   0.008
##   cape verde is. 26   0.008
##   central afr.r. 26   0.008
##             chad 26   0.008
##            chile 26   0.008
##            china 26   0.008
##         colombia 26   0.008
##          comoros 26   0.008
##            congo 26   0.008
##       costa rica 26   0.008
##           cyprus 26   0.008
##   czechoslovakia 26   0.008
##          denmark 26   0.008
##   dominican rep. 26   0.008
##          ecuador 26   0.008
##            egypt 26   0.008
##      el salvador 26   0.008
##         ethiopia 26   0.008
##             fiji 26   0.008
##          finland 26   0.008
##           france 26   0.008
##            gabon 26   0.008
##           gambia 26   0.008
##     germany west 26   0.008
##            ghana 26   0.008
##           greece 26   0.008
##        guatemala 26   0.008
##           guinea 26   0.008
##      guinea-biss 26   0.008
##           guyana 26   0.008
##            haiti 26   0.008
##         honduras 26   0.008
##        hong kong 26   0.008
##          iceland 26   0.008
##            india 26   0.008
##        indonesia 26   0.008
##             iran 26   0.008
##             iraq 26   0.008
##          ireland 26   0.008
##           israel 26   0.008
##            italy 26   0.008
##      ivory coast 26   0.008
##          jamaica 26   0.008
##            japan 26   0.008
##           jordan 26   0.008
##            kenya 26   0.008
##            korea 26   0.008
##          lesotho 26   0.008
##          liberia 26   0.008
##       luxembourg 26   0.008
##       madagascar 26   0.008
##           malawi 26   0.008
##         malaysia 26   0.008
##             mali 26   0.008
##            malta 26   0.008
##       mauritania 26   0.008
##        mauritius 26   0.008
##           mexico 26   0.008
##          morocco 26   0.008
##       mozambique 26   0.008
##          myanmar 26   0.008
##          namibia 26   0.008
##            nepal 26   0.008
##      netherlands 26   0.008
##      new zealand 26   0.008
##        nicaragua 26   0.008
##            niger 26   0.008
##          nigeria 26   0.008
##           norway 26   0.008
##         pakistan 26   0.008
##           panama 26   0.008
##  papua n.guinea  26   0.008
##         paraguay 26   0.008
##             peru 26   0.008
##      philippines 26   0.008
##         portugal 26   0.008
##      puerto rico 26   0.008
##          reunion 26   0.008
##          romania 26   0.008
##           rwanda 26   0.008
##     saudi arabia 26   0.008
##          senegal 26   0.008
##       seychelles 26   0.008
##        singapore 26   0.008
##          somalia 26   0.008
##     south africa 26   0.008
##            spain 26   0.008
##        sri lanka 26   0.008
##         suriname 26   0.008
##        swaziland 26   0.008
##           sweden 26   0.008
##      switzerland 26   0.008
##            syria 26   0.008
##           taiwan 26   0.008
##         tanzania 26   0.008
##         thailand 26   0.008
##             togo 26   0.008
##  trinidad&tobago 26   0.008
##          tunisia 26   0.008
##           turkey 26   0.008
##             u.k. 26   0.008
##           u.s.a. 26   0.008
##         u.s.s.r. 26   0.008
##           uganda 26   0.008
##          uruguay 26   0.008
##        venezuela 26   0.008
##       yugoslavia 26   0.008
##            zaire 26   0.008
##           zambia 26   0.008
##         zimbabwe 26   0.008

The last merge function is anti_join():

gas_pwt_anti <- gasoline %>%
  anti_join(pwt, by = c("country", "year"))

## Warning: Column `country` joining factor and character vector, coercing
## into character vector

glimpse(gas_pwt_anti)

## Observations: 342
## Variables: 6
## $ country  <fct> AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA,...
## $ year     <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,...
## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.0...
## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247...
## $ lrpmg    <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.44...
## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739...

gas_pwt_anti %>%
  tabyl(country)

##   country  n    percent
##   AUSTRIA 19 0.05555556
##   BELGIUM 19 0.05555556
##    CANADA 19 0.05555556
##   DENMARK 19 0.05555556
##    FRANCE 19 0.05555556
##   GERMANY 19 0.05555556
##    GREECE 19 0.05555556
##   IRELAND 19 0.05555556
##     ITALY 19 0.05555556
##     JAPAN 19 0.05555556
##  NETHERLA 19 0.05555556
##    NORWAY 19 0.05555556
##     SPAIN 19 0.05555556
##    SWEDEN 19 0.05555556
##  SWITZERL 19 0.05555556
##    TURKEY 19 0.05555556
##      U.K. 19 0.05555556
##    U.S.A. 19 0.05555556

gas_pwt_anti has the columns the gasoline dataset as well as the only country from gasoline that is not in pwt: “germany”.

That was it for the basic {dplyr} verbs. Next, we’re going to learn about {tidyr}.

5.4 Reshaping data with `tidyr`

Another important package from the tidyverse that goes hand in hand with dplyr is tidyr. tidyr is the package you need when it’s time to reshape data. The basic functions from tidyr, spread() and gather() make it possible to go from long to wide datasets respectively.

5.4.1 `spread()` and `gather()`

Let’s first create a fake dataset:

library(tidyr)

survey_data <- tribble(
  ~id, ~variable, ~value,
  1, "var1", 1,
  1, "var2", 0.2,
  NA, "var3", 0.3,
  2, "var1", 1.4,
  2, "var2", 1.9,
  2, "var3", 4.1,
  3, "var1", 0.1,
  3, "var2", 2.8,
  3, "var3", 8.9,
  4, "var1", 1.7,
  NA, "var2", 1.9,
  4, "var3", 7.6
)

head(survey_data)

## # A tibble: 6 x 3
##      id variable value
##   <dbl> <chr>    <dbl>
## 1     1 var1       1  
## 2     1 var2       0.2
## 3    NA var3       0.3
## 4     2 var1       1.4
## 5     2 var2       1.9
## 6     2 var3       4.1

I used the tribble() function from the {tibble} package to create this fake dataset. I’ll discuss this package later, for now, let’s focus on {tidyr}.

survey_data is a long dataset. We can reshape it to be wide using the spread() function:

wide_data <- survey_data %>%
  spread(variable, value)

head(wide_data)

## # A tibble: 5 x 4
##      id  var1  var2  var3
##   <dbl> <dbl> <dbl> <dbl>
## 1     1   1     0.2  NA  
## 2     2   1.4   1.9   4.1
## 3     3   0.1   2.8   8.9
## 4     4   1.7  NA     7.6
## 5    NA  NA     1.9   0.3

This means that we spread the column called “variable”, which will produce one column per category of “variable”. Then we fill in the rows with the data contained in the column “value”.

To go from a wide dataset to a long one, we use gather():

long_data <- wide_data %>%
  gather(variable, value, var1, var2)

print(long_data)

## # A tibble: 10 x 4
##       id  var3 variable value
##    <dbl> <dbl> <chr>    <dbl>
##  1     1  NA   var1       1  
##  2     2   4.1 var1       1.4
##  3     3   8.9 var1       0.1
##  4     4   7.6 var1       1.7
##  5    NA   0.3 var1      NA  
##  6     1  NA   var2       0.2
##  7     2   4.1 var2       1.9
##  8     3   8.9 var2       2.8
##  9     4   7.6 var2      NA  
## 10    NA   0.3 var2       1.9

long_data and survey_data are the same datasets, but in a different order.

In the wide_data tibble, we had 3 columns: id, var1 and var2. We want to stack ‘var1’ and ‘var2’ in a new column, that we choose to call “variable”. This is the “key”. For the value, we are using the values contained in var1 and var2. Sometimes using spread() or gather() requires some trial and error. I advise you play around with the examples above to really grasp how these powerful functions work.

5.4.2 `fill()` and `full_seq()`

fill() is pretty useful to… fill in missing values. For instance, in survey_data, some “id”s are missing:

survey_data

## # A tibble: 12 x 3
##       id variable value
##    <dbl> <chr>    <dbl>
##  1     1 var1       1  
##  2     1 var2       0.2
##  3    NA var3       0.3
##  4     2 var1       1.4
##  5     2 var2       1.9
##  6     2 var3       4.1
##  7     3 var1       0.1
##  8     3 var2       2.8
##  9     3 var3       8.9
## 10     4 var1       1.7
## 11    NA var2       1.9
## 12     4 var3       7.6

It seems pretty obvious that the first NA is supposed to be 1 and the second missing is supposed to be 4. With fill(), this is pretty easy to achieve:

survey_data %>% 
    fill(.direction = "down", variable = "id")

## # A tibble: 12 x 3
##       id variable value
##    <dbl> <chr>    <dbl>
##  1     1 var1       1  
##  2     1 var2       0.2
##  3     1 var3       0.3
##  4     2 var1       1.4
##  5     2 var2       1.9
##  6     2 var3       4.1
##  7     3 var1       0.1
##  8     3 var2       2.8
##  9     3 var3       8.9
## 10     4 var1       1.7
## 11     4 var2       1.9
## 12     4 var3       7.6

full_seq() is similar:

full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1)

## [1] "2018-08-01" "2018-08-02" "2018-08-03"

We can add this as the date column to our survey data:

survey_data %>%
    mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4))

## # A tibble: 12 x 4
##       id variable value date      
##    <dbl> <chr>    <dbl> <date>    
##  1     1 var1       1   2018-08-01
##  2     1 var2       0.2 2018-08-02
##  3    NA var3       0.3 2018-08-03
##  4     2 var1       1.4 2018-08-01
##  5     2 var2       1.9 2018-08-02
##  6     2 var3       4.1 2018-08-03
##  7     3 var1       0.1 2018-08-01
##  8     3 var2       2.8 2018-08-02
##  9     3 var3       8.9 2018-08-03
## 10     4 var1       1.7 2018-08-01
## 11    NA var2       1.9 2018-08-02
## 12     4 var3       7.6 2018-08-03

I use the base rep() function to repeat the date 4 times and then using mutate() I have added it the data frame.

Putting all these operations together:

survey_data %>% 
    fill(.direction = "down", variable = "id") %>%
    mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
    spread(variable, value)

## # A tibble: 12 x 5
##       id date        var1  var2  var3
##    <dbl> <date>     <dbl> <dbl> <dbl>
##  1     1 2018-08-01   1    NA    NA  
##  2     1 2018-08-02  NA     0.2  NA  
##  3     1 2018-08-03  NA    NA     0.3
##  4     2 2018-08-01   1.4  NA    NA  
##  5     2 2018-08-02  NA     1.9  NA  
##  6     2 2018-08-03  NA    NA     4.1
##  7     3 2018-08-01   0.1  NA    NA  
##  8     3 2018-08-02  NA     2.8  NA  
##  9     3 2018-08-03  NA    NA     8.9
## 10     4 2018-08-01   1.7  NA    NA  
## 11     4 2018-08-02  NA     1.9  NA  
## 12     4 2018-08-03  NA    NA     7.6

As you can see, this creates a lot of explicit NA values. The best would be to fill in the missing values and add the date column, and then work with that format. For example, to get the average of the values by date:

survey_data %>% 
    fill(.direction = "down", variable = "id") %>%
    mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
    group_by(date) %>%
    summarise(mean(value))

## # A tibble: 3 x 2
##   date       `mean(value)`
##   <date>             <dbl>
## 1 2018-08-01          1.05
## 2 2018-08-02          1.7 
## 3 2018-08-03          5.22

Or by date and variable:

survey_data %>% 
    fill(.direction = "down", variable = "id") %>%
    mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
    group_by(date, variable) %>%
    summarise(mean(value))

## # A tibble: 3 x 3
## # Groups:   date [?]
##   date       variable `mean(value)`
##   <date>     <chr>            <dbl>
## 1 2018-08-01 var1              1.05
## 2 2018-08-02 var2              1.7 
## 3 2018-08-03 var3              5.22

As you can see, you can chain any {tidyverse} verbs, wether they come from {dplyr} or {tidyr}.

5.4.3 Put order in your columns with `separate()`, `unite()`, and in your rows with `separate_rows()`

Sometimes, data can be in a format that makes working with it needlessly painful. For example, you get this:

survey_data_not_tidy

## # A tibble: 12 x 3
##       id variable_date   value
##    <dbl> <chr>           <dbl>
##  1     1 var1/2018-08-01   1  
##  2     1 var2/2018-08-02   0.2
##  3     1 var3/2018-08-03   0.3
##  4     2 var1/2018-08-01   1.4
##  5     2 var2/2018-08-02   1.9
##  6     2 var3/2018-08-03   4.1
##  7     3 var1/2018-08-01   0.1
##  8     3 var2/2018-08-02   2.8
##  9     3 var3/2018-08-03   8.9
## 10     4 var1/2018-08-01   1.7
## 11     4 var2/2018-08-02   1.9
## 12     4 var3/2018-08-03   7.6

Dealing with this is simple, thanks to separate():

survey_data_not_tidy %>% 
    separate(variable_date, into = c("variable", "date"), sep = "/")

## # A tibble: 12 x 4
##       id variable date       value
##    <dbl> <chr>    <chr>      <dbl>
##  1     1 var1     2018-08-01   1  
##  2     1 var2     2018-08-02   0.2
##  3     1 var3     2018-08-03   0.3
##  4     2 var1     2018-08-01   1.4
##  5     2 var2     2018-08-02   1.9
##  6     2 var3     2018-08-03   4.1
##  7     3 var1     2018-08-01   0.1
##  8     3 var2     2018-08-02   2.8
##  9     3 var3     2018-08-03   8.9
## 10     4 var1     2018-08-01   1.7
## 11     4 var2     2018-08-02   1.9
## 12     4 var3     2018-08-03   7.6

The variable_date column gets separated into two columns, variable and date. One also needs to specify the separator, in this case “/”.

unite() is the reverse operation, which can be useful when you are confronted to this situation:

survey_data2

## # A tibble: 12 x 6
##       id variable year  month day   value
##    <dbl> <chr>    <chr> <chr> <chr> <dbl>
##  1     1 var1     2018  08    01      1  
##  2     1 var2     2018  08    02      0.2
##  3     1 var3     2018  08    03      0.3
##  4     2 var1     2018  08    01      1.4
##  5     2 var2     2018  08    02      1.9
##  6     2 var3     2018  08    03      4.1
##  7     3 var1     2018  08    01      0.1
##  8     3 var2     2018  08    02      2.8
##  9     3 var3     2018  08    03      8.9
## 10     4 var1     2018  08    01      1.7
## 11     4 var2     2018  08    02      1.9
## 12     4 var3     2018  08    03      7.6

In some situation, it is better to have the date as a single column:

survey_data2 %>%
    unite(date, year, month, day, sep = "-")

## # A tibble: 12 x 4
##       id variable date       value
##    <dbl> <chr>    <chr>      <dbl>
##  1     1 var1     2018-08-01   1  
##  2     1 var2     2018-08-02   0.2
##  3     1 var3     2018-08-03   0.3
##  4     2 var1     2018-08-01   1.4
##  5     2 var2     2018-08-02   1.9
##  6     2 var3     2018-08-03   4.1
##  7     3 var1     2018-08-01   0.1
##  8     3 var2     2018-08-02   2.8
##  9     3 var3     2018-08-03   8.9
## 10     4 var1     2018-08-01   1.7
## 11     4 var2     2018-08-02   1.9
## 12     4 var3     2018-08-03   7.6

Another awful situation is the following:

survey_data_from_hell

##   id         variable         value
## 1  1             var1             1
## 2  1             var2           0.2
## 3 NA             var3           0.3
## 4  2 var1, var2, var3 1.4, 1.9, 4.1
## 5  3       var1, var2      0.1, 2.8
## 6  3             var3           8.9
## 7  4             var1           1.7
## 8 NA             var2           1.9
## 9  4             var3           7.6

separate_rows() saves the day:

survey_data_from_hell %>%
    separate_rows(variable, value)

##    id variable value
## 1   1     var1     1
## 2   1     var2   0.2
## 3  NA     var3   0.3
## 4   2     var1   1.4
## 5   2     var2   1.9
## 6   2     var3   4.1
## 7   3     var1   0.1
## 8   3     var2   2.8
## 9   3     var3   8.9
## 10  4     var1   1.7
## 11 NA     var2   1.9
## 12  4     var3   7.6

So to summarise… you can go from this:

survey_data_from_hell

##   id         variable         value
## 1  1             var1             1
## 2  1             var2           0.2
## 3 NA             var3           0.3
## 4  2 var1, var2, var3 1.4, 1.9, 4.1
## 5  3       var1, var2      0.1, 2.8
## 6  3             var3           8.9
## 7  4             var1           1.7
## 8 NA             var2           1.9
## 9  4             var3           7.6

to this:

survey_data_clean

## # A tibble: 12 x 4
##       id variable date       value
##    <dbl> <chr>    <chr>      <dbl>
##  1     1 var1     2018-08-01   1  
##  2     1 var2     2018-08-02   0.2
##  3     1 var3     2018-08-03   0.3
##  4     2 var1     2018-08-01   1.4
##  5     2 var2     2018-08-02   1.9
##  6     2 var3     2018-08-03   4.1
##  7     3 var1     2018-08-01   0.1
##  8     3 var2     2018-08-02   2.8
##  9     3 var3     2018-08-03   8.9
## 10     4 var1     2018-08-01   1.7
## 11     4 var2     2018-08-02   1.9
## 12     4 var3     2018-08-03   7.6

quite easily:

survey_data_from_hell %>% 
    separate_rows(variable, value, convert = TRUE) %>%
    fill(.direction = "down", variable = "id") %>%
    mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4))

##    id variable value       date
## 1   1     var1   1.0 2018-08-01
## 2   1     var2   0.2 2018-08-02
## 3   1     var3   0.3 2018-08-03
## 4   2     var1   1.4 2018-08-01
## 5   2     var2   1.9 2018-08-02
## 6   2     var3   4.1 2018-08-03
## 7   3     var1   0.1 2018-08-01
## 8   3     var2   2.8 2018-08-02
## 9   3     var3   8.9 2018-08-03
## 10  4     var1   1.7 2018-08-01
## 11  4     var2   1.9 2018-08-02
## 12  4     var3   7.6 2018-08-03

5.4.4 A sneak peek to Chapter 9; `nest()`

Let’s take a look at our clean survey data:

survey_data_clean

## # A tibble: 12 x 4
##       id variable date       value
##    <dbl> <chr>    <chr>      <dbl>
##  1     1 var1     2018-08-01   1  
##  2     1 var2     2018-08-02   0.2
##  3     1 var3     2018-08-03   0.3
##  4     2 var1     2018-08-01   1.4
##  5     2 var2     2018-08-02   1.9
##  6     2 var3     2018-08-03   4.1
##  7     3 var1     2018-08-01   0.1
##  8     3 var2     2018-08-02   2.8
##  9     3 var3     2018-08-03   8.9
## 10     4 var1     2018-08-01   1.7
## 11     4 var2     2018-08-02   1.9
## 12     4 var3     2018-08-03   7.6

{tidyr} has another very useful function, called nest():

survey_data_clean %>%
    nest(-id)

## # A tibble: 4 x 2
##      id data            
##   <dbl> <list>          
## 1     1 <tibble [3 × 3]>
## 2     2 <tibble [3 × 3]>
## 3     3 <tibble [3 × 3]>
## 4     4 <tibble [3 × 3]>

You can achieve the same result by using group_by() first:

survey_data_clean %>%
    group_by(id) %>%
    nest()

## # A tibble: 4 x 2
##      id data            
##   <dbl> <list>          
## 1     1 <tibble [3 × 3]>
## 2     2 <tibble [3 × 3]>
## 3     3 <tibble [3 × 3]>
## 4     4 <tibble [3 × 3]>

This cerates a new tibble object, whith two columns, one with the id column, and a new one called data. Let’s take a look at the first element of this column:

nested_survey_data <- survey_data_clean %>%
    group_by(id) %>%
    nest()

nested_survey_data$data[1]

## [[1]]
## # A tibble: 3 x 3
##   variable date       value
##   <chr>    <chr>      <dbl>
## 1 var1     2018-08-01   1  
## 2 var2     2018-08-02   0.2
## 3 var3     2018-08-03   0.3

As you can see, the first element of the data is also a tibble! You may wonder why this is useful; I will give you a small taste of what’s waiting in chapter 9. Imagine that you want to create a barplot for each id. For instance, for the first id (we are going to learn about making plots with {ggplot2} in the next chapter. For now, just follow along, even if you don’t understand everything I write):

survey_data_id1 <- survey_data_clean %>%
    filter(id == 1)

Now, let’s create the plot:

ggplot(data = survey_data_id1) +
    geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
    ggtitle("id 1")

Ok great. But now I want to create this plot for each id… so I have to copy paste this 3 times. But copy-pasting is error prone. So there are two alternatives; I either write a function that takes as argument the data and the id I want to plot and run it for each id (we will learn to do this in Chapter 8), or I can use tidyr::nest(), combined with purrr::map(). {purrr} is another very useful {tidevyrse} package, and we are going to learn about it in Chapter 9. Again, just follow along for now:

my_plots <- nested_survey_data %>%
    mutate(plot = map2(.x = id, 
                       .y = data,
                       ~ggplot(data = .y) +
                           geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
                           ggtitle(paste0("id", .x))))

This is some very advanced stuff, and again, do not worry if you don’t understand everything now. We are going to learn about this in detail in Chapter 9. Let’s go through each line. In the first line, I have started from my clean data nested_survey_data and then, using the %>% and mutate() I create a new column called plot. Inside the mutate() function, I called map2. map2 is a {purrr} function that takes three inputs: .x, .y and a function. .x is the id column from my data, and .y is the data column from nested_survey_data. The function is the ggplot I created before. Think of map2() has a loop over two lists, all while applying a function.

The following illustration, taken from Vaudor (2018) by Lise Vaudor, illustrates this perfectly:

Two inputs go in at the same type, the factory, your function, does what it has to do, and an output comes out. Forget about the factory’s chimney, which represents walk2() for now. We’ll learn about it in due time. By the way, you really should read Lise’s blog, her posts are really great and informative. It’s in French, but that’s not a problem, you know how to read R code, right? Here’s a link to her blog.

Now, let’s take a look at the result:

my_plots

## # A tibble: 4 x 3
##      id data             plot    
##   <dbl> <list>           <list>  
## 1     1 <tibble [3 × 3]> <S3: gg>
## 2     2 <tibble [3 × 3]> <S3: gg>
## 3     3 <tibble [3 × 3]> <S3: gg>
## 4     4 <tibble [3 × 3]> <S3: gg>

my_plots is a tibble with three columns: id, data and.. plot! plot is a very interesting column. It is a list, where each element is of type S3: gg. Yes, you guessed it, each element of that list-column is a ggplot! If you now want to take a look at the plots, it is enough to do this:

my_plots$plot

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Ok, that was quite complicated, but again, this was only to introduce nest() and give you a taste of the power of the {tidyverse}. By the end of Chapter 9, you’ll be used to this.

That was it for a first introduction to {tidyr}. Now, it’s time to learn about scoped verbs.

5.5 Scoped `{tidyverse}` verbs

Scoped verbs are special versions of the verbs you are now familiar with.

5.5.1 filter_*()

Let’s go back to the gasoline data from the {plm} package.

filter() is not the only filtering verb there is. Suppose that we have a condition that we want to use to filter out a lot of columns at once. For example, for every column that is of type numeric, keep only the lines where the condition value > -8 is satisfied. The next line does that:

gasoline %>% filter_if( ~all(is.numeric(.)), all_vars(. > -8))

## # A tibble: 30 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 CANADA   1972     4.89    -5.44 -1.10     -7.99
##  2 CANADA   1973     4.90    -5.41 -1.13     -7.94
##  3 CANADA   1974     4.89    -5.42 -1.12     -7.90
##  4 CANADA   1975     4.89    -5.38 -1.19     -7.87
##  5 CANADA   1976     4.84    -5.36 -1.06     -7.81
##  6 CANADA   1977     4.81    -5.34 -1.07     -7.77
##  7 CANADA   1978     4.86    -5.31 -1.07     -7.79
##  8 GERMANY  1978     3.88    -5.56 -0.628    -7.95
##  9 SWEDEN   1975     3.97    -7.68 -2.77     -7.99
## 10 SWEDEN   1976     3.98    -7.67 -2.82     -7.96
## # ... with 20 more rows

This is a bit more complicated than before, but let’s go through it, step by step. filter_if() needs 3 arguments to work; the data, a predicate function (a function that returns TRUE, or FALSE) which will select the columns we want to work on, and then the condition. The condition can be applied to all the columns that were selected by the predicate function (hence the all_vars()) or only to at least one (you’d use any_vars() then). Try to change the condition, or the predicate function, to figure out how filter_if() works. The dot is a placeholder that stands for whatever columns where selected.

Another scoped filter() verb is filter_at(); it allows the user to filter columns by position:

gasoline %>% filter_at(vars(ends_with("p")), all_vars(. > -8))

## # A tibble: 30 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 CANADA   1972     4.89    -5.44 -1.10     -7.99
##  2 CANADA   1973     4.90    -5.41 -1.13     -7.94
##  3 CANADA   1974     4.89    -5.42 -1.12     -7.90
##  4 CANADA   1975     4.89    -5.38 -1.19     -7.87
##  5 CANADA   1976     4.84    -5.36 -1.06     -7.81
##  6 CANADA   1977     4.81    -5.34 -1.07     -7.77
##  7 CANADA   1978     4.86    -5.31 -1.07     -7.79
##  8 GERMANY  1978     3.88    -5.56 -0.628    -7.95
##  9 SWEDEN   1975     3.97    -7.68 -2.77     -7.99
## 10 SWEDEN   1976     3.98    -7.67 -2.82     -7.96
## # ... with 20 more rows

we already know about ends_with() and starts_with(). So the above line means “for the columns whose name end with a ‘p’ only keep the lines where, for all the selected columns, the values are strictly superior to -8”. Again, this is not very easy the first time you deal with that, so play around with it for a bit.

Finally, there is also filter_all() which, as the name implies, uses all the variables for the filtering step.

filter_if() and filter_at() are very useful when you have very large datasets with a lot of variables and you want to apply a filtering function only to a subset of them. filter_all() is useful if, for example, you only want to keep the positive values for all the columns.

5.5.2 select_() and rename_()

Just as with filter(), select() also has scoped versions. select_if() makes it easy to select columns that satisfy a criterium:

gasoline %>% select_if(is.numeric)

## # A tibble: 342 x 5
##     year lgaspcar lincomep  lrpmg lcarpcap
##  * <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1  1960     4.17    -6.47 -0.335    -9.77
##  2  1961     4.10    -6.43 -0.351    -9.61
##  3  1962     4.07    -6.41 -0.380    -9.46
##  4  1963     4.06    -6.37 -0.414    -9.34
##  5  1964     4.04    -6.32 -0.445    -9.24
##  6  1965     4.03    -6.29 -0.497    -9.12
##  7  1966     4.05    -6.25 -0.467    -9.02
##  8  1967     4.05    -6.23 -0.506    -8.93
##  9  1968     4.05    -6.21 -0.522    -8.85
## 10  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

You can even pass a further function to select_if() that will be applied to the selected columns:

gasoline %>% select_if(is.numeric, toupper)

## # A tibble: 342 x 5
##     YEAR LGASPCAR LINCOMEP  LRPMG LCARPCAP
##  * <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1  1960     4.17    -6.47 -0.335    -9.77
##  2  1961     4.10    -6.43 -0.351    -9.61
##  3  1962     4.07    -6.41 -0.380    -9.46
##  4  1963     4.06    -6.37 -0.414    -9.34
##  5  1964     4.04    -6.32 -0.445    -9.24
##  6  1965     4.03    -6.29 -0.497    -9.12
##  7  1966     4.05    -6.25 -0.467    -9.02
##  8  1967     4.05    -6.23 -0.506    -8.93
##  9  1968     4.05    -6.21 -0.522    -8.85
## 10  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

select_at() makes it easy to select all the variables that start with “l” for example:

gasoline %>% select_at(vars(starts_with("l")))

## # A tibble: 342 x 4
##    lgaspcar lincomep  lrpmg lcarpcap
##  *    <dbl>    <dbl>  <dbl>    <dbl>
##  1     4.17    -6.47 -0.335    -9.77
##  2     4.10    -6.43 -0.351    -9.61
##  3     4.07    -6.41 -0.380    -9.46
##  4     4.06    -6.37 -0.414    -9.34
##  5     4.04    -6.32 -0.445    -9.24
##  6     4.03    -6.29 -0.497    -9.12
##  7     4.05    -6.25 -0.467    -9.02
##  8     4.05    -6.23 -0.506    -8.93
##  9     4.05    -6.21 -0.522    -8.85
## 10     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

select_at() also works if you specify the position of the columns you’re interested in:

gasoline %>% select_at(vars(c(1,2,5)))

## # A tibble: 342 x 3
##    country  year  lrpmg
##  * <fct>   <int>  <dbl>
##  1 AUSTRIA  1960 -0.335
##  2 AUSTRIA  1961 -0.351
##  3 AUSTRIA  1962 -0.380
##  4 AUSTRIA  1963 -0.414
##  5 AUSTRIA  1964 -0.445
##  6 AUSTRIA  1965 -0.497
##  7 AUSTRIA  1966 -0.467
##  8 AUSTRIA  1967 -0.506
##  9 AUSTRIA  1968 -0.522
## 10 AUSTRIA  1969 -0.559
## # ... with 332 more rows

Notice that I use a new helper function, vars(), which allows me to specify which variables I want to select. This is similar to the helper functions we have seen before, all_vars() and any_vars(). So the way to read the above line would be something like “Start with the gasoline data, then select the variables at position 1, 2, 5”. Knowing how to read your code makes it even easier to write.

There is also a select_all(), and you might be wondering why it is useful:

gasoline %>% select_all()

## # A tibble: 342 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

This simply returns all the columns of the data frame… but wait, selec_all() allows you to do something very useful, which is rename all the columns in bulk:

gasoline %>% select_all(toupper)

## # A tibble: 342 x 6
##    COUNTRY  YEAR LGASPCAR LINCOMEP  LRPMG LCARPCAP
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

You can apply any arbitrary function:

gasoline %>% select_all(funs(paste0("hello_", .)))

## # A tibble: 342 x 6
##    hello_country hello_year hello_lgaspcar hello_lincomep hello_lrpmg
##  * <fct>              <int>          <dbl>          <dbl>       <dbl>
##  1 AUSTRIA             1960           4.17          -6.47      -0.335
##  2 AUSTRIA             1961           4.10          -6.43      -0.351
##  3 AUSTRIA             1962           4.07          -6.41      -0.380
##  4 AUSTRIA             1963           4.06          -6.37      -0.414
##  5 AUSTRIA             1964           4.04          -6.32      -0.445
##  6 AUSTRIA             1965           4.03          -6.29      -0.497
##  7 AUSTRIA             1966           4.05          -6.25      -0.467
##  8 AUSTRIA             1967           4.05          -6.23      -0.506
##  9 AUSTRIA             1968           4.05          -6.21      -0.522
## 10 AUSTRIA             1969           4.05          -6.15      -0.559
## # ... with 332 more rows, and 1 more variable: hello_lcarpcap <dbl>

I have passed the paste0() function to the funs() helper function of select_all(). funs() is used a lot with scoped verbs, just like vars(), which we have already seen.

The scoped version of rename() work just as you expect. But I’ll leave these for the exercices down below.

5.5.3 group_by_*()

To illustrate group_by_*() I have to first modify the gasoline data a little bit. As you can see below, the data column is of type integer:

gasoline

## # A tibble: 342 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##  * <fct>   <int>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 AUSTRIA  1960     4.17    -6.47 -0.335    -9.77
##  2 AUSTRIA  1961     4.10    -6.43 -0.351    -9.61
##  3 AUSTRIA  1962     4.07    -6.41 -0.380    -9.46
##  4 AUSTRIA  1963     4.06    -6.37 -0.414    -9.34
##  5 AUSTRIA  1964     4.04    -6.32 -0.445    -9.24
##  6 AUSTRIA  1965     4.03    -6.29 -0.497    -9.12
##  7 AUSTRIA  1966     4.05    -6.25 -0.467    -9.02
##  8 AUSTRIA  1967     4.05    -6.23 -0.506    -8.93
##  9 AUSTRIA  1968     4.05    -6.21 -0.522    -8.85
## 10 AUSTRIA  1969     4.05    -6.15 -0.559    -8.79
## # ... with 332 more rows

Let’s change that to character:

gasoline <- gasoline %>%
    mutate(year = as.character(year))

Now, this allows me to do the following:

gasoline %>%
    group_by_if(is.character) %>%
    summarise(mean(lincomep))

## # A tibble: 19 x 2
##    year  `mean(lincomep)`
##    <chr>            <dbl>
##  1 1960             -6.50
##  2 1961             -6.46
##  3 1962             -6.42
##  4 1963             -6.37
##  5 1964             -6.33
##  6 1965             -6.29
##  7 1966             -6.25
##  8 1967             -6.21
##  9 1968             -6.18
## 10 1969             -6.12
## 11 1970             -6.07
## 12 1971             -6.04
## 13 1972             -5.99
## 14 1973             -5.94
## 15 1974             -5.93
## 16 1975             -5.93
## 17 1976             -5.89
## 18 1977             -5.87
## 19 1978             -5.84

This is faster than having to write:

gasoline %>%
    group_by(country, year) %>%
    summarise(mean(lincomep))

## # A tibble: 342 x 3
## # Groups:   country [?]
##    country year  `mean(lincomep)`
##    <fct>   <chr>            <dbl>
##  1 AUSTRIA 1960             -6.47
##  2 AUSTRIA 1961             -6.43
##  3 AUSTRIA 1962             -6.41
##  4 AUSTRIA 1963             -6.37
##  5 AUSTRIA 1964             -6.32
##  6 AUSTRIA 1965             -6.29
##  7 AUSTRIA 1966             -6.25
##  8 AUSTRIA 1967             -6.23
##  9 AUSTRIA 1968             -6.21
## 10 AUSTRIA 1969             -6.15
## # ... with 332 more rows

You may think that having two write the name of two variables is not a huge hassle, which is true. But imagine that you have dozens of character columns that you want to group by. This is the kind of situation where group_by_if() is very useful. If you prefer, you can specify the position of the columns instead of the name of the columns, with group_by_at():

gasoline %>%
    group_by_at(vars(1, 2)) %>%
    summarise(mean(lincomep))

## # A tibble: 342 x 3
## # Groups:   country [?]
##    country year  `mean(lincomep)`
##    <fct>   <chr>            <dbl>
##  1 AUSTRIA 1960             -6.47
##  2 AUSTRIA 1961             -6.43
##  3 AUSTRIA 1962             -6.41
##  4 AUSTRIA 1963             -6.37
##  5 AUSTRIA 1964             -6.32
##  6 AUSTRIA 1965             -6.29
##  7 AUSTRIA 1966             -6.25
##  8 AUSTRIA 1967             -6.23
##  9 AUSTRIA 1968             -6.21
## 10 AUSTRIA 1969             -6.15
## # ... with 332 more rows

There is also a group_by_all(), but I fail to see the use case…

5.5.4 summarise_()

Just like for filter(), select(), and group_by(), summarise()` comes with scoped versions:

gasoline %>%
  group_by(country) %>%
  summarise_at(vars(starts_with("l")), mean)

## # A tibble: 18 x 5
##    country  lgaspcar lincomep   lrpmg lcarpcap
##    <fct>       <dbl>    <dbl>   <dbl>    <dbl>
##  1 AUSTRIA      4.06    -6.12 -0.486     -8.85
##  2 BELGIUM      3.92    -5.85 -0.326     -8.63
##  3 CANADA       4.86    -5.58 -1.05      -8.08
##  4 DENMARK      4.19    -5.76 -0.358     -8.58
##  5 FRANCE       3.82    -5.87 -0.253     -8.45
##  6 GERMANY      3.89    -5.85 -0.517     -8.51
##  7 GREECE       4.88    -6.61 -0.0339   -10.8 
##  8 IRELAND      4.23    -6.44 -0.348     -9.04
##  9 ITALY        3.73    -6.35 -0.152     -8.83
## 10 JAPAN        4.70    -6.25 -0.287     -9.95
## 11 NETHERLA     4.08    -5.92 -0.370     -8.82
## 12 NORWAY       4.11    -5.75 -0.278     -8.77
## 13 SPAIN        4.06    -5.63  0.739     -9.90
## 14 SWEDEN       4.01    -7.82 -2.71      -8.25
## 15 SWITZERL     4.24    -5.93 -0.902     -8.54
## 16 TURKEY       5.77    -7.34 -0.422    -12.5 
## 17 U.K.         3.98    -6.02 -0.459     -8.55
## 18 U.S.A.       4.82    -5.45 -1.21      -7.78

See how I managed to summarise every variable in one simple call to summarise_at()? Simply by using vars() and specifying that I was interested in the ones that started with “l” and then I specified the function I wanted. But what if I wanted to use more than one function to summarise the data? Very easy:

gasoline %>%
  group_by(country) %>%
  summarise_at(vars(starts_with("l")), funs(mean, sd, max, min))

## # A tibble: 18 x 17
##    country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
##    <fct>           <dbl>         <dbl>      <dbl>         <dbl>       <dbl>
##  1 AUSTRIA          4.06         -6.12    -0.486          -8.85      0.0693
##  2 BELGIUM          3.92         -5.85    -0.326          -8.63      0.103 
##  3 CANADA           4.86         -5.58    -1.05           -8.08      0.0262
##  4 DENMARK          4.19         -5.76    -0.358          -8.58      0.158 
##  5 FRANCE           3.82         -5.87    -0.253          -8.45      0.0499
##  6 GERMANY          3.89         -5.85    -0.517          -8.51      0.0239
##  7 GREECE           4.88         -6.61    -0.0339        -10.8       0.255 
##  8 IRELAND          4.23         -6.44    -0.348          -9.04      0.0437
##  9 ITALY            3.73         -6.35    -0.152          -8.83      0.220 
## 10 JAPAN            4.70         -6.25    -0.287          -9.95      0.684 
## 11 NETHER…          4.08         -5.92    -0.370          -8.82      0.286 
## 12 NORWAY           4.11         -5.75    -0.278          -8.77      0.123 
## 13 SPAIN            4.06         -5.63     0.739          -9.90      0.317 
## 14 SWEDEN           4.01         -7.82    -2.71           -8.25      0.0364
## 15 SWITZE…          4.24         -5.93    -0.902          -8.54      0.102 
## 16 TURKEY           5.77         -7.34    -0.422         -12.5       0.329 
## 17 U.K.             3.98         -6.02    -0.459          -8.55      0.0479
## 18 U.S.A.           4.82         -5.45    -1.21           -7.78      0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## #   lcarpcap_sd <dbl>, lgaspcar_max <dbl>, lincomep_max <dbl>,
## #   lrpmg_max <dbl>, lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## #   lincomep_min <dbl>, lrpmg_min <dbl>, lcarpcap_min <dbl>

Just use vars() to specify the variables you want to summarise, and then funs() to list all the functions you want to use on each of the columns.

But maybe you’re just interested in descriptive statistics for some variables, but not all those that start with “l”? What if you want to use another pattern? Easy to do with the contains() helper:

gasoline %>%
  group_by(country) %>%
  summarise_at(vars(dplyr::contains("car")), funs(mean, sd, max, min))

## # A tibble: 18 x 9
##    country lgaspcar_mean lcarpcap_mean lgaspcar_sd lcarpcap_sd lgaspcar_max
##    <fct>           <dbl>         <dbl>       <dbl>       <dbl>        <dbl>
##  1 AUSTRIA          4.06         -8.85      0.0693       0.473         4.20
##  2 BELGIUM          3.92         -8.63      0.103        0.417         4.16
##  3 CANADA           4.86         -8.08      0.0262       0.195         4.90
##  4 DENMARK          4.19         -8.58      0.158        0.349         4.50
##  5 FRANCE           3.82         -8.45      0.0499       0.344         3.91
##  6 GERMANY          3.89         -8.51      0.0239       0.406         3.93
##  7 GREECE           4.88        -10.8       0.255        0.839         5.38
##  8 IRELAND          4.23         -9.04      0.0437       0.345         4.33
##  9 ITALY            3.73         -8.83      0.220        0.639         4.05
## 10 JAPAN            4.70         -9.95      0.684        1.20          6.00
## 11 NETHER…          4.08         -8.82      0.286        0.617         4.65
## 12 NORWAY           4.11         -8.77      0.123        0.438         4.44
## 13 SPAIN            4.06         -9.90      0.317        0.960         4.75
## 14 SWEDEN           4.01         -8.25      0.0364       0.242         4.07
## 15 SWITZE…          4.24         -8.54      0.102        0.378         4.44
## 16 TURKEY           5.77        -12.5       0.329        0.751         6.16
## 17 U.K.             3.98         -8.55      0.0479       0.281         4.10
## 18 U.S.A.           4.82         -7.78      0.0219       0.162         4.86
## # ... with 3 more variables: lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## #   lcarpcap_min <dbl>

I used dplyr::contains() instead of simply contains() because there’s also a purrr::contains(). If you load purrr after dplyr, contains() will actually be purrr::contains() and not dplyr::contains() which causes the above code to fail.

There’s also summarise_if():

gasoline %>%
  group_by(country) %>%
  summarise_if(is.double, funs(mean, sd, min, max))

## # A tibble: 18 x 17
##    country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
##    <fct>           <dbl>         <dbl>      <dbl>         <dbl>       <dbl>
##  1 AUSTRIA          4.06         -6.12    -0.486          -8.85      0.0693
##  2 BELGIUM          3.92         -5.85    -0.326          -8.63      0.103 
##  3 CANADA           4.86         -5.58    -1.05           -8.08      0.0262
##  4 DENMARK          4.19         -5.76    -0.358          -8.58      0.158 
##  5 FRANCE           3.82         -5.87    -0.253          -8.45      0.0499
##  6 GERMANY          3.89         -5.85    -0.517          -8.51      0.0239
##  7 GREECE           4.88         -6.61    -0.0339        -10.8       0.255 
##  8 IRELAND          4.23         -6.44    -0.348          -9.04      0.0437
##  9 ITALY            3.73         -6.35    -0.152          -8.83      0.220 
## 10 JAPAN            4.70         -6.25    -0.287          -9.95      0.684 
## 11 NETHER…          4.08         -5.92    -0.370          -8.82      0.286 
## 12 NORWAY           4.11         -5.75    -0.278          -8.77      0.123 
## 13 SPAIN            4.06         -5.63     0.739          -9.90      0.317 
## 14 SWEDEN           4.01         -7.82    -2.71           -8.25      0.0364
## 15 SWITZE…          4.24         -5.93    -0.902          -8.54      0.102 
## 16 TURKEY           5.77         -7.34    -0.422         -12.5       0.329 
## 17 U.K.             3.98         -6.02    -0.459          -8.55      0.0479
## 18 U.S.A.           4.82         -5.45    -1.21           -7.78      0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## #   lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## #   lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## #   lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>

This allows you to summarise every column that contain real numbers. The difference between is.double() and is.numeric() is that is.numeric() returns TRUE for integers too, whereas is.double() returns TRUE for real numbers only (integers are real numbers too, but you know what I mean). To go faster, you can also use summarise_all():

gasoline %>%
  select(-year) %>%
  group_by(country) %>%
  summarise_all(funs(mean, sd, min, max))

## # A tibble: 18 x 17
##    country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
##    <fct>           <dbl>         <dbl>      <dbl>         <dbl>       <dbl>
##  1 AUSTRIA          4.06         -6.12    -0.486          -8.85      0.0693
##  2 BELGIUM          3.92         -5.85    -0.326          -8.63      0.103 
##  3 CANADA           4.86         -5.58    -1.05           -8.08      0.0262
##  4 DENMARK          4.19         -5.76    -0.358          -8.58      0.158 
##  5 FRANCE           3.82         -5.87    -0.253          -8.45      0.0499
##  6 GERMANY          3.89         -5.85    -0.517          -8.51      0.0239
##  7 GREECE           4.88         -6.61    -0.0339        -10.8       0.255 
##  8 IRELAND          4.23         -6.44    -0.348          -9.04      0.0437
##  9 ITALY            3.73         -6.35    -0.152          -8.83      0.220 
## 10 JAPAN            4.70         -6.25    -0.287          -9.95      0.684 
## 11 NETHER…          4.08         -5.92    -0.370          -8.82      0.286 
## 12 NORWAY           4.11         -5.75    -0.278          -8.77      0.123 
## 13 SPAIN            4.06         -5.63     0.739          -9.90      0.317 
## 14 SWEDEN           4.01         -7.82    -2.71           -8.25      0.0364
## 15 SWITZE…          4.24         -5.93    -0.902          -8.54      0.102 
## 16 TURKEY           5.77         -7.34    -0.422         -12.5       0.329 
## 17 U.K.             3.98         -6.02    -0.459          -8.55      0.0479
## 18 U.S.A.           4.82         -5.45    -1.21           -7.78      0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## #   lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## #   lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## #   lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>

I removed the year variable because it’s not a variable for which we want to have descriptive statistics.

5.5.5 mutate_*()

Of course, mutate() and transmute() also come with scoped versions:

gasoline %>%
  mutate_if(is.double, exp)

## # A tibble: 342 x 6
##    country year  lgaspcar lincomep lrpmg  lcarpcap
##    <fct>   <chr>    <dbl>    <dbl> <dbl>     <dbl>
##  1 AUSTRIA 1960      64.9  0.00154 0.716 0.0000573
##  2 AUSTRIA 1961      60.4  0.00162 0.704 0.0000671
##  3 AUSTRIA 1962      58.7  0.00165 0.684 0.0000781
##  4 AUSTRIA 1963      57.9  0.00171 0.661 0.0000876
##  5 AUSTRIA 1964      56.7  0.00180 0.641 0.0000973
##  6 AUSTRIA 1965      56.5  0.00185 0.608 0.000109 
##  7 AUSTRIA 1966      57.3  0.00193 0.627 0.000121 
##  8 AUSTRIA 1967      57.6  0.00196 0.603 0.000132 
##  9 AUSTRIA 1968      57.1  0.00202 0.593 0.000144 
## 10 AUSTRIA 1969      57.2  0.00213 0.572 0.000152 
## # ... with 332 more rows

gasoline %>%
  mutate_at(vars(starts_with("l")), exp)

## # A tibble: 342 x 6
##    country year  lgaspcar lincomep lrpmg  lcarpcap
##    <fct>   <chr>    <dbl>    <dbl> <dbl>     <dbl>
##  1 AUSTRIA 1960      64.9  0.00154 0.716 0.0000573
##  2 AUSTRIA 1961      60.4  0.00162 0.704 0.0000671
##  3 AUSTRIA 1962      58.7  0.00165 0.684 0.0000781
##  4 AUSTRIA 1963      57.9  0.00171 0.661 0.0000876
##  5 AUSTRIA 1964      56.7  0.00180 0.641 0.0000973
##  6 AUSTRIA 1965      56.5  0.00185 0.608 0.000109 
##  7 AUSTRIA 1966      57.3  0.00193 0.627 0.000121 
##  8 AUSTRIA 1967      57.6  0.00196 0.603 0.000132 
##  9 AUSTRIA 1968      57.1  0.00202 0.593 0.000144 
## 10 AUSTRIA 1969      57.2  0.00213 0.572 0.000152 
## # ... with 332 more rows

gasoline %>%
  mutate_all(as.character)

## # A tibble: 342 x 6
##    country year  lgaspcar     lincomep     lrpmg        lcarpcap    
##    <chr>   <chr> <chr>        <chr>        <chr>        <chr>       
##  1 AUSTRIA 1960  4.173244195  -6.474277179 -0.334547613 -9.766839569
##  2 AUSTRIA 1961  4.1009891049 -6.426005835 -0.351327614 -9.608621845
##  3 AUSTRIA 1962  4.0731765511 -6.407308295 -0.379517692 -9.457256552
##  4 AUSTRIA 1963  4.0595091239 -6.370678539 -0.414251392 -9.343154947
##  5 AUSTRIA 1964  4.037688787  -6.322246805 -0.445335362 -9.237739346
##  6 AUSTRIA 1965  4.033983285  -6.294667914 -0.497060662 -9.123903477
##  7 AUSTRIA 1966  4.0475365589 -6.252545451 -0.466837731 -9.019822048
##  8 AUSTRIA 1967  4.0529106939 -6.234580709 -0.505883405 -8.934402537
##  9 AUSTRIA 1968  4.045507048  -6.206894403 -0.522412545 -8.847967407
## 10 AUSTRIA 1969  4.0463547891 -6.153139668 -0.559110514 -8.788686207
## # ... with 332 more rows

I think that by now, you are able to understand the above lines quite easily. If not, try some other functions, or mutating other variables and you’ll see that it is not that complicated!

5.6 Other useful `{tidyverse}` functions

5.6.1 `if_else()`, `case_when()` and `recode()`

Some other very useful {tidyverse} functions are if_else() and case_when. These two functions, combined with mutate() make it easy to create a new variable whose values must respect certain conditions. For instance, we might want to have a dummy that equals 1 if a country in the European Union (to simplify, say as of 2017) and 0 if not. First let’s create a list of countries that are in the EU:

eu_countries <- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus",
                  "czech republic", "denmark", "estonia", "finland", "france", "germany",
                  "greece", "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg",
                  "malta", "netherla", "poland", "portugal", "romania", "slovakia", "slovenia",
                  "spain", "sweden", "u.k.")

I’ve had to change “netherlands” to “netherla” because that’s how the country is called in the gasoline data. Now let’s create a dummy variable that equals 1 for EU countries, and 0 for the others:

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(in_eu = if_else(country %in% eu_countries, 1, 0))

## # A tibble: 342 x 7
##    country  year lgaspcar lincomep  lrpmg lcarpcap in_eu
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <dbl>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77     1
##  2 austria  1961     4.10    -6.43 -0.351    -9.61     1
##  3 austria  1962     4.07    -6.41 -0.380    -9.46     1
##  4 austria  1963     4.06    -6.37 -0.414    -9.34     1
##  5 austria  1964     4.04    -6.32 -0.445    -9.24     1
##  6 austria  1965     4.03    -6.29 -0.497    -9.12     1
##  7 austria  1966     4.05    -6.25 -0.467    -9.02     1
##  8 austria  1967     4.05    -6.23 -0.506    -8.93     1
##  9 austria  1968     4.05    -6.21 -0.522    -8.85     1
## 10 austria  1969     4.05    -6.15 -0.559    -8.79     1
## # ... with 332 more rows

Instead of 1 and 0, we can of course use strings (I add filter(year == 1960) at the end to have a better view of what happened):

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(in_eu = if_else(country %in% eu_countries, "yes", "no")) %>%
  filter(year == 1960)

## # A tibble: 18 x 7
##    country   year lgaspcar lincomep   lrpmg lcarpcap in_eu
##    <chr>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl> <chr>
##  1 austria   1960     4.17    -6.47 -0.335     -9.77 yes  
##  2 belgium   1960     4.16    -6.22 -0.166     -9.41 yes  
##  3 canada    1960     4.86    -5.89 -0.972     -8.38 no   
##  4 denmark   1960     4.50    -6.06 -0.196     -9.33 yes  
##  5 france    1960     3.91    -6.26 -0.0196    -9.15 yes  
##  6 germany   1960     3.92    -6.16 -0.186     -9.34 yes  
##  7 greece    1960     5.04    -7.16 -0.0835   -12.2  yes  
##  8 ireland   1960     4.27    -6.72 -0.0765    -9.70 yes  
##  9 italy     1960     4.05    -6.73  0.165    -10.1  yes  
## 10 japan     1960     6.00    -6.99 -0.145    -12.2  no   
## 11 netherla  1960     4.65    -6.22 -0.201    -10.00 yes  
## 12 norway    1960     4.44    -6.09 -0.140     -9.68 no   
## 13 spain     1960     4.75    -6.17  1.13     -11.6  yes  
## 14 sweden    1960     4.06    -8.07 -2.52      -8.74 yes  
## 15 switzerl  1960     4.40    -6.16 -0.823     -9.26 no   
## 16 turkey    1960     6.13    -7.80 -0.253    -13.5  no   
## 17 u.k.      1960     4.10    -6.19 -0.391     -9.12 yes  
## 18 u.s.a.    1960     4.82    -5.70 -1.12      -8.02 no

I think that if_else() is fairly straightforward, especially if you know ifelse() already. You might be wondering what is the difference between these two. if_else() is stricter than ifelse() and does not do type conversion. Compare the two next lines:

ifelse(1 == 1, "0", 1)

## [1] "0"

if_else(1 == 1, "0", 1)

Error: `false` must be type string, not double

Type conversion, especially without a warning is very dangerous. if_else()’s behaviour which consists in failing as soon as possble avoids a lot of pain and suffering, especially when programming non-interactively.

if_else() also accepts an optional argument, that allows you to specify what should be returned in case of NA:

if_else(1 <= NA, 0, 1, 999)

## [1] 999

# Or
if_else(1 <= NA, 0, 1, NA_real_)

## [1] NA

case_when() can be seen as a generalization of if_else(). Whenever you want to use multiple if_else()s, that’s when you know you should use case_when() (I’m adding the filter at the end for the same reason as before, to see the output better):

gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(region = case_when(
           country %in% c("france", "italy", "turkey", "greece", "spain") ~ "mediterranean",
           country %in% c("germany", "austria", "switzerl", "belgium", "netherla") ~ "central europe",
           country %in% c("canada", "u.s.a.", "u.k.", "ireland") ~ "anglosphere",
           country %in% c("denmark", "norway", "sweden") ~ "nordic",
           country %in% c("japan") ~ "asia")) %>%
  filter(year == 1960)

## # A tibble: 18 x 7
##    country   year lgaspcar lincomep   lrpmg lcarpcap region        
##    <chr>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl> <chr>         
##  1 austria   1960     4.17    -6.47 -0.335     -9.77 central europe
##  2 belgium   1960     4.16    -6.22 -0.166     -9.41 central europe
##  3 canada    1960     4.86    -5.89 -0.972     -8.38 anglosphere   
##  4 denmark   1960     4.50    -6.06 -0.196     -9.33 nordic        
##  5 france    1960     3.91    -6.26 -0.0196    -9.15 mediterranean 
##  6 germany   1960     3.92    -6.16 -0.186     -9.34 central europe
##  7 greece    1960     5.04    -7.16 -0.0835   -12.2  mediterranean 
##  8 ireland   1960     4.27    -6.72 -0.0765    -9.70 anglosphere   
##  9 italy     1960     4.05    -6.73  0.165    -10.1  mediterranean 
## 10 japan     1960     6.00    -6.99 -0.145    -12.2  asia          
## 11 netherla  1960     4.65    -6.22 -0.201    -10.00 central europe
## 12 norway    1960     4.44    -6.09 -0.140     -9.68 nordic        
## 13 spain     1960     4.75    -6.17  1.13     -11.6  mediterranean 
## 14 sweden    1960     4.06    -8.07 -2.52      -8.74 nordic        
## 15 switzerl  1960     4.40    -6.16 -0.823     -9.26 central europe
## 16 turkey    1960     6.13    -7.80 -0.253    -13.5  mediterranean 
## 17 u.k.      1960     4.10    -6.19 -0.391     -9.12 anglosphere   
## 18 u.s.a.    1960     4.82    -5.70 -1.12      -8.02 anglosphere

If all you want is to recode values, you can use recode(). For example, the Netherlands is written as “NETHERLA” in the gasoline data, which is quite ugly. Same for Switzerland:

gasoline <- gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(country = recode(country, "netherla" = "netherlands", "switzerl" = "switzerland"))

I saved the data with these changes as they will become useful in the future. Let’s take a look at the data:

gasoline %>%
  filter(country %in% c("netherlands", "switzerland"), year == 1960)

## # A tibble: 2 x 6
##   country      year lgaspcar lincomep  lrpmg lcarpcap
##   <chr>       <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
## 1 netherlands  1960     4.65    -6.22 -0.201   -10.00
## 2 switzerland  1960     4.40    -6.16 -0.823    -9.26

5.6.2 `lead()` and `lag()`

lead() and lag() are especially useful in econometrics. When I was doing my masters, in 4 B.d. (Before dplyr) lagging variables in panel data was quite tricky. Now, with dplyr it’s really very easy:

gasoline %>%
  group_by(country) %>%
  mutate(lag_lgaspcar = lag(lgaspcar)) %>%
  mutate(lead_lgaspcar = lead(lgaspcar)) %>%
  filter(year %in% seq(1960, 1963))

## # A tibble: 72 x 8
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap lag_lgaspcar
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl>        <dbl>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77        NA   
##  2 austria  1961     4.10    -6.43 -0.351    -9.61         4.17
##  3 austria  1962     4.07    -6.41 -0.380    -9.46         4.10
##  4 austria  1963     4.06    -6.37 -0.414    -9.34         4.07
##  5 belgium  1960     4.16    -6.22 -0.166    -9.41        NA   
##  6 belgium  1961     4.12    -6.18 -0.172    -9.30         4.16
##  7 belgium  1962     4.08    -6.13 -0.222    -9.22         4.12
##  8 belgium  1963     4.00    -6.09 -0.250    -9.11         4.08
##  9 canada   1960     4.86    -5.89 -0.972    -8.38        NA   
## 10 canada   1961     4.83    -5.88 -0.972    -8.35         4.86
## # ... with 62 more rows, and 1 more variable: lead_lgaspcar <dbl>

To lag every variable, remember that you can use mutate_if():

gasoline %>%
  group_by(country) %>%
  mutate_if(is.double, lag) %>%
  filter(year %in% seq(1960, 1963))

## # A tibble: 72 x 6
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77
##  2 austria  1961     4.10    -6.43 -0.351    -9.61
##  3 austria  1962     4.07    -6.41 -0.380    -9.46
##  4 austria  1963     4.06    -6.37 -0.414    -9.34
##  5 belgium  1960     4.16    -6.22 -0.166    -9.41
##  6 belgium  1961     4.12    -6.18 -0.172    -9.30
##  7 belgium  1962     4.08    -6.13 -0.222    -9.22
##  8 belgium  1963     4.00    -6.09 -0.250    -9.11
##  9 canada   1960     4.86    -5.89 -0.972    -8.38
## 10 canada   1961     4.83    -5.88 -0.972    -8.35
## # ... with 62 more rows

you can replace lag() with lead(), but just keep in mind that the columns get transformed in place.

5.6.3 `ntile()`

The last helper function I will discuss is ntile(). There are some other, so do read mutate()’s documentation with help(mutate)!

If you need quantiles, you need ntile(). Let’s see how it works:

gasoline %>%
  mutate(quintile = ntile(lgaspcar, 5)) %>%
  mutate(decile = ntile(lgaspcar, 10)) %>%
  select(country, year, lgaspcar, quintile, decile)

## # A tibble: 342 x 5
##    country  year lgaspcar quintile decile
##    <chr>   <dbl>    <dbl>    <int>  <int>
##  1 austria  1960     4.17        3      6
##  2 austria  1961     4.10        3      6
##  3 austria  1962     4.07        3      5
##  4 austria  1963     4.06        3      5
##  5 austria  1964     4.04        3      5
##  6 austria  1965     4.03        3      5
##  7 austria  1966     4.05        3      5
##  8 austria  1967     4.05        3      5
##  9 austria  1968     4.05        3      5
## 10 austria  1969     4.05        3      5
## # ... with 332 more rows

quintile and decile do not hold the values but the quantile the value lies in. If you want to have a column that contains the median for instance, you can use good ol’ quantile():

gasoline %>%
  group_by(country) %>%
  mutate(median = quantile(lgaspcar, 0.5)) %>% # quantile(x, 0.5) is equivalent to median(x)
  filter(year == 1960) %>%
  select(country, year, median)

## # A tibble: 18 x 3
## # Groups:   country [18]
##    country      year median
##    <chr>       <dbl>  <dbl>
##  1 austria      1960   4.05
##  2 belgium      1960   3.88
##  3 canada       1960   4.86
##  4 denmark      1960   4.16
##  5 france       1960   3.81
##  6 germany      1960   3.89
##  7 greece       1960   4.89
##  8 ireland      1960   4.22
##  9 italy        1960   3.74
## 10 japan        1960   4.52
## 11 netherlands  1960   3.99
## 12 norway       1960   4.08
## 13 spain        1960   3.99
## 14 sweden       1960   4.00
## 15 switzerland  1960   4.26
## 16 turkey       1960   5.72
## 17 u.k.         1960   3.98
## 18 u.s.a.       1960   4.81

5.6.4 `arrange()`

arrange() re-orders the whole tibble according to values of the supplied variable:

gasoline %>%
  arrange(lgaspcar)

## # A tibble: 342 x 6
##    country  year lgaspcar lincomep   lrpmg lcarpcap
##    <chr>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
##  1 italy    1977     3.38    -6.10  0.164     -8.15
##  2 italy    1978     3.39    -6.08  0.0348    -8.11
##  3 italy    1976     3.43    -6.12  0.103     -8.17
##  4 italy    1974     3.50    -6.13 -0.223     -8.26
##  5 italy    1975     3.52    -6.17 -0.0327    -8.22
##  6 spain    1978     3.62    -5.29  0.621     -8.63
##  7 italy    1972     3.63    -6.21 -0.215     -8.38
##  8 italy    1971     3.65    -6.22 -0.148     -8.47
##  9 spain    1977     3.65    -5.30  0.526     -8.73
## 10 italy    1973     3.65    -6.16 -0.325     -8.32
## # ... with 332 more rows

If you want to re-order the tibble in descending order of the variable:

gasoline %>%
  arrange(desc(lgaspcar))

## # A tibble: 342 x 6
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 turkey   1966     6.16    -7.51 -0.356    -13.0
##  2 turkey   1960     6.13    -7.80 -0.253    -13.5
##  3 turkey   1961     6.11    -7.79 -0.343    -13.4
##  4 turkey   1962     6.08    -7.84 -0.408    -13.2
##  5 turkey   1968     6.08    -7.42 -0.365    -12.8
##  6 turkey   1963     6.08    -7.63 -0.225    -13.3
##  7 turkey   1964     6.06    -7.63 -0.252    -13.2
##  8 turkey   1967     6.04    -7.46 -0.335    -12.8
##  9 japan    1960     6.00    -6.99 -0.145    -12.2
## 10 turkey   1965     5.82    -7.62 -0.293    -12.9
## # ... with 332 more rows

arrange’s documentation alerts the user that re-ording by group is only possible by explicitely specifying an option:

gasoline %>%
  filter(year %in% seq(1960, 1963)) %>%
  group_by(country) %>%
  arrange(desc(lgaspcar), .by_group = TRUE)

## # A tibble: 72 x 6
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77
##  2 austria  1961     4.10    -6.43 -0.351    -9.61
##  3 austria  1962     4.07    -6.41 -0.380    -9.46
##  4 austria  1963     4.06    -6.37 -0.414    -9.34
##  5 belgium  1960     4.16    -6.22 -0.166    -9.41
##  6 belgium  1961     4.12    -6.18 -0.172    -9.30
##  7 belgium  1962     4.08    -6.13 -0.222    -9.22
##  8 belgium  1963     4.00    -6.09 -0.250    -9.11
##  9 canada   1960     4.86    -5.89 -0.972    -8.38
## 10 canada   1962     4.85    -5.84 -0.979    -8.32
## # ... with 62 more rows

This is especially useful for plotting. We’ll see this in Chapter 6.

5.6.5 `tally()` and `count()`

tally() and count() count the number of observations in your data. I believe count() is the more useful of the two, as it counts the number of observations within a group that you can provide:

gasoline %>%
  count(country)

## # A tibble: 18 x 2
##    country         n
##    <chr>       <int>
##  1 austria        19
##  2 belgium        19
##  3 canada         19
##  4 denmark        19
##  5 france         19
##  6 germany        19
##  7 greece         19
##  8 ireland        19
##  9 italy          19
## 10 japan          19
## 11 netherlands    19
## 12 norway         19
## 13 spain          19
## 14 sweden         19
## 15 switzerland    19
## 16 turkey         19
## 17 u.k.           19
## 18 u.s.a.         19

There’s also add_count() which adds the column to the data:

gasoline %>%
  add_count(country)

## # A tibble: 342 x 7
##    country  year lgaspcar lincomep  lrpmg lcarpcap     n
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <int>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77    19
##  2 austria  1961     4.10    -6.43 -0.351    -9.61    19
##  3 austria  1962     4.07    -6.41 -0.380    -9.46    19
##  4 austria  1963     4.06    -6.37 -0.414    -9.34    19
##  5 austria  1964     4.04    -6.32 -0.445    -9.24    19
##  6 austria  1965     4.03    -6.29 -0.497    -9.12    19
##  7 austria  1966     4.05    -6.25 -0.467    -9.02    19
##  8 austria  1967     4.05    -6.23 -0.506    -8.93    19
##  9 austria  1968     4.05    -6.21 -0.522    -8.85    19
## 10 austria  1969     4.05    -6.15 -0.559    -8.79    19
## # ... with 332 more rows

add_count() is a shortcut for the following code:

gasoline %>%
  group_by(country) %>%
  mutate(n = n())

## # A tibble: 342 x 7
## # Groups:   country [18]
##    country  year lgaspcar lincomep  lrpmg lcarpcap     n
##    <chr>   <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <int>
##  1 austria  1960     4.17    -6.47 -0.335    -9.77    19
##  2 austria  1961     4.10    -6.43 -0.351    -9.61    19
##  3 austria  1962     4.07    -6.41 -0.380    -9.46    19
##  4 austria  1963     4.06    -6.37 -0.414    -9.34    19
##  5 austria  1964     4.04    -6.32 -0.445    -9.24    19
##  6 austria  1965     4.03    -6.29 -0.497    -9.12    19
##  7 austria  1966     4.05    -6.25 -0.467    -9.02    19
##  8 austria  1967     4.05    -6.23 -0.506    -8.93    19
##  9 austria  1968     4.05    -6.21 -0.522    -8.85    19
## 10 austria  1969     4.05    -6.15 -0.559    -8.79    19
## # ... with 332 more rows

where n() is a dplyr function that can only be used within summarise(), mutate() and filter().

5.7 Special packages for special kinds of data: `{forcats}`, `{lubridate}`, and `{stringr}`

5.7.1 🐈🐈🐈🐈

Factor variables are very useful but not very easy to manipulate. forcats contains very useful functions that make working on factor variables painless. In my opinion, the four following functions, fct_recode(), fct_relevel(), fct_reorder() and fct_relabel(), are the ones you must know, so that’s what I’ll be showing.

Remember in chapter 3 when I very quickly explained what were factor variables? In this section, we are going to work a little bit with these type of variable. factors are very useful, and the forcats package includes some handy functions to work with them. First, let’s load the forcats package:

library(forcats)

as an example, we are going to work with the gss_cat dataset that is included in forcats. Let’s load the data:

data(gss_cat)

head(gss_cat)

## # A tibble: 6 x 9
##    year marital     age race  rincome   partyid    relig   denom   tvhours
##   <int> <fct>     <int> <fct> <fct>     <fct>      <fct>   <fct>     <int>
## 1  2000 Never ma…    26 White $8000 to… Ind,near … Protes… Southe…      12
## 2  2000 Divorced     48 White $8000 to… Not str r… Protes… Baptis…      NA
## 3  2000 Widowed      67 White Not appl… Independe… Protes… No den…       2
## 4  2000 Never ma…    39 White Not appl… Ind,near … Orthod… Not ap…       4
## 5  2000 Divorced     25 White Not appl… Not str d… None    Not ap…       1
## 6  2000 Married      25 White $20000 -… Strong de… Protes… Southe…      NA

as you can see, marital, race, rincome and partyid are all factor variables. Let’s take a closer look at marital:

str(gss_cat$marital)

##  Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...

and let’s see rincome:

str(gss_cat$rincome)

##  Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...

factor variables have different levels and the forcats package includes functions that allow you to recode, collapse and do all sorts of things on these levels. For example , using forcats::fct_recode() you can recode levels:

gss_cat <- gss_cat %>%
  mutate(marital = fct_recode(marital,
                              refuse = "No answer",
                              never_married = "Never married",
                              divorced = "Separated",
                              divorced = "Divorced",
                              widowed = "Widowed",
                              married = "Married"))

gss_cat %>%
  tabyl(marital)

##        marital     n      percent
##         refuse    17 0.0007913234
##  never_married  5416 0.2521063166
##       divorced  4126 0.1920588372
##        widowed  1807 0.0841130196
##        married 10117 0.4709305032

Using fct_recode(), I was able to recode the levels and collapse Separated and Divorced to a single category called divorced. As you can see, refuse and widowed are less than 10%, so maybe you’d want to lump these categories together:

gss_cat <- gss_cat %>%
  mutate(marital = fct_lump(marital, prop = 0.10, other_level = "other"))

gss_cat %>%
  tabyl(marital)

##        marital     n    percent
##  never_married  5416 0.25210632
##       divorced  4126 0.19205884
##        married 10117 0.47093050
##          other  1824 0.08490434

fct_reorder() is especially useful for plotting. We will explore plotting in the next chapter, but to show you why fct_reorder() is so useful, I will create a barplot, first without using fct_reorder() to re-order the factors, then with reordering. Do not worry if you don’t understand all the code for now:

gss_cat %>%
    tabyl(marital) %>%
    ggplot() +
    geom_col(aes(y = n, x = marital)) +
    coord_flip()

It would be much better if the categories were ordered by frequency. This is easy to do with fct_reorder():

gss_cat %>%
    tabyl(marital) %>%
    mutate(marital = fct_reorder(marital, n, .desc = FALSE)) %>%
    ggplot() +
    geom_col(aes(y = n, x = marital)) +
    coord_flip()

Much better! In Chapter 6, we are going to learn about {ggplot2}.

{forcats} contains other very useful functions, so I urge you to go through the documentation.

5.7.2 Get your dates right with `{lubridate}`

Just like {forcats}, {lubridate} contains numerous functions, each one of them very useful. However, here, I will only focus on a handful of them (those that I use very often). If you need to work with dates a lot in your work, consider reading all the documentation.

5.7.3 Manipulate strings with `{stringr}`

{stringr} contains functions to manipulate strings. In Chapter 11, I will teach you about regular expressions, but the functions contained in {stringr} allow you to already do a lot of work on strings, without needing to be a regular expression expert.

I will discuss the most common string operations: search and replace (or remove), detect, locate or match, and trim a string.

5.8 List-columns

To learn about list-columns, let’s first focus on a single character of the starwars dataset:

data(starwars)

starwars %>%
  filter(name == "Luke Skywalker") %>%
  glimpse()

## Observations: 1
## Variables: 13
## $ name       <chr> "Luke Skywalker"
## $ height     <int> 172
## $ mass       <dbl> 77
## $ hair_color <chr> "blond"
## $ skin_color <chr> "fair"
## $ eye_color  <chr> "blue"
## $ birth_year <dbl> 19
## $ gender     <chr> "male"
## $ homeworld  <chr> "Tatooine"
## $ species    <chr> "Human"
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">]
## $ starships  <list> [<"X-wing", "Imperial shuttle">]

We see that the columns films, vehicles and starships are all lists, and in the case of films, it lists all the films where Luke Skywalker has appeared. What if you want to take a closer look at this list?

starwars %>%
  filter(name == "Luke Skywalker") %>%
  pull(films)

## [[1]]
## [1] "Revenge of the Sith"     "Return of the Jedi"     
## [3] "The Empire Strikes Back" "A New Hope"             
## [5] "The Force Awakens"

pull() is a dplyr function that extract (pulls) the column you’re interested in. It is quite useful when you want to inspect a column.

Suppose we want to create a categorical variable which counts the number of movies in which the characters have appeared. For this we need to compute the length of the list, or count the number of elements this list has. Let’s try with length() a base R function:

starwars %>%
  filter(name == "Luke Skywalker") %>%
  pull(films) %>%
  length()

## [1] 1

This might be surprising at first, because we know that Luke Skywalker has appeared in more than 1 movie… the problem here is that for each individual, films is a list, whose single element is a vector of characters. This means that length(films) computes the length of the list, which is one, and not the length of the vector contained in the list! How can we get the length of the vector of characters contained in the list and for each character? For this we need to use dplyr::rowwise() and remove the filter() function and use mutate() to add this column to the dataset:

starwars <- starwars %>%
  rowwise() %>%
  mutate(n_films = length(films))

dplyr::rowwise() is useful when working with list-columns: columns that have lists as elements.

Let’s take a look at the characters and the number of films they have appeared in:

starwars %>%
  select(name, n_films)

## Source: local data frame [87 x 2]
## Groups: <by row>
## 
## # A tibble: 87 x 2
##    name               n_films
##    <chr>                <int>
##  1 Luke Skywalker           5
##  2 C-3PO                    6
##  3 R2-D2                    7
##  4 Darth Vader              4
##  5 Leia Organa              5
##  6 Owen Lars                3
##  7 Beru Whitesun lars       3
##  8 R5-D4                    1
##  9 Biggs Darklighter        1
## 10 Obi-Wan Kenobi           6
## # ... with 77 more rows

Now we can create a factor variable that groups characters by asking whether they appeared only in 1 movie, or more:

starwars <- starwars %>%
  mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie",
                            n_films != 1 ~ "More than 1 movie"))

case_when() is a dplyr function that works similarly to the standard if..else.. construct of many programming languages (R also has this, we are going to learn about it in later chapters).

You can also create list columns with your own datasets, by using tidyr::nest(). Remember the fake survey_data I created to illustrate spread() and gather()? Let’s go back to that dataset again:

print(survey_data)

## # A tibble: 12 x 3
##       id variable value
##    <dbl> <chr>    <dbl>
##  1     1 var1       1  
##  2     1 var2       0.2
##  3    NA var3       0.3
##  4     2 var1       1.4
##  5     2 var2       1.9
##  6     2 var3       4.1
##  7     3 var1       0.1
##  8     3 var2       2.8
##  9     3 var3       8.9
## 10     4 var1       1.7
## 11    NA var2       1.9
## 12     4 var3       7.6

nested_data = survey_data %>%
  nest(variable, value)

print(nested_data)

## # A tibble: 5 x 2
##      id data            
##   <dbl> <list>          
## 1     1 <tibble [2 × 2]>
## 2    NA <tibble [2 × 2]>
## 3     2 <tibble [3 × 2]>
## 4     3 <tibble [3 × 2]>
## 5     4 <tibble [2 × 2]>

This creates a new tibble, with columns id and data. data is a list-column that contains tibbles; each tibble is the variable and value for each individual:

nested_data %>%
  filter(id == "1") %>%
  pull(data)

## [[1]]
## # A tibble: 2 x 2
##   variable value
##   <chr>    <dbl>
## 1 var1       1  
## 2 var2       0.2

As you can see, for individual 1, the column data contains a 2x2 tibble with columns variable and value. You might be wondering why this is useful, because this seems to introduce an unnecessary layer of complexity. The usefulness of list-columns will become apparent in the next chapters, where we are going to learn how to repeat actions over, say, individuals.

5.9 Exercises

Exercise 1

Load the LaborSupply dataset from the Ecdat package and answer the following questions:

Compute the average annual hours worked by year (plus standard deviation)
What age group worked the most hours in the year 1982?
Create a variable, n_years that equals the number of years an individual stays in the panel. Is the panel balanced?
Which are the individuals that do not have any kids during the whole period? Create a variable, no_kids, that flags these individuals (1 = no kids, 0 = kids)
Using the no_kids variable from before compute the average wage, standard deviation and number of observations in each group for the year 1980 (no kids group vs kids group).
Create the lagged logarithm of hours worked and wages. Remember that this is a panel.

Exercise 2

What does the following code do? Copy and paste it in an R interpreter to find out!

LaborSupply %>%
  group_by(id) %>%
  mutate_at(vars(starts_with("l")), funs(lag, lead))

mutate_at() is a scoped version of mutate() which allows you to specify a number of columns and functions in one go. This also exists for summarise().

Using summarise_at(), compute the mean, standard deviation and number of individuals of lnhr and lnwg for each individual.

Exercise 3

In the dataset folder you downloaded at the beginning of the chapter, there is a folder called “unemployment”. I used the data in the section about working with lists of datasets. Using rio::import_list(), read the 4 datasets into R.
Using map(), map the janitor::clean_names() function to each dataset (just like in the example in the section on working with lists of datasets). Then, still with map() and mutate() convert all commune names in the commune column with the function tolower(), in a new column called lcommune. This is not an easy exercise; so here are some hints:
- Remember that all_datasets is a list of datasets. Which function do you use when you want to map a function to each element of a list?
- Each element of all_datasets are data.frame objects. Which function do you use to add a column to a data.frame?
- What symbol can you use to access a column of a data.frame?

References

Vaudor, Lise. 2018. “Itération de Fonctions Avec Purrr.” R-Atique: Analyse de Données Avec R. http://perso.ens-lyon.fr/lise.vaudor/iterer-des-fonctions-avec-purrr/.

Chapter 5 Descriptive statistics and data manipulation

5.1 A data exploration exercice using base R

5.2 Smoking is bad for you, but pipes are your friend

5.3 The {tidyverse}’s enfant prodige: {dplyr}

5.3.1 A first taste of data manipulation with {dplyr}

5.3.2 Filter the rows of a dataset with filter()

5.3.3 Select columns with select()

5.3.4 Group the observations of your dataset with group_by()

5.3.5 Get summary statistics with summarise()

5.3.6 Adding columns with mutate() and transmute()

5.3.7 Joining tibbles with full_join(), left_join(), right_join() and all the others

5.4 Reshaping data with tidyr

5.4.1 spread() and gather()

5.4.2 fill() and full_seq()

5.4.3 Put order in your columns with separate(), unite(), and in your rows with separate_rows()

5.4.4 A sneak peek to Chapter 9; nest()

5.5 Scoped {tidyverse} verbs