Chapter 5 Descriptive statistics and data manipulation
Now that we are familiar with some R objects and know how to import data, it is time to write some
code. In this chapter, we are going to compute descriptive statistics for a single dataset, but
also for a list of datasets. However, I will not give a list of functions to compute descriptive
statistics; if you need a specific function you can find easily in the Help pane in Rstudio or
using any modern internet search engine. What I will do is show you a workflow that allows you to
compute the descripitive statisics you need fast.
R has a lot of built-in functions for descriptive statistics; however, if you want to compute
statistics by, say, gender, some more complex manipulations are needed. At least this was true in
the past. Nowadays, thanks to the packages from the tidyverse
, it is very easy and fast to
compute descriptive statistics by any stratifying variable(s). The package we are going to use for
this is called dplyr
. dplyr
contains a lot of functions that make manipulating
data and computing descriptive statistics very easy. To make things easier for now, we are going to
use example data included with dplyr
. So no need to import an external dataset; this does not
change anything to the example that we are going to study here; the source of the data does not
matter for this. Using dplyr
is possible only if the data you are working with is already in
a useful shape. When data is more messy, you will need to first manipulate it to bring it a tidy
format. For this, we will use tidyr
, which is very useful package to reshape data and to do
advanced cleaning of your data.
All these tidyverse functions are also called verbs. However, before getting to know these verbs,
let’s do an analysis using standard, or base R functions. This will be the benchmark against
which we are going to measure a {tidyverse}
workflow.
5.1 A data exploration exercice using base R
Let’s first load the starwars
data set, included in the {dplyr}
package:
Let’s first take a look at the data:
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
This data contains information on Star Wars characters. The first question you have to answer is to find the average height of the characters:
## [1] NA
Let’s also take a look at the standard deviation:
## [1] NA
It might be more informative to compute these two statistics by species, so for this, we are going
to use aggregate()
:
## Species x
## 1 Aleena 79.0000
## 2 Besalisk 198.0000
## 3 Cerean 198.0000
## 4 Chagrian 196.0000
## 5 Clawdite 168.0000
## 6 Droid NA
## 7 Dug 112.0000
## 8 Ewok 88.0000
## 9 Geonosian 183.0000
## 10 Gungan 208.6667
## 11 Human NA
## 12 Hutt 175.0000
## 13 Iktotchi 188.0000
## 14 Kaleesh 216.0000
## 15 Kaminoan 221.0000
## 16 Kel Dor 188.0000
## 17 Mirialan 168.0000
## 18 Mon Calamari 180.0000
## 19 Muun 191.0000
## 20 Nautolan 196.0000
## 21 Neimodian 191.0000
## 22 Pau'an 206.0000
## 23 Quermian 264.0000
## 24 Rodian 173.0000
## 25 Skakoan 193.0000
## 26 Sullustan 160.0000
## 27 Tholothian 184.0000
## 28 Togruta 178.0000
## 29 Toong 163.0000
## 30 Toydarian 137.0000
## 31 Trandoshan 190.0000
## 32 Twi'lek 179.0000
## 33 Vulptereen 94.0000
## 34 Wookiee 231.0000
## 35 Xexto 122.0000
## 36 Yoda's species 66.0000
## 37 Zabrak 173.0000
Even if you are not familiar with aggregate()
, I believe the above lines are quite self-explanatory.
You need to provide aggregate()
with 3 things; the variable you want to summarize (or only the
data frame, if you want to summarize all variables), a list of grouping variables and then the
function that will be applied to each subset. You can easily add another grouping variable:
aggregate(starwars$height,
by = list(Species = starwars$species,
Homeworld = starwars$homeworld),
mean)
## Species Homeworld x
## 1 Human Alderaan 176.3333
## 2 Aleena Aleen Minor 79.0000
## 3 Human Bespin 175.0000
## 4 Human Bestine IV 180.0000
## 5 Neimodian Cato Neimoidia 191.0000
## 6 Cerean Cerea 198.0000
## 7 Chagrian Champala 196.0000
## 8 Human Chandrila 150.0000
## 9 Human Concord Dawn 183.0000
## 10 Human Corellia 175.0000
## 11 Human Coruscant 168.5000
## 12 Tholothian Coruscant 184.0000
## 13 Zabrak Dathomir 175.0000
## 14 Kel Dor Dorin 188.0000
## 15 Ewok Endor 88.0000
## 16 Human Eriadu 180.0000
## 17 Geonosian Geonosis 183.0000
## 18 Nautolan Glee Anselm 196.0000
## 19 Human Haruun Kal 188.0000
## 20 Iktotchi Iktotch 188.0000
## 21 Zabrak Iridonia 171.0000
## 22 Kaleesh Kalee 216.0000
## 23 Human Kamino 183.0000
## 24 Kaminoan Kamino 221.0000
## 25 Wookiee Kashyyyk 231.0000
## 26 Dug Malastare 112.0000
## 27 Mirialan Mirial 168.0000
## 28 Mon Calamari Mon Cala 180.0000
## 29 Muun Muunilinst 191.0000
## 30 Droid Naboo 96.0000
## 31 Gungan Naboo 208.6667
## 32 Human Naboo 168.4000
## 33 Hutt Nal Hutta 175.0000
## 34 Besalisk Ojom 198.0000
## 35 Quermian Quermia 264.0000
## 36 Rodian Rodia 173.0000
## 37 Twi'lek Ryloth 179.0000
## 38 Human Serenno 193.0000
## 39 Togruta Shili 178.0000
## 40 Skakoan Skako 193.0000
## 41 Human Socorro 177.0000
## 42 Human Stewjon 182.0000
## 43 Sullustan Sullust 160.0000
## 44 Droid Tatooine 132.0000
## 45 Human Tatooine 179.2500
## 46 Toydarian Toydaria 137.0000
## 47 Trandoshan Trandosha 190.0000
## 48 Xexto Troiken 122.0000
## 49 Toong Tund 163.0000
## 50 Pau'an Utapau 206.0000
## 51 Vulptereen Vulpter 94.0000
## 52 Clawdite Zolan 168.0000
or use another function:
## Species x
## 1 Aleena NA
## 2 Besalisk NA
## 3 Cerean NA
## 4 Chagrian NA
## 5 Clawdite NA
## 6 Droid NA
## 7 Dug NA
## 8 Ewok NA
## 9 Geonosian NA
## 10 Gungan 14.189198
## 11 Human NA
## 12 Hutt NA
## 13 Iktotchi NA
## 14 Kaleesh NA
## 15 Kaminoan 11.313708
## 16 Kel Dor NA
## 17 Mirialan 2.828427
## 18 Mon Calamari NA
## 19 Muun NA
## 20 Nautolan NA
## 21 Neimodian NA
## 22 Pau'an NA
## 23 Quermian NA
## 24 Rodian NA
## 25 Skakoan NA
## 26 Sullustan NA
## 27 Tholothian NA
## 28 Togruta NA
## 29 Toong NA
## 30 Toydarian NA
## 31 Trandoshan NA
## 32 Twi'lek 1.414214
## 33 Vulptereen NA
## 34 Wookiee 4.242641
## 35 Xexto NA
## 36 Yoda's species NA
## 37 Zabrak 2.828427
aggregate()
returns a data.frame
object:
## [1] "data.frame"
tapply()
is another base R alternative:
## Aleena Besalisk Cerean Chagrian Clawdite
## 79.0000 198.0000 198.0000 196.0000 168.0000
## Droid Dug Ewok Geonosian Gungan
## NA 112.0000 88.0000 183.0000 208.6667
## Human Hutt Iktotchi Kaleesh Kaminoan
## NA 175.0000 188.0000 216.0000 221.0000
## Kel Dor Mirialan Mon Calamari Muun Nautolan
## 188.0000 168.0000 180.0000 191.0000 196.0000
## Neimodian Pau'an Quermian Rodian Skakoan
## 191.0000 206.0000 264.0000 173.0000 193.0000
## Sullustan Tholothian Togruta Toong Toydarian
## 160.0000 184.0000 178.0000 163.0000 137.0000
## Trandoshan Twi'lek Vulptereen Wookiee Xexto
## 190.0000 179.0000 94.0000 231.0000 122.0000
## Yoda's species Zabrak
## 66.0000 173.0000
which returns an array
object, which is similar to a vector.
However, tapply()
does not work if you want the mean by species for all the variables in the
data frame:
Error in tapply(starwars, list(starwars$species), mean) :
arguments must have same length
In both cases, you can only specify one function. So if you need the average and the standard deviation you have to do it in two steps.
Let’s continue now, by only computing the average height by species, but for males:
starwars_males <- subset(starwars, gender == "male")
aggregate(starwars_males$height,
by = list(Species = starwars_males$species),
mean)
## Species x
## 1 Aleena 79.0000
## 2 Besalisk 198.0000
## 3 Cerean 198.0000
## 4 Chagrian 196.0000
## 5 Dug 112.0000
## 6 Ewok 88.0000
## 7 Geonosian 183.0000
## 8 Gungan 208.6667
## 9 Human NA
## 10 Iktotchi 188.0000
## 11 Kaleesh 216.0000
## 12 Kaminoan 229.0000
## 13 Kel Dor 188.0000
## 14 Mon Calamari 180.0000
## 15 Muun 191.0000
## 16 Nautolan 196.0000
## 17 Neimodian 191.0000
## 18 Pau'an 206.0000
## 19 Quermian 264.0000
## 20 Rodian 173.0000
## 21 Skakoan 193.0000
## 22 Sullustan 160.0000
## 23 Toong 163.0000
## 24 Toydarian 137.0000
## 25 Trandoshan 190.0000
## 26 Twi'lek 180.0000
## 27 Vulptereen 94.0000
## 28 Wookiee 231.0000
## 29 Xexto 122.0000
## 30 Yoda's species 66.0000
## 31 Zabrak 173.0000
I first use subset()
to create a subset of the data in which I only kept males. Then, I rerun
the analysis from before again. subset()
can also be used to select columns. So if you want
the average of the height and mass for males, you could do something like this:
starwars_males_height_mass <- subset(starwars, gender == "male", select = c(height, mass, species))
aggregate(starwars_males_height_mass,
by = list(Species = starwars_males_height_mass$species),
mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Species height mass species
## 1 Aleena 79.0000 15 NA
## 2 Besalisk 198.0000 102 NA
## 3 Cerean 198.0000 82 NA
## 4 Chagrian 196.0000 NA NA
## 5 Dug 112.0000 40 NA
## 6 Ewok 88.0000 20 NA
## 7 Geonosian 183.0000 80 NA
## 8 Gungan 208.6667 NA NA
## 9 Human NA NA NA
## 10 Iktotchi 188.0000 NA NA
## 11 Kaleesh 216.0000 159 NA
## 12 Kaminoan 229.0000 88 NA
## 13 Kel Dor 188.0000 80 NA
## 14 Mon Calamari 180.0000 83 NA
## 15 Muun 191.0000 NA NA
## 16 Nautolan 196.0000 87 NA
## 17 Neimodian 191.0000 90 NA
## 18 Pau'an 206.0000 80 NA
## 19 Quermian 264.0000 NA NA
## 20 Rodian 173.0000 74 NA
## 21 Skakoan 193.0000 48 NA
## 22 Sullustan 160.0000 68 NA
## 23 Toong 163.0000 65 NA
## 24 Toydarian 137.0000 NA NA
## 25 Trandoshan 190.0000 113 NA
## 26 Twi'lek 180.0000 NA NA
## 27 Vulptereen 94.0000 45 NA
## 28 Wookiee 231.0000 124 NA
## 29 Xexto 122.0000 NA NA
## 30 Yoda's species 66.0000 17 NA
## 31 Zabrak 173.0000 NA NA
This is starting to get a bit verbose, but it is quite easy to follow and very powerful. It certainly beats having to write loops to achieve the same thing.
Let’s now consider this new dataset:
## id var1 var2 var3
## 1 1 1.0 0.2 0.3
## 2 2 1.4 1.9 4.1
## 3 3 0.1 2.8 8.9
## 4 4 1.7 1.9 7.6
I will explain later where this comes from. Depending on what you want to do with this data, it is
not in the right shape. So let’s reshape it, using the aptly-called reshape()
command:
survey_data_long <- reshape(survey_data_base,
varying = list(2:4), v.names = "variable", direction = "long")
We can now easily compute the average of variable
for each id
:
## Id x
## 1 1 0.500000
## 2 2 2.466667
## 3 3 3.933333
## 4 4 3.733333
There is also the possiblity to merge two datasets with merge()
. I won’t go into that however.
As you can see, R comes with very powerful functions right out of the box, ready to use. When I was
studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do to all
this using loops (not reshaping, thankfully), which was not so easy.
Now that we have seen how base R works, let’s redo the analysis using {tidyverse}
verbs.
But before deep diving into the {tidyverse}
, let’s take a moment to discuss about our lord and
saviour, %>%
.
5.2 Smoking is bad for you, but pipes are your friend
The title of this section might sound weird at first, but by the end of it, you’ll get this (terrible) pun.
You probably know the following painting by René Magritte, La trahison des images:
It turns out there’s an R package from the tidyverse
that is called magrittr
. What does this
package do? It brings pipes to R. Pipes are a concept from the Unix operating system; if you’re
using a GNU+Linux distribution or macOS, you’re basically using a modern unix (that’s an
oversimplification, but I’m an economist by training, and outrageously oversimplifying things is
what we do, deal with it).
The idea of pipes is to take the output of a command, and feed it as the input of another
command. The magrittr
package brings pipes to R, by using the weird looking %>%
. Try the
following:
## [1] 4
This looks quite weird, but you probably understand what happened; 16
got fed as the first
argument of the function sqrt()
. You can chain multiple functions:
## [1] 22
The output of 16
(16
) got fed to sqrt()
, and the output of sqrt(16)
(4) got fed to +(18)
(22). Without %>%
you’d write the line just above like this:
## [1] 22
It might not be very clear right now why this is useful, but the %>%
is probably one of the
most useful infix operators, because when using packages from the tidyverse
, you will
naturally want to chain a lot of functions together. Without the %>%
it would become messy very fast.
%>%
is not the only pipe operator in magrittr
. There’s %T%
, %<>%
and %$%
. All have their
uses, but are basically shortcuts to some common tasks with %>%
plus another function. Which
means that you can live without them, and because of this, I will not discuss them.
5.3 The {tidyverse}
’s enfant prodige: {dplyr}
The best way to get started with the tidyverse packages is to get to know {dplyr}
. {dplyr}
prodives
a lot of very useful functions that makes it very easy to get discriptive statistics or add new columns
to your data.
5.3.1 A first taste of data manipulation with {dplyr}
This section will walk you through a typical analysis using {dplyr}
funcitons. Just go with it; I
will give more details in the next sections.
First, let’s load dplyr
and the included starwars
dataset. Let’s also take a look at the first 5
lines of the dataset:
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
data(starwars)
loads the example dataset called starwars
that is included in the package dplyr
.
As I said earlier, this is just an example; you could have loaded an external dataset, from a
.csv
file for instance. This does not matter for what comes next.
R includes a lot of functions for descriptive statistics, such as mean()
, sd()
, cov()
, and many
more. What dplyr
brings to the table (among other niceties) is the possibility to apply these
functions to the dataset easily. For example, imagine you want the average height of everyone in
the dataset. Using the basic R functions, you could write this:
## [1] NA
starwars$height
means that the user wants to access the column called height
from the dataset
starwars
. Remember that the $
symbol is how you access elements of a named list. This is the
same for columns of datasets as you can see. This is then given as an argument to the function
mean()
. But what if the user wants the average height by species? Before dplyr
, a solution to
this simple problem would have required more than a single command. Now this is as easy as:
## # A tibble: 38 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # ... with 28 more rows
The usefulness of the %>%
(pipe operator) becomes apparent now. Without it, one would write
instead:
## # A tibble: 38 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # ... with 28 more rows
as you can clearly see, it is much more difficult to read. Imagine now that I want the average height
by species, but only for males. Again, this is very easy using %>%
:
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Dug 112
## 6 Ewok 88
## 7 Geonosian 183
## 8 Gungan 209.
## 9 Human NA
## 10 Iktotchi 188
## # ... with 22 more rows
Again, the %>%
makes the above lines of code very easy to read. Without it, one would need to write:
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Dug 112
## 6 Ewok 88
## 7 Geonosian 183
## 8 Gungan 209.
## 9 Human NA
## 10 Iktotchi 188
## # ... with 22 more rows
I think you agree with me that this is not very readable. One way to make it more readable would be to save intermediary variables:
filtered_data <- filter(starwars, gender == "male")
grouped_data <- group_by(filter(starwars, gender == "male"), species)
summarise(grouped_data, mean(height))
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Dug 112
## 6 Ewok 88
## 7 Geonosian 183
## 8 Gungan 209.
## 9 Human NA
## 10 Iktotchi 188
## # ... with 22 more rows
But this can get very tedious. Once you’re used to %>%
, you won’t go back to not use it.
Before continuing and to make things clearer; filter()
, group_by()
and summarise()
are
functions that are included in dplyr
. %>%
is actually a function from magrittr
, but this
package gets loaded on the fly when you load dplyr
, so you do not need to worry about it.
mean()
is a function native to R.
The result of all these operations that use dplyr
functions are actually other datasets, or
tibbles
. This means that you can save them in variable, and then work with these as any other
datasets.
## [1] "tbl_df" "tbl" "data.frame"
## # A tibble: 6 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
You could then write this data to disk using rio::export()
for instance. If you need more than the
mean of the height, you can keep adding as many functions as needed:
summary_table <- starwars %>%
group_by(species) %>%
summarise(ave_height = mean(height), var_height = var(height), n_obs = n())
print(summary_table)
## # A tibble: 38 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 Aleena 79 NA 1
## 2 Besalisk 198 NA 1
## 3 Cerean 198 NA 1
## 4 Chagrian 196 NA 1
## 5 Clawdite 168 NA 1
## 6 Droid NA NA 5
## 7 Dug 112 NA 1
## 8 Ewok 88 NA 1
## 9 Geonosian 183 NA 1
## 10 Gungan 209. 201. 3
## # ... with 28 more rows
I’ve added more functions, namely var()
, to get the variance of height, and n()
, which
is a function from dplyr
, not base R, to get the number of observations. This is quite useful,
because we see that for a lot of species we only have one single individual! Let’s focus on the
species for which we have more than 1 individual. Since we save all the previous operations (which
produce a tibble
) in a variable, we can keep going from there:
## # A tibble: 9 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 Droid NA NA 5
## 2 Gungan 209. 201. 3
## 3 Human NA NA 35
## 4 Kaminoan 221 128 2
## 5 Mirialan 168 8 2
## 6 Twi'lek 179 2 2
## 7 Wookiee 231 18 2
## 8 Zabrak 173 8 2
## 9 <NA> NA NA 5
There’s a lot of NA
s; this is because by default, mean()
and var()
return NA
if even one
single observation is NA
. This is good, because it forces you to look at the data
to see what is going on. If you would get a number, even if there were NA
s you could very easily
miss these missing values. It is better for functions to fail early and often than the opposite.
mean()
and var()
have a na.rm
option that the user can set to TRUE
to get the result by
ignoring the NA
s:
starwars %>%
group_by(species) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 9 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 Droid 140 2705. 5
## 2 Gungan 209. 201. 3
## 3 Human 177. 157. 35
## 4 Kaminoan 221 128 2
## 5 Mirialan 168 8 2
## 6 Twi'lek 179 2 2
## 7 Wookiee 231 18 2
## 8 Zabrak 173 8 2
## 9 <NA> 160 1826 5
In the code above, I have combined the two previous steps to get the result I’m interested in. There’s
a line in the final output that says NA
for the species. Let’s go back to the raw data and find
these lines:
## # A tibble: 5 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Ric … 183 NA brown fair blue NA male
## 2 Quar… 183 NA black dark brown 62 male
## 3 R4-P… 96 NA none silver, r… red, blue NA female
## 4 Sly … 178 48 none pale white NA female
## 5 Capt… NA NA unknown unknown unknown NA female
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
To test for NA
, one uses the function is.na()
not something like species == "NA"
or anything
like that. !is.na()
does the opposite:
## # A tibble: 82 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # ... with 72 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
The !
function negates a predicate function (a predicate function is a function that returns
TRUE
or FALSE
). We can then rerun our analysis from before:
starwars %>%
filter(!is.na(species)) %>%
group_by(species) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 8 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 Droid 140 2705. 5
## 2 Gungan 209. 201. 3
## 3 Human 177. 157. 35
## 4 Kaminoan 221 128 2
## 5 Mirialan 168 8 2
## 6 Twi'lek 179 2 2
## 7 Wookiee 231 18 2
## 8 Zabrak 173 8 2
And why not compute the same table, but first add another stratifying variable?
starwars %>%
filter(!is.na(species)) %>%
group_by(species, gender) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 8 x 5
## # Groups: species [6]
## species gender ave_height var_height n_obs
## <chr> <chr> <dbl> <dbl> <int>
## 1 Droid none 200 NA 2
## 2 Droid <NA> 120 1657 3
## 3 Gungan male 209. 201. 3
## 4 Human female 160. 48.8 9
## 5 Human male 182. 67.1 26
## 6 Mirialan female 168 8 2
## 7 Wookiee male 231 18 2
## 8 Zabrak male 173 8 2
Ok, that’s it for a first taste. We have already discovered some very useful {dplyr}
functions,
filter()
, group_by()
and summarise summarise()
.
Now, we are going to learn more about these functions in more detail.
5.3.2 Filter the rows of a dataset with filter()
We’re going to use the Gasoline
dataset from the plm
package, so install that first:
Then load the required data:
and load dplyr:
This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load
the data like this, it is a standard data.frame
. dplyr
functions can be used on standard
data.frame
objects, but also on tibble
s. tibble
s are just like data frame, but with a better
print method (and other niceties). I’ll discuss the {tibble}
package later, but for now, let’s
convert the data to a tibble
and change its name:
filter()
is pretty straightforward. What if you would like to subset the data to focus on the
year 1969? Simple:
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
Let’s use %>%
, since we’re familiar with it now:
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
You can also filter more than just one year, by using the %in%
operator:
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # ... with 80 more rows
It is also possible use between()
, a helper function:
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # ... with 80 more rows
To select non-consecutive years:
## # A tibble: 54 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 3 AUSTRIA 1977 3.93 -5.83 -0.422 -8.25
## 4 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 5 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## 6 BELGIUM 1977 3.85 -5.56 -0.432 -8.14
## 7 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 8 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 9 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 10 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## # ... with 44 more rows
%in%
tests if an object is part of a set.
5.3.3 Select columns with select()
While filter()
allows you to keep or discard rows of data, select()
allows you to keep or discard entire columns. To keep columns:
## # A tibble: 342 x 3
## country year lrpmg
## * <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
To discard them:
## # A tibble: 342 x 3
## lgaspcar lincomep lcarpcap
## * <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -9.77
## 2 4.10 -6.43 -9.61
## 3 4.07 -6.41 -9.46
## 4 4.06 -6.37 -9.34
## 5 4.04 -6.32 -9.24
## 6 4.03 -6.29 -9.12
## 7 4.05 -6.25 -9.02
## 8 4.05 -6.23 -8.93
## 9 4.05 -6.21 -8.85
## 10 4.05 -6.15 -8.79
## # ... with 332 more rows
To rename them:
## # A tibble: 342 x 3
## country date lrpmg
## * <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
There’s also rename()
:
## # A tibble: 342 x 6
## country date lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
rename()
does not do any kind of selection, but just renames.
You can also use select()
to re-order columns:
## # A tibble: 342 x 6
## year country lrpmg lgaspcar lincomep lcarpcap
## * <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1960 AUSTRIA -0.335 4.17 -6.47 -9.77
## 2 1961 AUSTRIA -0.351 4.10 -6.43 -9.61
## 3 1962 AUSTRIA -0.380 4.07 -6.41 -9.46
## 4 1963 AUSTRIA -0.414 4.06 -6.37 -9.34
## 5 1964 AUSTRIA -0.445 4.04 -6.32 -9.24
## 6 1965 AUSTRIA -0.497 4.03 -6.29 -9.12
## 7 1966 AUSTRIA -0.467 4.05 -6.25 -9.02
## 8 1967 AUSTRIA -0.506 4.05 -6.23 -8.93
## 9 1968 AUSTRIA -0.522 4.05 -6.21 -8.85
## 10 1969 AUSTRIA -0.559 4.05 -6.15 -8.79
## # ... with 332 more rows
everything()
is a helper function, and there’s also starts_with()
,
and ends_with()
. For example, what if we are only interested
in columns whose name start with “l”?
## # A tibble: 342 x 4
## lgaspcar lincomep lrpmg lcarpcap
## * <dbl> <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -0.335 -9.77
## 2 4.10 -6.43 -0.351 -9.61
## 3 4.07 -6.41 -0.380 -9.46
## 4 4.06 -6.37 -0.414 -9.34
## 5 4.04 -6.32 -0.445 -9.24
## 6 4.03 -6.29 -0.497 -9.12
## 7 4.05 -6.25 -0.467 -9.02
## 8 4.05 -6.23 -0.506 -8.93
## 9 4.05 -6.21 -0.522 -8.85
## 10 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
ends_with()
works in a similar fashion. There is also contains()
:
## # A tibble: 342 x 4
## country year lgaspcar lcarpcap
## * <fct> <int> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -9.77
## 2 AUSTRIA 1961 4.10 -9.61
## 3 AUSTRIA 1962 4.07 -9.46
## 4 AUSTRIA 1963 4.06 -9.34
## 5 AUSTRIA 1964 4.04 -9.24
## 6 AUSTRIA 1965 4.03 -9.12
## 7 AUSTRIA 1966 4.05 -9.02
## 8 AUSTRIA 1967 4.05 -8.93
## 9 AUSTRIA 1968 4.05 -8.85
## 10 AUSTRIA 1969 4.05 -8.79
## # ... with 332 more rows
Another verb, similar to select()
, is pull()
. Let’s compare the two:
## # A tibble: 342 x 1
## lrpmg
## * <dbl>
## 1 -0.335
## 2 -0.351
## 3 -0.380
## 4 -0.414
## 5 -0.445
## 6 -0.497
## 7 -0.467
## 8 -0.506
## 9 -0.522
## 10 -0.559
## # ... with 332 more rows
## [1] -0.33454761 -0.35132761 -0.37951769 -0.41425139 -0.44533536
## [6] -0.49706066 -0.46683773 -0.50588340 -0.52241255 -0.55911051
## [11] -0.59656122 -0.65445914 -0.59633184 -0.59444681 -0.46602693
## [16] -0.45414221 -0.50008372 -0.42191563 -0.46960312 -0.16570961
## [21] -0.17173098 -0.22229138 -0.25046225 -0.27591057 -0.34493695
## [26] -0.23639770 -0.26699499 -0.31116076 -0.35480852 -0.37794044
## [31] -0.39922992 -0.31064584 -0.37309192 -0.36223563 -0.36430848
## [36] -0.37896584 -0.43164133 -0.59094964 -0.97210650 -0.97229024
## [41] -0.97860756 -1.01904791 -1.00285696 -1.01712549 -1.01694436
## [46] -1.02359713 -1.01984524 -1.03686389 -1.06733308 -1.05803676
## [51] -1.09966703 -1.13316142 -1.12379997 -1.18568427 -1.06179659
## [56] -1.07084448 -1.07495073 -0.19570260 -0.25361844 -0.21875400
## [61] -0.24800936 -0.30654923 -0.32701542 -0.39618846 -0.44257369
## [66] -0.35204752 -0.40687922 -0.44046082 -0.45473954 -0.49918863
## [71] -0.43257185 -0.42517720 -0.39395431 -0.35361534 -0.35690917
## [76] -0.29068135 -0.01959833 -0.02386000 -0.06892022 -0.13792900
## [81] -0.19784646 -0.23365325 -0.26427164 -0.29405795 -0.32316179
## [86] -0.31519087 -0.33384616 -0.37945667 -0.40781642 -0.47503429
## [91] -0.21698191 -0.25838174 -0.24651309 -0.22550681 -0.38075942
## [96] -0.18591078 -0.23095384 -0.34384171 -0.37464672 -0.39965256
## [101] -0.43987825 -0.54000197 -0.54998139 -0.43824222 -0.58923137
## [106] -0.63329520 -0.67176311 -0.71797458 -0.72587521 -0.56982876
## [111] -0.56482380 -0.62481298 -0.59761210 -0.62817279 -0.08354740
## [116] -0.10421997 -0.13320751 -0.15653576 -0.18051772 -0.07793999
## [121] -0.11491900 -0.13775849 -0.15375883 -0.17986997 -0.20252426
## [126] -0.06761078 -0.11973059 -0.05191029 0.31625351 0.20631574
## [131] 0.19319312 0.23502961 0.16896037 -0.07648118 -0.12040874
## [136] -0.14160039 -0.15232915 -0.24428212 -0.16899366 -0.21071901
## [141] -0.17383533 -0.21339314 -0.27162842 -0.32069023 -0.36041067
## [146] -0.42393131 -0.64567297 -0.55343875 -0.64126416 -0.66134256
## [151] -0.56011483 -0.66277808 0.16507708 -0.08559038 -0.18351291
## [156] -0.26541405 -0.42609643 -0.32712637 -0.24887418 -0.19160048
## [161] -0.20616656 -0.24756681 -0.23271512 -0.14822267 -0.21508857
## [166] -0.32508487 -0.22290860 -0.03270913 0.10292798 0.16418805
## [171] 0.03482212 -0.14532271 -0.14874940 -0.18731459 -0.19996473
## [176] -0.20386433 -0.23786571 -0.27411537 -0.33167240 -0.35126918
## [181] -0.41685019 -0.46203546 -0.43941354 -0.52100094 -0.46270739
## [186] -0.19090636 -0.15948473 -0.20726559 -0.21904447 -0.28707638
## [191] -0.20148480 -0.21599265 -0.25968008 -0.29718661 -0.36929389
## [196] -0.34197503 -0.34809007 -0.31232019 -0.44450431 -0.41694955
## [201] -0.39954544 -0.43393029 -0.31903240 -0.42728193 -0.35253685
## [206] -0.43426178 -0.42908393 -0.46474195 -0.55791459 -0.13968957
## [211] -0.15790514 -0.19908809 -0.23263318 -0.26374731 -0.31593124
## [216] -0.25011726 -0.26555763 -0.30036775 -0.33823045 -0.39072560
## [221] -0.30127223 -0.26023925 -0.33880765 -0.15100924 -0.32726757
## [226] -0.35308752 -0.38255762 -0.30765935 1.12531070 1.10956235
## [231] 1.05700394 0.97683534 0.91532254 0.81666055 0.75671751
## [236] 0.74130811 0.70386453 0.66948950 0.61217208 0.60699563
## [241] 0.53716844 0.43377166 0.52492096 0.62955545 0.68385409
## [246] 0.52627167 0.62141374 -2.52041588 -2.57148340 -2.53448158
## [251] -2.60511224 -2.65801626 -2.64476790 -2.63901460 -2.65609762
## [256] -2.67918662 -2.73190414 -2.73359211 -2.77884554 -2.77467537
## [261] -2.84142900 -2.79840677 -2.76731461 -2.82294480 -2.82005896
## [266] -2.89649671 -0.82321833 -0.86558473 -0.82218510 -0.86012004
## [271] -0.86767682 -0.90528668 -0.85956665 -0.90656671 -0.87232520
## [276] -0.91812162 -0.96344188 -1.03746081 -0.94015345 -0.86722756
## [281] -0.88692306 -0.88475790 -0.90736205 -0.91147285 -1.03208811
## [286] -0.25340821 -0.34252375 -0.40820484 -0.22499174 -0.25219448
## [291] -0.29347614 -0.35640491 -0.33515022 -0.36507386 -0.29845417
## [296] -0.39882648 -0.30461880 -0.54637424 -0.69162023 -0.33965308
## [301] -0.53794675 -0.75141027 -0.95552413 -0.35290961 -0.39108581
## [306] -0.45185308 -0.42287690 -0.46335147 -0.49577430 -0.42654915
## [311] -0.47068145 -0.44118786 -0.46245080 -0.38332457 -0.41899030
## [316] -0.46135978 -0.52777246 -0.56529718 -0.56641296 -0.20867428
## [321] -0.27354010 -0.50886285 -0.78652911 -1.12111489 -1.14624034
## [326] -1.16187449 -1.17991524 -1.20026222 -1.19428750 -1.19026054
## [331] -1.18991215 -1.20730059 -1.22314272 -1.25176347 -1.28131560
## [336] -1.33116930 -1.29066967 -1.23146686 -1.20037697 -1.15468197
## [341] -1.17590974 -1.21206183
pull()
, unlike select()
, does not return a tibble
, but only the column you want.
5.3.4 Group the observations of your dataset with group_by()
group_by()
is a very useful verb; as the name implies, it allows you to create groups and then,
for example, compute descriptive statistics by groups. For example, let’s group our data by
country:
## # A tibble: 342 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
It looks like nothing much happened, but if you look at the second line of the output you can read the following:
this means that the data is grouped, and every computation you will do now will take these groups into account. It is also possible to group by more than one variable:
## # A tibble: 342 x 6
## # Groups: country, year [342]
## country year lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
and so on. You can then also ungroup:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
Once your data is grouped, the operations that will follow will be executed inside each group.
5.3.5 Get summary statistics with summarise()
Ok, now that we have learned the basic verbs, we can start to do more interesting stuff. For example, one might want to compute the average gasoline consumption in each country, for the whole period:
## # A tibble: 18 x 2
## country `mean(lgaspcar)`
## <fct> <dbl>
## 1 AUSTRIA 4.06
## 2 BELGIUM 3.92
## 3 CANADA 4.86
## 4 DENMARK 4.19
## 5 FRANCE 3.82
## 6 GERMANY 3.89
## 7 GREECE 4.88
## 8 IRELAND 4.23
## 9 ITALY 3.73
## 10 JAPAN 4.70
## 11 NETHERLA 4.08
## 12 NORWAY 4.11
## 13 SPAIN 4.06
## 14 SWEDEN 4.01
## 15 SWITZERL 4.24
## 16 TURKEY 5.77
## 17 U.K. 3.98
## 18 U.S.A. 4.82
mean()
was given as an argument to summarise()
, which is a dplyr
verb. What we get is another
tibble, that contains the variable we used to group, as well as the average per country. We can
also rename this column:
## # A tibble: 18 x 2
## country mean_gaspcar
## <fct> <dbl>
## 1 AUSTRIA 4.06
## 2 BELGIUM 3.92
## 3 CANADA 4.86
## 4 DENMARK 4.19
## 5 FRANCE 3.82
## 6 GERMANY 3.89
## 7 GREECE 4.88
## 8 IRELAND 4.23
## 9 ITALY 3.73
## 10 JAPAN 4.70
## 11 NETHERLA 4.08
## 12 NORWAY 4.11
## 13 SPAIN 4.06
## 14 SWEDEN 4.01
## 15 SWITZERL 4.24
## 16 TURKEY 5.77
## 17 U.K. 3.98
## 18 U.S.A. 4.82
and because the output is a tibble
, we can continue to use dplyr
verbs on it:
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar)) %>%
filter(country == "france")
## # A tibble: 0 x 2
## # ... with 2 variables: country <fct>, mean_gaspcar <dbl>
summarise()
is a very useful verb. For example, we can compute several descriptive statistics at once:
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar),
sd_gaspcar = sd(lgaspcar),
max_gaspcar = max(lgaspcar),
min_gaspcar = min(lgaspcar))
## # A tibble: 18 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 0.0693 4.20 3.92
## 2 BELGIUM 3.92 0.103 4.16 3.82
## 3 CANADA 4.86 0.0262 4.90 4.81
## 4 DENMARK 4.19 0.158 4.50 4.00
## 5 FRANCE 3.82 0.0499 3.91 3.75
## 6 GERMANY 3.89 0.0239 3.93 3.85
## 7 GREECE 4.88 0.255 5.38 4.48
## 8 IRELAND 4.23 0.0437 4.33 4.16
## 9 ITALY 3.73 0.220 4.05 3.38
## 10 JAPAN 4.70 0.684 6.00 3.95
## 11 NETHERLA 4.08 0.286 4.65 3.71
## 12 NORWAY 4.11 0.123 4.44 3.96
## 13 SPAIN 4.06 0.317 4.75 3.62
## 14 SWEDEN 4.01 0.0364 4.07 3.91
## 15 SWITZERL 4.24 0.102 4.44 4.05
## 16 TURKEY 5.77 0.329 6.16 5.14
## 17 U.K. 3.98 0.0479 4.10 3.91
## 18 U.S.A. 4.82 0.0219 4.86 4.79
Because the output is a tibble
, you can save it in a variable of course:
desc_gasoline <- gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar),
sd_gaspcar = sd(lgaspcar),
max_gaspcar = max(lgaspcar),
min_gaspcar = min(lgaspcar))
And then you can answer questions such as, which country has the maximum average gasoline consumption?:
## # A tibble: 1 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TURKEY 5.77 0.329 6.16 5.14
Turns out it’s Turkey. What about the minimum consumption?
## # A tibble: 1 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 ITALY 3.73 0.220 4.05 3.38
Because the output of dplyr
verbs is a tibble, it is possible to continue working with it. This
is one shortcoming of using the base summary()
function. The object returned by that function
is not very easy to manipulate.
5.3.6 Adding columns with mutate()
and transmute()
mutate()
adds a column to the tibble
, which can contain any transformation of any other
variable:
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap `n()`
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 19
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 19
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 19
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 19
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 19
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 19
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 19
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 19
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 19
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 19
## # ... with 332 more rows
Using mutate()
I’ve added a column that counts how many times the country appears in the tibble
,
using n()
, another dplyr
function. There’s also count()
and tally()
, which we are going to
see further down. It is also possible to rename the column on the fly:
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap count
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 19
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 19
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 19
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 19
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 19
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 19
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 19
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 19
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 19
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 19
## # ... with 332 more rows
It is possible to do any arbitrary operation:
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap spam
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 0.100
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 0.0978
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 0.0969
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 0.0991
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 0.102
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 0.104
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 0.110
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 0.113
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 0.115
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 0.122
## # ... with 332 more rows
transmute()
is the same as mutate()
, but only returns the created variable:
## # A tibble: 342 x 2
## # Groups: country [18]
## country spam
## <fct> <dbl>
## 1 AUSTRIA 0.100
## 2 AUSTRIA 0.0978
## 3 AUSTRIA 0.0969
## 4 AUSTRIA 0.0991
## 5 AUSTRIA 0.102
## 6 AUSTRIA 0.104
## 7 AUSTRIA 0.110
## 8 AUSTRIA 0.113
## 9 AUSTRIA 0.115
## 10 AUSTRIA 0.122
## # ... with 332 more rows
5.3.7 Joining tibble
s with full_join()
, left_join()
, right_join()
and all the others
I will end this section on dplyr
with the very useful verbs: the *_join()
verbs. Let’s first
start by loading another dataset from the plm
package. SumHes
and let’s convert it to tibble
and rename it:
Let’s take a quick look at the data:
## Observations: 3,250
## Variables: 7
## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, ...
## $ country <chr> "algeria", "algeria", "algeria", "algeria", "algeria",...
## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no...
## $ pop <int> 10800, 11016, 11236, 11460, 11690, 11923, 12267, 12622...
## $ gdp <int> 1723, 1599, 1275, 1517, 1589, 1584, 1548, 1600, 1758, ...
## $ sr <dbl> 19.9, 21.1, 15.0, 13.9, 10.6, 11.0, 8.3, 11.3, 15.1, 1...
We can merge both gasoline
and pwt
by country and year, as these two variables are common to
both datasets. There are more countries and years in the pwt
dataset, so when merging both, and
depending on which function you use, you will either have NA
’s for the variables where there is
no match, or rows that will be dropped. Let’s start with full_join
:
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
Let’s see which countries and years are included:
## # A tibble: 3,592 x 3
## country year n
## <chr> <int> <int>
## 1 algeria 1960 1
## 2 algeria 1961 1
## 3 algeria 1962 1
## 4 algeria 1963 1
## 5 algeria 1964 1
## 6 algeria 1965 1
## 7 algeria 1966 1
## 8 algeria 1967 1
## 9 algeria 1968 1
## 10 algeria 1969 1
## # ... with 3,582 more rows
As you see, every country and year was included, but what happened for, say, the U.S.S.R? This country
is in pwt
but not in gasoline
at all:
## # A tibble: 26 x 11
## country year lgaspcar lincomep lrpmg lcarpcap opec com pop gdp
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <int> <int>
## 1 u.s.s.… 1960 NA NA NA NA no yes 214400 2397
## 2 u.s.s.… 1961 NA NA NA NA no yes 217896 2542
## 3 u.s.s.… 1962 NA NA NA NA no yes 221449 2656
## 4 u.s.s.… 1963 NA NA NA NA no yes 225060 2681
## 5 u.s.s.… 1964 NA NA NA NA no yes 227571 2854
## 6 u.s.s.… 1965 NA NA NA NA no yes 230109 3049
## 7 u.s.s.… 1966 NA NA NA NA no yes 232676 3247
## 8 u.s.s.… 1967 NA NA NA NA no yes 235272 3454
## 9 u.s.s.… 1968 NA NA NA NA no yes 237896 3730
## 10 u.s.s.… 1969 NA NA NA NA no yes 240550 3808
## # ... with 16 more rows, and 1 more variable: sr <dbl>
As you probably guessed, the variables from gasoline
that are not included in pwt
are filled with
NA
s. One could remove all these lines and only keep countries for which these variables are not
NA
everywhere with filter()
, but there is a simpler solution:
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
Let’s use the tabyl()
from the janitor
packages which is a very nice alternative to the table()
function from base R:
## [1] country n percent
## <0 rows> (or 0-length row.names)
Only countries with values in both datasets were returned. It’s almost every country from gasoline
,
apart from Germany (called “germany west” in pwt
and “germany” in gasoline
. I left it as is to
provide an example of a country not in pwt
). Let’s also look at the variables:
## Observations: 0
## Variables: 11
## $ country <chr>
## $ year <int>
## $ lgaspcar <dbl>
## $ lincomep <dbl>
## $ lrpmg <dbl>
## $ lcarpcap <dbl>
## $ opec <fct>
## $ com <fct>
## $ pop <int>
## $ gdp <int>
## $ sr <dbl>
The variables from both datasets are in the joined data.
Contrast this to semi_join()
:
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
## Observations: 0
## Variables: 6
## $ country <fct>
## $ year <int>
## $ lgaspcar <dbl>
## $ lincomep <dbl>
## $ lrpmg <dbl>
## $ lcarpcap <dbl>
## country n percent
## AUSTRIA 0 NaN
## BELGIUM 0 NaN
## CANADA 0 NaN
## DENMARK 0 NaN
## FRANCE 0 NaN
## GERMANY 0 NaN
## GREECE 0 NaN
## IRELAND 0 NaN
## ITALY 0 NaN
## JAPAN 0 NaN
## NETHERLA 0 NaN
## NORWAY 0 NaN
## SPAIN 0 NaN
## SWEDEN 0 NaN
## SWITZERL 0 NaN
## TURKEY 0 NaN
## U.K. 0 NaN
## U.S.A. 0 NaN
Only columns of gasoline
are returned, and only rows of gasoline
that were matched with rows
from pwt
. semi_join()
is not a commutative operation:
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
## Observations: 0
## Variables: 7
## $ year <int>
## $ country <chr>
## $ opec <fct>
## $ com <fct>
## $ pop <int>
## $ gdp <int>
## $ sr <dbl>
## country n percent
## AUSTRIA 0 NaN
## BELGIUM 0 NaN
## CANADA 0 NaN
## DENMARK 0 NaN
## FRANCE 0 NaN
## GERMANY 0 NaN
## GREECE 0 NaN
## IRELAND 0 NaN
## ITALY 0 NaN
## JAPAN 0 NaN
## NETHERLA 0 NaN
## NORWAY 0 NaN
## SPAIN 0 NaN
## SWEDEN 0 NaN
## SWITZERL 0 NaN
## TURKEY 0 NaN
## U.K. 0 NaN
## U.S.A. 0 NaN
The rows are the same, but not the columns.
left_join()
and right_join()
return all the rows from either the dataset that is on the
“left” (the first argument of the fonction) or on the “right” (the second argument of the
function) but all columns from both datasets. So depending on which countries you’re interested in,
you’re going to use either one of these functions:
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
## country n percent
## AUSTRIA 19 0.05555556
## BELGIUM 19 0.05555556
## CANADA 19 0.05555556
## DENMARK 19 0.05555556
## FRANCE 19 0.05555556
## GERMANY 19 0.05555556
## GREECE 19 0.05555556
## IRELAND 19 0.05555556
## ITALY 19 0.05555556
## JAPAN 19 0.05555556
## NETHERLA 19 0.05555556
## NORWAY 19 0.05555556
## SPAIN 19 0.05555556
## SWEDEN 19 0.05555556
## SWITZERL 19 0.05555556
## TURKEY 19 0.05555556
## U.K. 19 0.05555556
## U.S.A. 19 0.05555556
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
## country n percent
## algeria 26 0.008
## angola 26 0.008
## argentina 26 0.008
## australia 26 0.008
## austria 26 0.008
## bangladesh 26 0.008
## barbados 26 0.008
## belgium 26 0.008
## benin 26 0.008
## bolivia 26 0.008
## botswana 26 0.008
## brazil 26 0.008
## burkina faso 26 0.008
## burundi 26 0.008
## cameroon 26 0.008
## canada 26 0.008
## cape verde is. 26 0.008
## central afr.r. 26 0.008
## chad 26 0.008
## chile 26 0.008
## china 26 0.008
## colombia 26 0.008
## comoros 26 0.008
## congo 26 0.008
## costa rica 26 0.008
## cyprus 26 0.008
## czechoslovakia 26 0.008
## denmark 26 0.008
## dominican rep. 26 0.008
## ecuador 26 0.008
## egypt 26 0.008
## el salvador 26 0.008
## ethiopia 26 0.008
## fiji 26 0.008
## finland 26 0.008
## france 26 0.008
## gabon 26 0.008
## gambia 26 0.008
## germany west 26 0.008
## ghana 26 0.008
## greece 26 0.008
## guatemala 26 0.008
## guinea 26 0.008
## guinea-biss 26 0.008
## guyana 26 0.008
## haiti 26 0.008
## honduras 26 0.008
## hong kong 26 0.008
## iceland 26 0.008
## india 26 0.008
## indonesia 26 0.008
## iran 26 0.008
## iraq 26 0.008
## ireland 26 0.008
## israel 26 0.008
## italy 26 0.008
## ivory coast 26 0.008
## jamaica 26 0.008
## japan 26 0.008
## jordan 26 0.008
## kenya 26 0.008
## korea 26 0.008
## lesotho 26 0.008
## liberia 26 0.008
## luxembourg 26 0.008
## madagascar 26 0.008
## malawi 26 0.008
## malaysia 26 0.008
## mali 26 0.008
## malta 26 0.008
## mauritania 26 0.008
## mauritius 26 0.008
## mexico 26 0.008
## morocco 26 0.008
## mozambique 26 0.008
## myanmar 26 0.008
## namibia 26 0.008
## nepal 26 0.008
## netherlands 26 0.008
## new zealand 26 0.008
## nicaragua 26 0.008
## niger 26 0.008
## nigeria 26 0.008
## norway 26 0.008
## pakistan 26 0.008
## panama 26 0.008
## papua n.guinea 26 0.008
## paraguay 26 0.008
## peru 26 0.008
## philippines 26 0.008
## portugal 26 0.008
## puerto rico 26 0.008
## reunion 26 0.008
## romania 26 0.008
## rwanda 26 0.008
## saudi arabia 26 0.008
## senegal 26 0.008
## seychelles 26 0.008
## singapore 26 0.008
## somalia 26 0.008
## south africa 26 0.008
## spain 26 0.008
## sri lanka 26 0.008
## suriname 26 0.008
## swaziland 26 0.008
## sweden 26 0.008
## switzerland 26 0.008
## syria 26 0.008
## taiwan 26 0.008
## tanzania 26 0.008
## thailand 26 0.008
## togo 26 0.008
## trinidad&tobago 26 0.008
## tunisia 26 0.008
## turkey 26 0.008
## u.k. 26 0.008
## u.s.a. 26 0.008
## u.s.s.r. 26 0.008
## uganda 26 0.008
## uruguay 26 0.008
## venezuela 26 0.008
## yugoslavia 26 0.008
## zaire 26 0.008
## zambia 26 0.008
## zimbabwe 26 0.008
The last merge function is anti_join()
:
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
## Observations: 342
## Variables: 6
## $ country <fct> AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA,...
## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,...
## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.0...
## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247...
## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.44...
## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739...
## country n percent
## AUSTRIA 19 0.05555556
## BELGIUM 19 0.05555556
## CANADA 19 0.05555556
## DENMARK 19 0.05555556
## FRANCE 19 0.05555556
## GERMANY 19 0.05555556
## GREECE 19 0.05555556
## IRELAND 19 0.05555556
## ITALY 19 0.05555556
## JAPAN 19 0.05555556
## NETHERLA 19 0.05555556
## NORWAY 19 0.05555556
## SPAIN 19 0.05555556
## SWEDEN 19 0.05555556
## SWITZERL 19 0.05555556
## TURKEY 19 0.05555556
## U.K. 19 0.05555556
## U.S.A. 19 0.05555556
gas_pwt_anti
has the columns the gasoline
dataset as well as the only country from gasoline
that is not in pwt
: “germany”.
That was it for the basic {dplyr}
verbs. Next, we’re going to learn about {tidyr}
.
5.4 Reshaping data with tidyr
Another important package from the tidyverse
that goes hand in hand with dplyr
is tidyr
. tidyr
is the package you need when it’s time to reshape data. The basic functions from tidyr
, spread()
and gather()
make it possible to go from long to wide datasets respectively.
5.4.1 spread()
and gather()
Let’s first create a fake dataset:
survey_data <- tribble(
~id, ~variable, ~value,
1, "var1", 1,
1, "var2", 0.2,
NA, "var3", 0.3,
2, "var1", 1.4,
2, "var2", 1.9,
2, "var3", 4.1,
3, "var1", 0.1,
3, "var2", 2.8,
3, "var3", 8.9,
4, "var1", 1.7,
NA, "var2", 1.9,
4, "var3", 7.6
)
head(survey_data)
## # A tibble: 6 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
I used the tribble()
function from the {tibble}
package to create this fake dataset.
I’ll discuss this package later, for now, let’s focus on {tidyr}.
survey_data
is a long dataset. We can reshape it to be wide using the spread()
function:
## # A tibble: 5 x 4
## id var1 var2 var3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.2 NA
## 2 2 1.4 1.9 4.1
## 3 3 0.1 2.8 8.9
## 4 4 1.7 NA 7.6
## 5 NA NA 1.9 0.3
This means that we spread the column called “variable”, which will produce one column per category of “variable”. Then we fill in the rows with the data contained in the column “value”.
To go from a wide dataset to a long one, we use gather()
:
## # A tibble: 10 x 4
## id var3 variable value
## <dbl> <dbl> <chr> <dbl>
## 1 1 NA var1 1
## 2 2 4.1 var1 1.4
## 3 3 8.9 var1 0.1
## 4 4 7.6 var1 1.7
## 5 NA 0.3 var1 NA
## 6 1 NA var2 0.2
## 7 2 4.1 var2 1.9
## 8 3 8.9 var2 2.8
## 9 4 7.6 var2 NA
## 10 NA 0.3 var2 1.9
long_data
and survey_data
are the same datasets, but in a different order.
In the wide_data
tibble
, we had 3 columns: id
, var1
and var2
. We want to stack ‘var1’ and
‘var2’ in a new column, that we choose to call “variable”. This is the “key”. For the value, we are
using the values contained in var1
and var2
. Sometimes using spread()
or gather()
requires
some trial and error. I advise you play around with the examples above to really grasp how these
powerful functions work.
5.4.2 fill()
and full_seq()
fill()
is pretty useful to… fill in missing values. For instance, in survey_data
, some “id”s
are missing:
## # A tibble: 12 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
It seems pretty obvious that the first NA
is supposed to be 1
and the second missing is supposed
to be 4
. With fill()
, this is pretty easy to achieve:
## # A tibble: 12 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 1 var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 4 var2 1.9
## 12 4 var3 7.6
full_seq()
is similar:
## [1] "2018-08-01" "2018-08-02" "2018-08-03"
We can add this as the date column to our survey data:
## # A tibble: 12 x 4
## id variable value date
## <dbl> <chr> <dbl> <date>
## 1 1 var1 1 2018-08-01
## 2 1 var2 0.2 2018-08-02
## 3 NA var3 0.3 2018-08-03
## 4 2 var1 1.4 2018-08-01
## 5 2 var2 1.9 2018-08-02
## 6 2 var3 4.1 2018-08-03
## 7 3 var1 0.1 2018-08-01
## 8 3 var2 2.8 2018-08-02
## 9 3 var3 8.9 2018-08-03
## 10 4 var1 1.7 2018-08-01
## 11 NA var2 1.9 2018-08-02
## 12 4 var3 7.6 2018-08-03
I use the base rep()
function to repeat the date 4 times and then using mutate()
I have added
it the data frame.
Putting all these operations together:
survey_data %>%
fill(.direction = "down", variable = "id") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
spread(variable, value)
## # A tibble: 12 x 5
## id date var1 var2 var3
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1 2018-08-01 1 NA NA
## 2 1 2018-08-02 NA 0.2 NA
## 3 1 2018-08-03 NA NA 0.3
## 4 2 2018-08-01 1.4 NA NA
## 5 2 2018-08-02 NA 1.9 NA
## 6 2 2018-08-03 NA NA 4.1
## 7 3 2018-08-01 0.1 NA NA
## 8 3 2018-08-02 NA 2.8 NA
## 9 3 2018-08-03 NA NA 8.9
## 10 4 2018-08-01 1.7 NA NA
## 11 4 2018-08-02 NA 1.9 NA
## 12 4 2018-08-03 NA NA 7.6
As you can see, this creates a lot of explicit NA
values. The best would be to fill in the missing
values and add the date column, and then work with that format. For example, to get the average
of the values by date:
survey_data %>%
fill(.direction = "down", variable = "id") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
group_by(date) %>%
summarise(mean(value))
## # A tibble: 3 x 2
## date `mean(value)`
## <date> <dbl>
## 1 2018-08-01 1.05
## 2 2018-08-02 1.7
## 3 2018-08-03 5.22
Or by date
and variable
:
survey_data %>%
fill(.direction = "down", variable = "id") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
group_by(date, variable) %>%
summarise(mean(value))
## # A tibble: 3 x 3
## # Groups: date [?]
## date variable `mean(value)`
## <date> <chr> <dbl>
## 1 2018-08-01 var1 1.05
## 2 2018-08-02 var2 1.7
## 3 2018-08-03 var3 5.22
As you can see, you can chain any {tidyverse}
verbs, wether they come from {dplyr}
or {tidyr}
.
5.4.3 Put order in your columns with separate()
, unite()
, and in your rows with separate_rows()
Sometimes, data can be in a format that makes working with it needlessly painful. For example, you get this:
## # A tibble: 12 x 3
## id variable_date value
## <dbl> <chr> <dbl>
## 1 1 var1/2018-08-01 1
## 2 1 var2/2018-08-02 0.2
## 3 1 var3/2018-08-03 0.3
## 4 2 var1/2018-08-01 1.4
## 5 2 var2/2018-08-02 1.9
## 6 2 var3/2018-08-03 4.1
## 7 3 var1/2018-08-01 0.1
## 8 3 var2/2018-08-02 2.8
## 9 3 var3/2018-08-03 8.9
## 10 4 var1/2018-08-01 1.7
## 11 4 var2/2018-08-02 1.9
## 12 4 var3/2018-08-03 7.6
Dealing with this is simple, thanks to separate()
:
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 var1 2018-08-01 1
## 2 1 var2 2018-08-02 0.2
## 3 1 var3 2018-08-03 0.3
## 4 2 var1 2018-08-01 1.4
## 5 2 var2 2018-08-02 1.9
## 6 2 var3 2018-08-03 4.1
## 7 3 var1 2018-08-01 0.1
## 8 3 var2 2018-08-02 2.8
## 9 3 var3 2018-08-03 8.9
## 10 4 var1 2018-08-01 1.7
## 11 4 var2 2018-08-02 1.9
## 12 4 var3 2018-08-03 7.6
The variable_date
column gets separated into two columns, variable
and date
. One also needs
to specify the separator, in this case “/”.
unite()
is the reverse operation, which can be useful when you are confronted to this situation:
## # A tibble: 12 x 6
## id variable year month day value
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 1 var1 2018 08 01 1
## 2 1 var2 2018 08 02 0.2
## 3 1 var3 2018 08 03 0.3
## 4 2 var1 2018 08 01 1.4
## 5 2 var2 2018 08 02 1.9
## 6 2 var3 2018 08 03 4.1
## 7 3 var1 2018 08 01 0.1
## 8 3 var2 2018 08 02 2.8
## 9 3 var3 2018 08 03 8.9
## 10 4 var1 2018 08 01 1.7
## 11 4 var2 2018 08 02 1.9
## 12 4 var3 2018 08 03 7.6
In some situation, it is better to have the date as a single column:
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 var1 2018-08-01 1
## 2 1 var2 2018-08-02 0.2
## 3 1 var3 2018-08-03 0.3
## 4 2 var1 2018-08-01 1.4
## 5 2 var2 2018-08-02 1.9
## 6 2 var3 2018-08-03 4.1
## 7 3 var1 2018-08-01 0.1
## 8 3 var2 2018-08-02 2.8
## 9 3 var3 2018-08-03 8.9
## 10 4 var1 2018-08-01 1.7
## 11 4 var2 2018-08-02 1.9
## 12 4 var3 2018-08-03 7.6
Another awful situation is the following:
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1, var2, var3 1.4, 1.9, 4.1
## 5 3 var1, var2 0.1, 2.8
## 6 3 var3 8.9
## 7 4 var1 1.7
## 8 NA var2 1.9
## 9 4 var3 7.6
separate_rows()
saves the day:
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
So to summarise… you can go from this:
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1, var2, var3 1.4, 1.9, 4.1
## 5 3 var1, var2 0.1, 2.8
## 6 3 var3 8.9
## 7 4 var1 1.7
## 8 NA var2 1.9
## 9 4 var3 7.6
to this:
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 var1 2018-08-01 1
## 2 1 var2 2018-08-02 0.2
## 3 1 var3 2018-08-03 0.3
## 4 2 var1 2018-08-01 1.4
## 5 2 var2 2018-08-02 1.9
## 6 2 var3 2018-08-03 4.1
## 7 3 var1 2018-08-01 0.1
## 8 3 var2 2018-08-02 2.8
## 9 3 var3 2018-08-03 8.9
## 10 4 var1 2018-08-01 1.7
## 11 4 var2 2018-08-02 1.9
## 12 4 var3 2018-08-03 7.6
quite easily:
survey_data_from_hell %>%
separate_rows(variable, value, convert = TRUE) %>%
fill(.direction = "down", variable = "id") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4))
## id variable value date
## 1 1 var1 1.0 2018-08-01
## 2 1 var2 0.2 2018-08-02
## 3 1 var3 0.3 2018-08-03
## 4 2 var1 1.4 2018-08-01
## 5 2 var2 1.9 2018-08-02
## 6 2 var3 4.1 2018-08-03
## 7 3 var1 0.1 2018-08-01
## 8 3 var2 2.8 2018-08-02
## 9 3 var3 8.9 2018-08-03
## 10 4 var1 1.7 2018-08-01
## 11 4 var2 1.9 2018-08-02
## 12 4 var3 7.6 2018-08-03
5.4.4 A sneak peek to Chapter 9; nest()
Let’s take a look at our clean survey data:
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 var1 2018-08-01 1
## 2 1 var2 2018-08-02 0.2
## 3 1 var3 2018-08-03 0.3
## 4 2 var1 2018-08-01 1.4
## 5 2 var2 2018-08-02 1.9
## 6 2 var3 2018-08-03 4.1
## 7 3 var1 2018-08-01 0.1
## 8 3 var2 2018-08-02 2.8
## 9 3 var3 2018-08-03 8.9
## 10 4 var1 2018-08-01 1.7
## 11 4 var2 2018-08-02 1.9
## 12 4 var3 2018-08-03 7.6
{tidyr}
has another very useful function, called nest()
:
## # A tibble: 4 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [3 × 3]>
## 2 2 <tibble [3 × 3]>
## 3 3 <tibble [3 × 3]>
## 4 4 <tibble [3 × 3]>
You can achieve the same result by using group_by()
first:
## # A tibble: 4 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [3 × 3]>
## 2 2 <tibble [3 × 3]>
## 3 3 <tibble [3 × 3]>
## 4 4 <tibble [3 × 3]>
This cerates a new tibble object, whith two columns, one with the id
column, and a new one called
data
. Let’s take a look at the first element of this column:
## [[1]]
## # A tibble: 3 x 3
## variable date value
## <chr> <chr> <dbl>
## 1 var1 2018-08-01 1
## 2 var2 2018-08-02 0.2
## 3 var3 2018-08-03 0.3
As you can see, the first element of the data
is also a tibble! You may wonder why this is useful;
I will give you a small taste of what’s waiting in chapter 9. Imagine that you want to create a
barplot for each id
. For instance, for the first id
(we are going to learn about making plots
with {ggplot2}
in the next chapter. For now, just follow along, even if you don’t understand
everything I write):
Now, let’s create the plot:
ggplot(data = survey_data_id1) +
geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
ggtitle("id 1")
Ok great. But now I want to create this plot for each id
… so I have to copy paste this 3 times.
But copy-pasting is error prone. So there are two alternatives; I either write a function that
takes as argument the data and the id
I want to plot and run it for each id
(we will learn to
do this in Chapter 8), or I can use tidyr::nest()
, combined with purrr::map()
. {purrr}
is
another very useful {tidevyrse}
package, and we are going to learn about it in Chapter 9. Again,
just follow along for now:
my_plots <- nested_survey_data %>%
mutate(plot = map2(.x = id,
.y = data,
~ggplot(data = .y) +
geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
ggtitle(paste0("id", .x))))
This is some very advanced stuff, and again, do not worry if you don’t understand everything now.
We are going to learn about this in detail in Chapter 9. Let’s go through each line.
In the first line, I have started from my clean data nested_survey_data
and then, using the %>%
and mutate()
I create a new column called plot
. Inside the mutate()
function, I called map2
.
map2
is a {purrr}
function that takes three inputs: .x
, .y
and a function. .x
is the id
column from my data, and .y
is the data
column from nested_survey_data
. The function is the
ggplot I created before. Think of map2()
has a loop over two lists, all while applying a function.
The following illustration, taken from Vaudor (2018) by Lise Vaudor, illustrates this perfectly:
Two inputs go in at the same type, the factory, your function, does what it has to do, and an
output comes out. Forget about the factory’s chimney, which represents walk2()
for now. We’ll
learn about it in due time. By the way, you really should read Lise’s blog, her posts are really
great and informative. It’s in French, but that’s not a problem, you know how to read R code, right?
Here’s a link to her blog.
Now, let’s take a look at the result:
## # A tibble: 4 x 3
## id data plot
## <dbl> <list> <list>
## 1 1 <tibble [3 × 3]> <S3: gg>
## 2 2 <tibble [3 × 3]> <S3: gg>
## 3 3 <tibble [3 × 3]> <S3: gg>
## 4 4 <tibble [3 × 3]> <S3: gg>
my_plots
is a tibble with three columns: id
, data
and.. plot
! plot
is a very interesting
column. It is a list, where each element is of type S3: gg
. Yes, you guessed it, each element of
that list-column is a ggplot! If you now want to take a look at the plots, it is enough to do this:
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
Ok, that was quite complicated, but again, this was only to introduce nest()
and give you a taste
of the power of the {tidyverse}
. By the end of Chapter 9, you’ll be used to this.
That was it for a first introduction to {tidyr}
. Now, it’s time to learn about scoped verbs.
5.5 Scoped {tidyverse}
verbs
Scoped verbs are special versions of the verbs you are now familiar with.
5.5.1 filter_*()
Let’s go back to the gasoline
data from the {plm}
package.
filter()
is not the only filtering verb there is. Suppose that we have a condition that we want
to use to filter out a lot of columns at once. For example, for every column that is of type
numeric
, keep only the lines where the condition value > -8 is satisfied. The next line does
that:
## # A tibble: 30 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CANADA 1972 4.89 -5.44 -1.10 -7.99
## 2 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 3 CANADA 1974 4.89 -5.42 -1.12 -7.90
## 4 CANADA 1975 4.89 -5.38 -1.19 -7.87
## 5 CANADA 1976 4.84 -5.36 -1.06 -7.81
## 6 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 7 CANADA 1978 4.86 -5.31 -1.07 -7.79
## 8 GERMANY 1978 3.88 -5.56 -0.628 -7.95
## 9 SWEDEN 1975 3.97 -7.68 -2.77 -7.99
## 10 SWEDEN 1976 3.98 -7.67 -2.82 -7.96
## # ... with 20 more rows
This is a bit more complicated than before, but let’s go through it, step by step.
filter_if()
needs 3 arguments to work; the data, a
predicate function (a function that returns TRUE
, or FALSE
) which will select the columns we
want to work on, and then the condition. The condition can be applied to all the columns that
were selected by the predicate function (hence the all_vars()
) or only to at least one (you’d use
any_vars()
then). Try to change the condition, or the predicate function, to figure out how
filter_if()
works. The dot is a placeholder that stands for whatever columns where selected.
Another scoped filter()
verb is filter_at()
; it allows the user to filter columns by position:
## # A tibble: 30 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CANADA 1972 4.89 -5.44 -1.10 -7.99
## 2 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 3 CANADA 1974 4.89 -5.42 -1.12 -7.90
## 4 CANADA 1975 4.89 -5.38 -1.19 -7.87
## 5 CANADA 1976 4.84 -5.36 -1.06 -7.81
## 6 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 7 CANADA 1978 4.86 -5.31 -1.07 -7.79
## 8 GERMANY 1978 3.88 -5.56 -0.628 -7.95
## 9 SWEDEN 1975 3.97 -7.68 -2.77 -7.99
## 10 SWEDEN 1976 3.98 -7.67 -2.82 -7.96
## # ... with 20 more rows
we already know about ends_with()
and starts_with()
. So the above line means “for the columns
whose name end with a ‘p’ only keep the lines where, for all the selected columns, the values are
strictly superior to -8
”. Again, this is not very easy the first time you deal with that, so
play around with it for a bit.
Finally, there is also filter_all()
which, as the name implies, uses all the variables for
the filtering step.
filter_if()
and filter_at()
are very useful when you have very large datasets with a lot of
variables and you want to apply a filtering function only to a subset of them. filter_all()
is
useful if, for example, you only want to keep the positive values for all the columns.
5.5.2 select_*() and rename_*()
Just as with filter()
, select()
also has scoped versions. select_if()
makes it easy to
select columns that satisfy a criterium:
## # A tibble: 342 x 5
## year lgaspcar lincomep lrpmg lcarpcap
## * <int> <dbl> <dbl> <dbl> <dbl>
## 1 1960 4.17 -6.47 -0.335 -9.77
## 2 1961 4.10 -6.43 -0.351 -9.61
## 3 1962 4.07 -6.41 -0.380 -9.46
## 4 1963 4.06 -6.37 -0.414 -9.34
## 5 1964 4.04 -6.32 -0.445 -9.24
## 6 1965 4.03 -6.29 -0.497 -9.12
## 7 1966 4.05 -6.25 -0.467 -9.02
## 8 1967 4.05 -6.23 -0.506 -8.93
## 9 1968 4.05 -6.21 -0.522 -8.85
## 10 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
You can even pass a further function to select_if()
that will be applied to the selected columns:
## # A tibble: 342 x 5
## YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP
## * <int> <dbl> <dbl> <dbl> <dbl>
## 1 1960 4.17 -6.47 -0.335 -9.77
## 2 1961 4.10 -6.43 -0.351 -9.61
## 3 1962 4.07 -6.41 -0.380 -9.46
## 4 1963 4.06 -6.37 -0.414 -9.34
## 5 1964 4.04 -6.32 -0.445 -9.24
## 6 1965 4.03 -6.29 -0.497 -9.12
## 7 1966 4.05 -6.25 -0.467 -9.02
## 8 1967 4.05 -6.23 -0.506 -8.93
## 9 1968 4.05 -6.21 -0.522 -8.85
## 10 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
select_at()
makes it easy to select all the variables that start with “l” for example:
## # A tibble: 342 x 4
## lgaspcar lincomep lrpmg lcarpcap
## * <dbl> <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -0.335 -9.77
## 2 4.10 -6.43 -0.351 -9.61
## 3 4.07 -6.41 -0.380 -9.46
## 4 4.06 -6.37 -0.414 -9.34
## 5 4.04 -6.32 -0.445 -9.24
## 6 4.03 -6.29 -0.497 -9.12
## 7 4.05 -6.25 -0.467 -9.02
## 8 4.05 -6.23 -0.506 -8.93
## 9 4.05 -6.21 -0.522 -8.85
## 10 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
select_at()
also works if you specify the position of the columns you’re interested in:
## # A tibble: 342 x 3
## country year lrpmg
## * <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
Notice that I use a new helper function, vars()
, which allows me to specify which variables I want
to select. This is similar to the helper functions we have seen before, all_vars()
and any_vars()
. So the way to read the above line would be something like “Start with the
gasoline data, then select the variables at position 1, 2, 5”. Knowing how to read your code makes
it even easier to write.
There is also a select_all()
, and you might be wondering why it is useful:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
This simply returns all the columns of the data frame… but wait, selec_all()
allows you to do
something very useful, which is rename all the columns in bulk:
## # A tibble: 342 x 6
## COUNTRY YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
You can apply any arbitrary function:
## # A tibble: 342 x 6
## hello_country hello_year hello_lgaspcar hello_lincomep hello_lrpmg
## * <fct> <int> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335
## 2 AUSTRIA 1961 4.10 -6.43 -0.351
## 3 AUSTRIA 1962 4.07 -6.41 -0.380
## 4 AUSTRIA 1963 4.06 -6.37 -0.414
## 5 AUSTRIA 1964 4.04 -6.32 -0.445
## 6 AUSTRIA 1965 4.03 -6.29 -0.497
## 7 AUSTRIA 1966 4.05 -6.25 -0.467
## 8 AUSTRIA 1967 4.05 -6.23 -0.506
## 9 AUSTRIA 1968 4.05 -6.21 -0.522
## 10 AUSTRIA 1969 4.05 -6.15 -0.559
## # ... with 332 more rows, and 1 more variable: hello_lcarpcap <dbl>
I have passed the paste0()
function to the funs()
helper function of select_all()
. funs()
is
used a lot with scoped verbs, just like vars()
, which we have already seen.
The scoped version of rename()
work just as you expect. But I’ll leave these for the exercices
down below.
5.5.3 group_by_*()
To illustrate group_by_*()
I have to first modify the gasoline
data a little bit. As you can
see below, the data column is of type integer:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
Let’s change that to character:
Now, this allows me to do the following:
## # A tibble: 19 x 2
## year `mean(lincomep)`
## <chr> <dbl>
## 1 1960 -6.50
## 2 1961 -6.46
## 3 1962 -6.42
## 4 1963 -6.37
## 5 1964 -6.33
## 6 1965 -6.29
## 7 1966 -6.25
## 8 1967 -6.21
## 9 1968 -6.18
## 10 1969 -6.12
## 11 1970 -6.07
## 12 1971 -6.04
## 13 1972 -5.99
## 14 1973 -5.94
## 15 1974 -5.93
## 16 1975 -5.93
## 17 1976 -5.89
## 18 1977 -5.87
## 19 1978 -5.84
This is faster than having to write:
## # A tibble: 342 x 3
## # Groups: country [?]
## country year `mean(lincomep)`
## <fct> <chr> <dbl>
## 1 AUSTRIA 1960 -6.47
## 2 AUSTRIA 1961 -6.43
## 3 AUSTRIA 1962 -6.41
## 4 AUSTRIA 1963 -6.37
## 5 AUSTRIA 1964 -6.32
## 6 AUSTRIA 1965 -6.29
## 7 AUSTRIA 1966 -6.25
## 8 AUSTRIA 1967 -6.23
## 9 AUSTRIA 1968 -6.21
## 10 AUSTRIA 1969 -6.15
## # ... with 332 more rows
You may think that having two write the name of two variables is not a huge hassle, which is true.
But imagine that you have dozens of character columns that you want to group by. This is the kind
of situation where group_by_if()
is very useful. If you prefer, you can specify the position of
the columns instead of the name of the columns, with group_by_at()
:
## # A tibble: 342 x 3
## # Groups: country [?]
## country year `mean(lincomep)`
## <fct> <chr> <dbl>
## 1 AUSTRIA 1960 -6.47
## 2 AUSTRIA 1961 -6.43
## 3 AUSTRIA 1962 -6.41
## 4 AUSTRIA 1963 -6.37
## 5 AUSTRIA 1964 -6.32
## 6 AUSTRIA 1965 -6.29
## 7 AUSTRIA 1966 -6.25
## 8 AUSTRIA 1967 -6.23
## 9 AUSTRIA 1968 -6.21
## 10 AUSTRIA 1969 -6.15
## # ... with 332 more rows
There is also a group_by_all()
, but I fail to see the use case…
5.5.4 summarise_()
Just like for filter()
, select()
, and group_by()
, summarise()` comes with scoped versions:
## # A tibble: 18 x 5
## country lgaspcar lincomep lrpmg lcarpcap
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63
## 3 CANADA 4.86 -5.58 -1.05 -8.08
## 4 DENMARK 4.19 -5.76 -0.358 -8.58
## 5 FRANCE 3.82 -5.87 -0.253 -8.45
## 6 GERMANY 3.89 -5.85 -0.517 -8.51
## 7 GREECE 4.88 -6.61 -0.0339 -10.8
## 8 IRELAND 4.23 -6.44 -0.348 -9.04
## 9 ITALY 3.73 -6.35 -0.152 -8.83
## 10 JAPAN 4.70 -6.25 -0.287 -9.95
## 11 NETHERLA 4.08 -5.92 -0.370 -8.82
## 12 NORWAY 4.11 -5.75 -0.278 -8.77
## 13 SPAIN 4.06 -5.63 0.739 -9.90
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25
## 15 SWITZERL 4.24 -5.93 -0.902 -8.54
## 16 TURKEY 5.77 -7.34 -0.422 -12.5
## 17 U.K. 3.98 -6.02 -0.459 -8.55
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78
See how I managed to summarise every variable in one simple call to summarise_at()
? Simply by
using vars()
and specifying that I was interested in the ones that started with “l” and then I
specified the function I wanted. But what if I wanted to use more than one function to summarise
the data? Very easy:
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_max <dbl>, lincomep_max <dbl>,
## # lrpmg_max <dbl>, lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## # lincomep_min <dbl>, lrpmg_min <dbl>, lcarpcap_min <dbl>
Just use vars()
to specify the variables you want to summarise, and then funs()
to list all the
functions you want to use on each of the columns.
But maybe you’re just interested in descriptive statistics for some variables, but not all those
that start with “l”? What if you want to use another pattern? Easy to do with the contains()
helper:
gasoline %>%
group_by(country) %>%
summarise_at(vars(dplyr::contains("car")), funs(mean, sd, max, min))
## # A tibble: 18 x 9
## country lgaspcar_mean lcarpcap_mean lgaspcar_sd lcarpcap_sd lgaspcar_max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -8.85 0.0693 0.473 4.20
## 2 BELGIUM 3.92 -8.63 0.103 0.417 4.16
## 3 CANADA 4.86 -8.08 0.0262 0.195 4.90
## 4 DENMARK 4.19 -8.58 0.158 0.349 4.50
## 5 FRANCE 3.82 -8.45 0.0499 0.344 3.91
## 6 GERMANY 3.89 -8.51 0.0239 0.406 3.93
## 7 GREECE 4.88 -10.8 0.255 0.839 5.38
## 8 IRELAND 4.23 -9.04 0.0437 0.345 4.33
## 9 ITALY 3.73 -8.83 0.220 0.639 4.05
## 10 JAPAN 4.70 -9.95 0.684 1.20 6.00
## 11 NETHER… 4.08 -8.82 0.286 0.617 4.65
## 12 NORWAY 4.11 -8.77 0.123 0.438 4.44
## 13 SPAIN 4.06 -9.90 0.317 0.960 4.75
## 14 SWEDEN 4.01 -8.25 0.0364 0.242 4.07
## 15 SWITZE… 4.24 -8.54 0.102 0.378 4.44
## 16 TURKEY 5.77 -12.5 0.329 0.751 6.16
## 17 U.K. 3.98 -8.55 0.0479 0.281 4.10
## 18 U.S.A. 4.82 -7.78 0.0219 0.162 4.86
## # ... with 3 more variables: lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## # lcarpcap_min <dbl>
I used dplyr::contains()
instead of simply contains()
because there’s also a
purrr::contains()
. If you load purrr
after dplyr
, contains()
will actually be
purrr::contains()
and not dplyr::contains()
which causes the above code to fail.
There’s also summarise_if()
:
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## # lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## # lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>
This allows you to summarise every column that contain real numbers. The difference between
is.double()
and is.numeric()
is that is.numeric()
returns TRUE
for integers too, whereas
is.double()
returns TRUE
for real numbers only (integers are real numbers too, but you know
what I mean). To go faster, you can also use summarise_all()
:
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # ... with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## # lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## # lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>
I removed the year
variable because it’s not a variable for which we want to have descriptive
statistics.
5.5.5 mutate_*()
Of course, mutate()
and transmute()
also come with scoped versions:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 64.9 0.00154 0.716 0.0000573
## 2 AUSTRIA 1961 60.4 0.00162 0.704 0.0000671
## 3 AUSTRIA 1962 58.7 0.00165 0.684 0.0000781
## 4 AUSTRIA 1963 57.9 0.00171 0.661 0.0000876
## 5 AUSTRIA 1964 56.7 0.00180 0.641 0.0000973
## 6 AUSTRIA 1965 56.5 0.00185 0.608 0.000109
## 7 AUSTRIA 1966 57.3 0.00193 0.627 0.000121
## 8 AUSTRIA 1967 57.6 0.00196 0.603 0.000132
## 9 AUSTRIA 1968 57.1 0.00202 0.593 0.000144
## 10 AUSTRIA 1969 57.2 0.00213 0.572 0.000152
## # ... with 332 more rows
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 64.9 0.00154 0.716 0.0000573
## 2 AUSTRIA 1961 60.4 0.00162 0.704 0.0000671
## 3 AUSTRIA 1962 58.7 0.00165 0.684 0.0000781
## 4 AUSTRIA 1963 57.9 0.00171 0.661 0.0000876
## 5 AUSTRIA 1964 56.7 0.00180 0.641 0.0000973
## 6 AUSTRIA 1965 56.5 0.00185 0.608 0.000109
## 7 AUSTRIA 1966 57.3 0.00193 0.627 0.000121
## 8 AUSTRIA 1967 57.6 0.00196 0.603 0.000132
## 9 AUSTRIA 1968 57.1 0.00202 0.593 0.000144
## 10 AUSTRIA 1969 57.2 0.00213 0.572 0.000152
## # ... with 332 more rows
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AUSTRIA 1960 4.173244195 -6.474277179 -0.334547613 -9.766839569
## 2 AUSTRIA 1961 4.1009891049 -6.426005835 -0.351327614 -9.608621845
## 3 AUSTRIA 1962 4.0731765511 -6.407308295 -0.379517692 -9.457256552
## 4 AUSTRIA 1963 4.0595091239 -6.370678539 -0.414251392 -9.343154947
## 5 AUSTRIA 1964 4.037688787 -6.322246805 -0.445335362 -9.237739346
## 6 AUSTRIA 1965 4.033983285 -6.294667914 -0.497060662 -9.123903477
## 7 AUSTRIA 1966 4.0475365589 -6.252545451 -0.466837731 -9.019822048
## 8 AUSTRIA 1967 4.0529106939 -6.234580709 -0.505883405 -8.934402537
## 9 AUSTRIA 1968 4.045507048 -6.206894403 -0.522412545 -8.847967407
## 10 AUSTRIA 1969 4.0463547891 -6.153139668 -0.559110514 -8.788686207
## # ... with 332 more rows
I think that by now, you are able to understand the above lines quite easily. If not, try some other functions, or mutating other variables and you’ll see that it is not that complicated!
5.6 Other useful {tidyverse}
functions
5.6.1 if_else()
, case_when()
and recode()
Some other very useful {tidyverse}
functions are if_else()
and case_when
. These two
functions, combined with mutate()
make it easy to create a new variable whose values must
respect certain conditions. For instance, we might want to have a dummy that equals 1
if a country
in the European Union (to simplify, say as of 2017) and 0
if not. First let’s create a list of
countries that are in the EU:
eu_countries <- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus",
"czech republic", "denmark", "estonia", "finland", "france", "germany",
"greece", "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg",
"malta", "netherla", "poland", "portugal", "romania", "slovakia", "slovenia",
"spain", "sweden", "u.k.")
I’ve had to change “netherlands” to “netherla” because that’s how the country is called in the
gasoline
data. Now let’s create a dummy variable that equals 1
for EU countries, and 0
for the others:
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(in_eu = if_else(country %in% eu_countries, 1, 0))
## # A tibble: 342 x 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 1
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 1
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 1
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 1
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 1
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 1
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 1
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 1
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 1
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 1
## # ... with 332 more rows
Instead of 1
and 0
, we can of course use strings (I add filter(year == 1960)
at the end to
have a better view of what happened):
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(in_eu = if_else(country %in% eu_countries, "yes", "no")) %>%
filter(year == 1960)
## # A tibble: 18 x 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 yes
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 yes
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 no
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 yes
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 yes
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 yes
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 yes
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 yes
## 9 italy 1960 4.05 -6.73 0.165 -10.1 yes
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 no
## 11 netherla 1960 4.65 -6.22 -0.201 -10.00 yes
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 no
## 13 spain 1960 4.75 -6.17 1.13 -11.6 yes
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 yes
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 no
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 no
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 yes
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 no
I think that if_else()
is fairly straightforward, especially if you know ifelse()
already. You
might be wondering what is the difference between these two. if_else()
is stricter than
ifelse()
and does not do type conversion. Compare the two next lines:
## [1] "0"
Type conversion, especially without a warning is very dangerous. if_else()
’s behaviour which
consists in failing as soon as possble avoids a lot of pain and suffering, especially when
programming non-interactively.
if_else()
also accepts an optional argument, that allows you to specify what should be returned
in case of NA
:
## [1] 999
## [1] NA
case_when()
can be seen as a generalization of if_else()
. Whenever you want to use multiple
if_else()
s, that’s when you know you should use case_when()
(I’m adding the filter at the end
for the same reason as before, to see the output better):
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(region = case_when(
country %in% c("france", "italy", "turkey", "greece", "spain") ~ "mediterranean",
country %in% c("germany", "austria", "switzerl", "belgium", "netherla") ~ "central europe",
country %in% c("canada", "u.s.a.", "u.k.", "ireland") ~ "anglosphere",
country %in% c("denmark", "norway", "sweden") ~ "nordic",
country %in% c("japan") ~ "asia")) %>%
filter(year == 1960)
## # A tibble: 18 x 7
## country year lgaspcar lincomep lrpmg lcarpcap region
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 central europe
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 central europe
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 anglosphere
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 nordic
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 mediterranean
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 central europe
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 mediterranean
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 anglosphere
## 9 italy 1960 4.05 -6.73 0.165 -10.1 mediterranean
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 asia
## 11 netherla 1960 4.65 -6.22 -0.201 -10.00 central europe
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 nordic
## 13 spain 1960 4.75 -6.17 1.13 -11.6 mediterranean
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 nordic
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 central europe
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 mediterranean
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 anglosphere
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 anglosphere
If all you want is to recode values, you can use recode()
. For example, the Netherlands is
written as “NETHERLA” in the gasoline
data, which is quite ugly. Same for Switzerland:
gasoline <- gasoline %>%
mutate(country = tolower(country)) %>%
mutate(country = recode(country, "netherla" = "netherlands", "switzerl" = "switzerland"))
I saved the data with these changes as they will become useful in the future. Let’s take a look at the data:
## # A tibble: 2 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 netherlands 1960 4.65 -6.22 -0.201 -10.00
## 2 switzerland 1960 4.40 -6.16 -0.823 -9.26
5.6.2 lead()
and lag()
lead()
and lag()
are especially useful in econometrics. When I was doing my masters, in 4 B.d.
(Before dplyr) lagging variables in panel data was quite tricky. Now, with dplyr
it’s really
very easy:
gasoline %>%
group_by(country) %>%
mutate(lag_lgaspcar = lag(lgaspcar)) %>%
mutate(lead_lgaspcar = lead(lgaspcar)) %>%
filter(year %in% seq(1960, 1963))
## # A tibble: 72 x 8
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap lag_lgaspcar
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 NA
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 4.17
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 4.10
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 4.07
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 NA
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 4.16
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 4.12
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 4.08
## 9 canada 1960 4.86 -5.89 -0.972 -8.38 NA
## 10 canada 1961 4.83 -5.88 -0.972 -8.35 4.86
## # ... with 62 more rows, and 1 more variable: lead_lgaspcar <dbl>
To lag every variable, remember that you can use mutate_if()
:
## # A tibble: 72 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1961 4.83 -5.88 -0.972 -8.35
## # ... with 62 more rows
you can replace lag()
with lead()
, but just keep in mind that the columns get transformed in
place.
5.6.3 ntile()
The last helper function I will discuss is ntile()
. There are some other, so do read mutate()
’s
documentation with help(mutate)
!
If you need quantiles, you need ntile()
. Let’s see how it works:
gasoline %>%
mutate(quintile = ntile(lgaspcar, 5)) %>%
mutate(decile = ntile(lgaspcar, 10)) %>%
select(country, year, lgaspcar, quintile, decile)
## # A tibble: 342 x 5
## country year lgaspcar quintile decile
## <chr> <dbl> <dbl> <int> <int>
## 1 austria 1960 4.17 3 6
## 2 austria 1961 4.10 3 6
## 3 austria 1962 4.07 3 5
## 4 austria 1963 4.06 3 5
## 5 austria 1964 4.04 3 5
## 6 austria 1965 4.03 3 5
## 7 austria 1966 4.05 3 5
## 8 austria 1967 4.05 3 5
## 9 austria 1968 4.05 3 5
## 10 austria 1969 4.05 3 5
## # ... with 332 more rows
quintile
and decile
do not hold the values but the quantile the value lies in. If you want to
have a column that contains the median for instance, you can use good ol’ quantile()
:
gasoline %>%
group_by(country) %>%
mutate(median = quantile(lgaspcar, 0.5)) %>% # quantile(x, 0.5) is equivalent to median(x)
filter(year == 1960) %>%
select(country, year, median)
## # A tibble: 18 x 3
## # Groups: country [18]
## country year median
## <chr> <dbl> <dbl>
## 1 austria 1960 4.05
## 2 belgium 1960 3.88
## 3 canada 1960 4.86
## 4 denmark 1960 4.16
## 5 france 1960 3.81
## 6 germany 1960 3.89
## 7 greece 1960 4.89
## 8 ireland 1960 4.22
## 9 italy 1960 3.74
## 10 japan 1960 4.52
## 11 netherlands 1960 3.99
## 12 norway 1960 4.08
## 13 spain 1960 3.99
## 14 sweden 1960 4.00
## 15 switzerland 1960 4.26
## 16 turkey 1960 5.72
## 17 u.k. 1960 3.98
## 18 u.s.a. 1960 4.81
5.6.4 arrange()
arrange()
re-orders the whole tibble
according to values of the supplied variable:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 italy 1977 3.38 -6.10 0.164 -8.15
## 2 italy 1978 3.39 -6.08 0.0348 -8.11
## 3 italy 1976 3.43 -6.12 0.103 -8.17
## 4 italy 1974 3.50 -6.13 -0.223 -8.26
## 5 italy 1975 3.52 -6.17 -0.0327 -8.22
## 6 spain 1978 3.62 -5.29 0.621 -8.63
## 7 italy 1972 3.63 -6.21 -0.215 -8.38
## 8 italy 1971 3.65 -6.22 -0.148 -8.47
## 9 spain 1977 3.65 -5.30 0.526 -8.73
## 10 italy 1973 3.65 -6.16 -0.325 -8.32
## # ... with 332 more rows
If you want to re-order the tibble
in descending order of the variable:
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 turkey 1966 6.16 -7.51 -0.356 -13.0
## 2 turkey 1960 6.13 -7.80 -0.253 -13.5
## 3 turkey 1961 6.11 -7.79 -0.343 -13.4
## 4 turkey 1962 6.08 -7.84 -0.408 -13.2
## 5 turkey 1968 6.08 -7.42 -0.365 -12.8
## 6 turkey 1963 6.08 -7.63 -0.225 -13.3
## 7 turkey 1964 6.06 -7.63 -0.252 -13.2
## 8 turkey 1967 6.04 -7.46 -0.335 -12.8
## 9 japan 1960 6.00 -6.99 -0.145 -12.2
## 10 turkey 1965 5.82 -7.62 -0.293 -12.9
## # ... with 332 more rows
arrange
’s documentation alerts the user that re-ording by group is only possible by explicitely
specifying an option:
gasoline %>%
filter(year %in% seq(1960, 1963)) %>%
group_by(country) %>%
arrange(desc(lgaspcar), .by_group = TRUE)
## # A tibble: 72 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1962 4.85 -5.84 -0.979 -8.32
## # ... with 62 more rows
This is especially useful for plotting. We’ll see this in Chapter 6.
5.6.5 tally()
and count()
tally()
and count()
count the number of observations in your data. I believe count()
is the
more useful of the two, as it counts the number of observations within a group that you can provide:
## # A tibble: 18 x 2
## country n
## <chr> <int>
## 1 austria 19
## 2 belgium 19
## 3 canada 19
## 4 denmark 19
## 5 france 19
## 6 germany 19
## 7 greece 19
## 8 ireland 19
## 9 italy 19
## 10 japan 19
## 11 netherlands 19
## 12 norway 19
## 13 spain 19
## 14 sweden 19
## 15 switzerland 19
## 16 turkey 19
## 17 u.k. 19
## 18 u.s.a. 19
There’s also add_count()
which adds the column to the data:
## # A tibble: 342 x 7
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # ... with 332 more rows
add_count()
is a shortcut for the following code:
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # ... with 332 more rows
where n()
is a dplyr
function that can only be used within summarise()
, mutate()
and
filter()
.
5.7 Special packages for special kinds of data: {forcats}
, {lubridate}
, and {stringr}
5.7.1 🐈🐈🐈🐈
Factor variables are very useful but not very easy to manipulate. forcats
contains very useful
functions that make working on factor variables painless. In my opinion, the four following functions, fct_recode()
, fct_relevel()
, fct_reorder()
and fct_relabel()
, are the ones you must
know, so that’s what I’ll be showing.
Remember in chapter 3 when I very quickly explained what were factor
variables? In this section,
we are going to work a little bit with these type of variable. factor
s are very useful, and the
forcats
package includes some handy functions to work with them. First, let’s load the forcats
package:
as an example, we are going to work with the gss_cat
dataset that is included in forcats
. Let’s
load the data:
## # A tibble: 6 x 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never ma… 26 White $8000 to… Ind,near … Protes… Southe… 12
## 2 2000 Divorced 48 White $8000 to… Not str r… Protes… Baptis… NA
## 3 2000 Widowed 67 White Not appl… Independe… Protes… No den… 2
## 4 2000 Never ma… 39 White Not appl… Ind,near … Orthod… Not ap… 4
## 5 2000 Divorced 25 White Not appl… Not str d… None Not ap… 1
## 6 2000 Married 25 White $20000 -… Strong de… Protes… Southe… NA
as you can see, marital
, race
, rincome
and partyid
are all factor variables. Let’s take a closer
look at marital
:
## Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
and let’s see rincome
:
## Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
factor
variables have different levels and the forcats
package includes functions that allow
you to recode, collapse and do all sorts of things on these levels. For example , using
forcats::fct_recode()
you can recode levels:
gss_cat <- gss_cat %>%
mutate(marital = fct_recode(marital,
refuse = "No answer",
never_married = "Never married",
divorced = "Separated",
divorced = "Divorced",
widowed = "Widowed",
married = "Married"))
gss_cat %>%
tabyl(marital)
## marital n percent
## refuse 17 0.0007913234
## never_married 5416 0.2521063166
## divorced 4126 0.1920588372
## widowed 1807 0.0841130196
## married 10117 0.4709305032
Using fct_recode()
, I was able to recode the levels and collapse Separated
and Divorced
to
a single category called divorced
. As you can see, refuse
and widowed
are less than 10%, so
maybe you’d want to lump these categories together:
gss_cat <- gss_cat %>%
mutate(marital = fct_lump(marital, prop = 0.10, other_level = "other"))
gss_cat %>%
tabyl(marital)
## marital n percent
## never_married 5416 0.25210632
## divorced 4126 0.19205884
## married 10117 0.47093050
## other 1824 0.08490434
fct_reorder()
is especially useful for plotting. We will explore plotting in the next chapter,
but to show you why fct_reorder()
is so useful, I will create a barplot, first without
using fct_reorder()
to re-order the factors, then with reordering. Do not worry if you don’t
understand all the code for now:
It would be much better if the categories were ordered by frequency. This is easy to do with
fct_reorder()
:
gss_cat %>%
tabyl(marital) %>%
mutate(marital = fct_reorder(marital, n, .desc = FALSE)) %>%
ggplot() +
geom_col(aes(y = n, x = marital)) +
coord_flip()
Much better! In Chapter 6, we are going to learn about {ggplot2}
.
{forcats}
contains other very useful functions, so I urge you to go through the documentation.
5.7.2 Get your dates right with {lubridate}
Just like {forcats}
, {lubridate}
contains numerous functions, each one of them very useful.
However, here, I will only focus on a handful of them (those that I use very often). If you need
to work with dates a lot in your work, consider reading all the documentation.
5.7.3 Manipulate strings with {stringr}
{stringr}
contains functions to manipulate strings. In Chapter 11, I will teach you about regular
expressions, but the functions contained in {stringr}
allow you to already do a lot of work on
strings, without needing to be a regular expression expert.
I will discuss the most common string operations: search and replace (or remove), detect, locate or match, and trim a string.
5.8 List-columns
To learn about list-columns, let’s first focus on a single character of the starwars
dataset:
## Observations: 1
## Variables: 13
## $ name <chr> "Luke Skywalker"
## $ height <int> 172
## $ mass <dbl> 77
## $ hair_color <chr> "blond"
## $ skin_color <chr> "fair"
## $ eye_color <chr> "blue"
## $ birth_year <dbl> 19
## $ gender <chr> "male"
## $ homeworld <chr> "Tatooine"
## $ species <chr> "Human"
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">]
## $ starships <list> [<"X-wing", "Imperial shuttle">]
We see that the columns films
, vehicles
and starships
are all lists, and in the case of
films
, it lists all the films where Luke Skywalker has appeared. What if you want to take a closer look at this list?
## [[1]]
## [1] "Revenge of the Sith" "Return of the Jedi"
## [3] "The Empire Strikes Back" "A New Hope"
## [5] "The Force Awakens"
pull()
is a dplyr
function that extract (pulls) the column you’re interested in. It is quite
useful when you want to inspect a column.
Suppose we want to create a categorical variable which counts the number of movies in which the
characters have appeared. For this we need to compute the length of the list, or count the number
of elements this list has. Let’s try with length()
a base R function:
## [1] 1
This might be surprising at first, because we know that Luke Skywalker has appeared in more than 1
movie… the problem here is that for each individual, films
is a list, whose single element is
a vector of characters. This means that length(films)
computes the length of the list, which is
one, and not the length of the vector contained in the list! How can we get the length of the
vector of characters contained in the list and for each character? For this we need to use
dplyr::rowwise()
and remove the filter()
function and use mutate()
to add this column to the
dataset:
dplyr::rowwise()
is useful when working with list-columns: columns that have lists as elements.
Let’s take a look at the characters and the number of films they have appeared in:
## Source: local data frame [87 x 2]
## Groups: <by row>
##
## # A tibble: 87 x 2
## name n_films
## <chr> <int>
## 1 Luke Skywalker 5
## 2 C-3PO 6
## 3 R2-D2 7
## 4 Darth Vader 4
## 5 Leia Organa 5
## 6 Owen Lars 3
## 7 Beru Whitesun lars 3
## 8 R5-D4 1
## 9 Biggs Darklighter 1
## 10 Obi-Wan Kenobi 6
## # ... with 77 more rows
Now we can create a factor variable that groups characters by asking whether they appeared only in 1 movie, or more:
starwars <- starwars %>%
mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie",
n_films != 1 ~ "More than 1 movie"))
case_when()
is a dplyr
function that works similarly to the standard if..else..
construct of
many programming languages (R also has this, we are going to learn about it in later chapters).
You can also create list columns with your own datasets, by using tidyr::nest()
. Remember the
fake survey_data
I created to illustrate spread()
and gather()
? Let’s go back to that dataset
again:
## # A tibble: 12 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
## # A tibble: 5 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [2 × 2]>
## 2 NA <tibble [2 × 2]>
## 3 2 <tibble [3 × 2]>
## 4 3 <tibble [3 × 2]>
## 5 4 <tibble [2 × 2]>
This creates a new tibble, with columns id
and data
. data
is a list-column that contains
tibbles; each tibble is the variable
and value
for each individual:
## [[1]]
## # A tibble: 2 x 2
## variable value
## <chr> <dbl>
## 1 var1 1
## 2 var2 0.2
As you can see, for individual 1, the column data contains a 2x2 tibble with columns variable
and
value
. You might be wondering why this is useful, because this seems to introduce an unnecessary
layer of complexity. The usefulness of list-columns will become apparent in the next chapters,
where we are going to learn how to repeat actions over, say, individuals.
5.9 Exercises
Exercise 1
Load the LaborSupply
dataset from the Ecdat
package and answer the following questions:
- Compute the average annual hours worked by year (plus standard deviation)
- What age group worked the most hours in the year 1982?
- Create a variable,
n_years
that equals the number of years an individual stays in the panel. Is the panel balanced? - Which are the individuals that do not have any kids during the whole period? Create a variable,
no_kids
, that flags these individuals (1 = no kids, 0 = kids) - Using the
no_kids
variable from before compute the average wage, standard deviation and number of observations in each group for the year 1980 (no kids group vs kids group). - Create the lagged logarithm of hours worked and wages. Remember that this is a panel.
Exercise 2
- What does the following code do? Copy and paste it in an R interpreter to find out!
mutate_at()
is a scoped version of mutate()
which allows you to specify a number of columns and
functions in one go. This also exists for summarise()
.
- Using
summarise_at()
, compute the mean, standard deviation and number of individuals oflnhr
andlnwg
for each individual.
Exercise 3
In the dataset folder you downloaded at the beginning of the chapter, there is a folder called “unemployment”. I used the data in the section about working with lists of datasets. Using
rio::import_list()
, read the 4 datasets into R.Using
map()
, map thejanitor::clean_names()
function to each dataset (just like in the example in the section on working with lists of datasets). Then, still withmap()
andmutate()
convert all commune names in thecommune
column with the functiontolower()
, in a new column calledlcommune
. This is not an easy exercise; so here are some hints:- Remember that
all_datasets
is a list of datasets. Which function do you use when you want to map a function to each element of a list? - Each element of
all_datasets
aredata.frame
objects. Which function do you use to add a column to adata.frame
? - What symbol can you use to access a column of a
data.frame
?
- Remember that
References
Vaudor, Lise. 2018. “Itération de Fonctions Avec Purrr.” R-Atique: Analyse de Données Avec R. http://perso.ens-lyon.fr/lise.vaudor/iterer-des-fonctions-avec-purrr/.