Chapter 4 Descriptive statistics and data manipulation
Now that we are familiar with some R objects and know how to import data, it is time to write some
code. In this chapter, we are going to compute descriptive statistics for a single dataset, but
also for a list of datasets. However, I will not give a list of functions to compute descriptive
statistics; if you need a specific function, you can find it easily in the Help pane in RStudio or
using any modern internet search engine. What I will do is show you a workflow that allows you to
compute the descriptive statistics you need quickly.
R has a lot of built-in functions for descriptive statistics; however, if you want to compute
statistics by, say, gender, some more complex manipulations are needed. At least this was true in
the past. Nowadays, thanks to the packages from the tidyverse, it is very easy and fast to
compute descriptive statistics by any stratifying variable(s). The package we are going to use for
this is called dplyr. dplyr contains a lot of functions that make manipulating data and computing
descriptive statistics very easy. To make things easier for now, we are going to use example data
included with dplyr. So there is no need to import an external dataset; this does not change
anything for the examples we are going to study here, since the source of the data does not
matter. Using dplyr is possible only if the data you are working with is already in a useful
shape. When data is messier, you will first need to manipulate it to bring it into a tidy format.
For this, we will use tidyr, which is a very useful package for reshaping data and doing advanced
cleaning of your data.
All these tidyverse functions are also called verbs. However, before getting to know these verbs,
let’s do an analysis using standard, or base, R functions. This will be the benchmark against
which we are going to measure a {tidyverse} workflow.
4.1 A data exploration exercise using base R
Let’s first load the starwars data set, included in the {dplyr} package:
library(dplyr)
data(starwars)
Let’s first take a look at the data:
head(starwars)
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
This data contains information on Star Wars characters. The first question to answer is: what is the average height of the characters?
mean(starwars$height)
## [1] NA
Let’s also take a look at the standard deviation:
sd(starwars$height)
## [1] NA
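Both statistics come back as NA because the height column contains missing values, which mean() and sd() propagate by default. We will deal with this behaviour properly later in the chapter; as a quick preview, the na.rm argument tells these functions to drop the NAs first:

```r
# mean() and sd() return NA as soon as one observation is missing
mean(c(172, 167, NA))
## [1] NA

# na.rm = TRUE drops the missing values before computing
mean(c(172, 167, NA), na.rm = TRUE)
## [1] 169.5

mean(starwars$height, na.rm = TRUE)
sd(starwars$height, na.rm = TRUE)
```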
It might be more informative to compute these two statistics by species, so for this, we are going
to use aggregate():
aggregate(starwars$height,
by = list(Species = starwars$species),
mean)
## Species x
## 1 Aleena 79.0000
## 2 Besalisk 198.0000
## 3 Cerean 198.0000
## 4 Chagrian 196.0000
## 5 Clawdite 168.0000
## 6 Droid NA
## 7 Dug 112.0000
## 8 Ewok 88.0000
## 9 Geonosian 183.0000
## 10 Gungan 208.6667
## 11 Human NA
## 12 Hutt 175.0000
## 13 Iktotchi 188.0000
## 14 Kaleesh 216.0000
## 15 Kaminoan 221.0000
## 16 Kel Dor 188.0000
## 17 Mirialan 168.0000
## 18 Mon Calamari 180.0000
## 19 Muun 191.0000
## 20 Nautolan 196.0000
## 21 Neimodian 191.0000
## 22 Pau'an 206.0000
## 23 Quermian 264.0000
## 24 Rodian 173.0000
## 25 Skakoan 193.0000
## 26 Sullustan 160.0000
## 27 Tholothian 184.0000
## 28 Togruta 178.0000
## 29 Toong 163.0000
## 30 Toydarian 137.0000
## 31 Trandoshan 190.0000
## 32 Twi'lek 179.0000
## 33 Vulptereen 94.0000
## 34 Wookiee 231.0000
## 35 Xexto 122.0000
## 36 Yoda's species 66.0000
## 37 Zabrak 173.0000
Even if you are not familiar with aggregate(), I believe the above lines are quite self-explanatory.
You need to provide aggregate() with three things: the variable you want to summarize (or the whole
data frame, if you want to summarize all variables), a list of grouping variables, and the
function that will be applied to each subset. You can easily add another grouping variable:
aggregate(starwars$height,
by = list(Species = starwars$species,
Homeworld = starwars$homeworld),
mean)
## Species Homeworld x
## 1 Human Alderaan 176.3333
## 2 Aleena Aleen Minor 79.0000
## 3 Human Bespin 175.0000
## 4 Human Bestine IV 180.0000
## 5 Neimodian Cato Neimoidia 191.0000
## 6 Cerean Cerea 198.0000
## 7 Chagrian Champala 196.0000
## 8 Human Chandrila 150.0000
## 9 Human Concord Dawn 183.0000
## 10 Human Corellia 175.0000
## 11 Human Coruscant 168.5000
## 12 Tholothian Coruscant 184.0000
## 13 Zabrak Dathomir 175.0000
## 14 Kel Dor Dorin 188.0000
## 15 Ewok Endor 88.0000
## 16 Human Eriadu 180.0000
## 17 Geonosian Geonosis 183.0000
## 18 Nautolan Glee Anselm 196.0000
## 19 Human Haruun Kal 188.0000
## 20 Iktotchi Iktotch 188.0000
## 21 Zabrak Iridonia 171.0000
## 22 Kaleesh Kalee 216.0000
## 23 Human Kamino 183.0000
## 24 Kaminoan Kamino 221.0000
## 25 Wookiee Kashyyyk 231.0000
## 26 Dug Malastare 112.0000
## 27 Mirialan Mirial 168.0000
## 28 Mon Calamari Mon Cala 180.0000
## 29 Muun Muunilinst 191.0000
## 30 Droid Naboo 96.0000
## 31 Gungan Naboo 208.6667
## 32 Human Naboo 168.4000
## 33 Hutt Nal Hutta 175.0000
## 34 Besalisk Ojom 198.0000
## 35 Quermian Quermia 264.0000
## 36 Rodian Rodia 173.0000
## 37 Twi'lek Ryloth 179.0000
## 38 Human Serenno 193.0000
## 39 Togruta Shili 178.0000
## 40 Skakoan Skako 193.0000
## 41 Human Socorro 177.0000
## 42 Human Stewjon 182.0000
## 43 Sullustan Sullust 160.0000
## 44 Droid Tatooine 132.0000
## 45 Human Tatooine 179.2500
## 46 Toydarian Toydaria 137.0000
## 47 Trandoshan Trandosha 190.0000
## 48 Xexto Troiken 122.0000
## 49 Toong Tund 163.0000
## 50 Pau'an Utapau 206.0000
## 51 Vulptereen Vulpter 94.0000
## 52 Clawdite Zolan 168.0000
or use another function:
aggregate(starwars$height,
by = list(Species = starwars$species),
sd)
## Species x
## 1 Aleena NA
## 2 Besalisk NA
## 3 Cerean NA
## 4 Chagrian NA
## 5 Clawdite NA
## 6 Droid NA
## 7 Dug NA
## 8 Ewok NA
## 9 Geonosian NA
## 10 Gungan 14.189198
## 11 Human NA
## 12 Hutt NA
## 13 Iktotchi NA
## 14 Kaleesh NA
## 15 Kaminoan 11.313708
## 16 Kel Dor NA
## 17 Mirialan 2.828427
## 18 Mon Calamari NA
## 19 Muun NA
## 20 Nautolan NA
## 21 Neimodian NA
## 22 Pau'an NA
## 23 Quermian NA
## 24 Rodian NA
## 25 Skakoan NA
## 26 Sullustan NA
## 27 Tholothian NA
## 28 Togruta NA
## 29 Toong NA
## 30 Toydarian NA
## 31 Trandoshan NA
## 32 Twi'lek 1.414214
## 33 Vulptereen NA
## 34 Wookiee 4.242641
## 35 Xexto NA
## 36 Yoda's species NA
## 37 Zabrak 2.828427
aggregate() returns a data.frame object:
class(aggregate(starwars$height, by = list(Species = starwars$species), mean))
## [1] "data.frame"
tapply() is another base R alternative:
tapply(starwars$height, list(starwars$species), mean)
## Aleena Besalisk Cerean Chagrian Clawdite
## 79.0000 198.0000 198.0000 196.0000 168.0000
## Droid Dug Ewok Geonosian Gungan
## NA 112.0000 88.0000 183.0000 208.6667
## Human Hutt Iktotchi Kaleesh Kaminoan
## NA 175.0000 188.0000 216.0000 221.0000
## Kel Dor Mirialan Mon Calamari Muun Nautolan
## 188.0000 168.0000 180.0000 191.0000 196.0000
## Neimodian Pau'an Quermian Rodian Skakoan
## 191.0000 206.0000 264.0000 173.0000 193.0000
## Sullustan Tholothian Togruta Toong Toydarian
## 160.0000 184.0000 178.0000 163.0000 137.0000
## Trandoshan Twi'lek Vulptereen Wookiee Xexto
## 190.0000 179.0000 94.0000 231.0000 122.0000
## Yoda's species Zabrak
## 66.0000 173.0000
which returns an array object, which is similar to a vector.
However, tapply() does not work if you want the mean by species for all the variables in the
data frame:
tapply(starwars, list(starwars$species), mean)
Error in tapply(starwars, list(starwars$species), mean) :
arguments must have same length
In both cases, you can only specify one function. So if you need the average and the standard deviation you have to do it in two steps.
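One workaround, if you do not want two separate calls, is to give aggregate() a small anonymous function that returns several statistics at once as a named vector; note that this is a trick rather than a feature of aggregate()’s interface:

```r
# one helper function returning both statistics as a named vector
aggregate(starwars$height,
          by = list(Species = starwars$species),
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```

The resulting x column is then a matrix with one column per statistic, which can be surprising the first time you inspect the result.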
Let’s continue now by computing the average height by species, but only for males:
starwars_males <- subset(starwars, gender == "male")
aggregate(starwars_males$height,
by = list(Species = starwars_males$species),
mean)
## Species x
## 1 Aleena 79.0000
## 2 Besalisk 198.0000
## 3 Cerean 198.0000
## 4 Chagrian 196.0000
## 5 Dug 112.0000
## 6 Ewok 88.0000
## 7 Geonosian 183.0000
## 8 Gungan 208.6667
## 9 Human NA
## 10 Iktotchi 188.0000
## 11 Kaleesh 216.0000
## 12 Kaminoan 229.0000
## 13 Kel Dor 188.0000
## 14 Mon Calamari 180.0000
## 15 Muun 191.0000
## 16 Nautolan 196.0000
## 17 Neimodian 191.0000
## 18 Pau'an 206.0000
## 19 Quermian 264.0000
## 20 Rodian 173.0000
## 21 Skakoan 193.0000
## 22 Sullustan 160.0000
## 23 Toong 163.0000
## 24 Toydarian 137.0000
## 25 Trandoshan 190.0000
## 26 Twi'lek 180.0000
## 27 Vulptereen 94.0000
## 28 Wookiee 231.0000
## 29 Xexto 122.0000
## 30 Yoda's species 66.0000
## 31 Zabrak 173.0000
I first use subset() to create a subset of the data in which I only kept males. Then, I rerun
the previous analysis. subset() can also be used to select columns. So if you want
the average of the height and mass for males, you could do something like this:
starwars_males_height_mass <- subset(starwars, gender == "male", select = c(height, mass, species))
aggregate(starwars_males_height_mass,
by = list(Species = starwars_males_height_mass$species),
mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Species height mass species
## 1 Aleena 79.0000 15 NA
## 2 Besalisk 198.0000 102 NA
## 3 Cerean 198.0000 82 NA
## 4 Chagrian 196.0000 NA NA
## 5 Dug 112.0000 40 NA
## 6 Ewok 88.0000 20 NA
## 7 Geonosian 183.0000 80 NA
## 8 Gungan 208.6667 NA NA
## 9 Human NA NA NA
## 10 Iktotchi 188.0000 NA NA
## 11 Kaleesh 216.0000 159 NA
## 12 Kaminoan 229.0000 88 NA
## 13 Kel Dor 188.0000 80 NA
## 14 Mon Calamari 180.0000 83 NA
## 15 Muun 191.0000 NA NA
## 16 Nautolan 196.0000 87 NA
## 17 Neimodian 191.0000 90 NA
## 18 Pau'an 206.0000 80 NA
## 19 Quermian 264.0000 NA NA
## 20 Rodian 173.0000 74 NA
## 21 Skakoan 193.0000 48 NA
## 22 Sullustan 160.0000 68 NA
## 23 Toong 163.0000 65 NA
## 24 Toydarian 137.0000 NA NA
## 25 Trandoshan 190.0000 113 NA
## 26 Twi'lek 180.0000 NA NA
## 27 Vulptereen 94.0000 45 NA
## 28 Wookiee 231.0000 124 NA
## 29 Xexto 122.0000 NA NA
## 30 Yoda's species 66.0000 17 NA
## 31 Zabrak 173.0000 NA NA
This is starting to get a bit verbose, but it is quite easy to follow and very powerful. It certainly beats having to write loops to achieve the same thing.
Let’s now consider this new dataset:
survey_data_base
## id var1 var2 var3
## 1 1 1.0 0.2 0.3
## 2 2 1.4 1.9 4.1
## 3 3 0.1 2.8 8.9
## 4 4 1.7 1.9 7.6
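If you want to follow along, this small data frame can be recreated from the values printed above:

```r
# recreate the survey_data_base data frame shown above
survey_data_base <- data.frame(
  id   = 1:4,
  var1 = c(1.0, 1.4, 0.1, 1.7),
  var2 = c(0.2, 1.9, 2.8, 1.9),
  var3 = c(0.3, 4.1, 8.9, 7.6)
)
```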
I will explain later where this comes from. Depending on what you want to do with this data, it is
not in the right shape. So let’s reshape it, using the aptly named reshape() command:
survey_data_long <- reshape(survey_data_base,
varying = list(2:4), v.names = "variable", direction = "long")
We can now easily compute the average of variable for each id:
aggregate(survey_data_long$variable,
by = list(Id = survey_data_long$id),
mean)
## Id x
## 1 1 0.500000
## 2 2 2.466667
## 3 3 3.933333
## 4 4 3.733333
It is also possible to merge two datasets with merge(). I won’t go into that here, however.
As you can see, R comes with very powerful functions right out of the box, ready to use. When I was
studying, unfortunately, my professors had been brought up on FORTRAN loops, so we had to do all of
this using loops (not the reshaping, thankfully), which was not so easy.
Now that we have seen how base R works, let’s redo the analysis using {tidyverse} verbs.
But before deep diving into the {tidyverse}, let’s take a moment to discuss our lord and
saviour, %>%.
4.2 Smoking is bad for you, but pipes are your friend
The title of this section might sound weird at first, but by the end of it, you’ll get this (terrible) pun.
You probably know the following painting by René Magritte, La trahison des images:
It turns out there’s an R package from the tidyverse that is called magrittr. What does this
package do? It brings pipes to R. Pipes are a concept from the Unix operating system; if you’re
using a GNU+Linux distribution or macOS, you’re basically using a modern Unix (that’s an
oversimplification, but I’m an economist by training, and outrageously oversimplifying things is
what we do, deal with it).
The idea of pipes is to take the output of a command and feed it as the input of another
command. The magrittr package brings pipes to R, by using the weird-looking %>%. Try the
following:
library(magrittr)
16 %>% sqrt
## [1] 4
This looks quite weird, but you probably understand what happened; 16 got fed as the first
argument of the function sqrt(). You can chain multiple functions:
16 %>% sqrt %>% `+`(18)
## [1] 22
The value 16 got fed to sqrt(), and the output of sqrt(16) (4) got fed to `+`(18), which
evaluates to 22. Without %>% you’d write the line just above like this:
sqrt(16) + 18
## [1] 22
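By the way, you can also write the piped functions with their usual parentheses, which reads more naturally once you start passing additional arguments. magrittr also ships alias functions for operators, such as add() for `+`:

```r
library(magrittr)

# the same computation, with explicit parentheses and the add() alias for `+`
16 %>% sqrt() %>% add(18)
## [1] 22
```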
It might not be very clear right now why this is useful, but %>% is probably one of the
most useful infix operators, because when using packages from the tidyverse, you will
naturally want to chain a lot of functions together. Without %>%, it would become messy very fast.
%>% is not the only pipe operator in magrittr. There’s also %T>%, %<>% and %$%. All have their
uses, but they are basically shortcuts for some common tasks with %>% plus another function. This
means that you can live without them, and because of this, I will not discuss them.
4.3 The {tidyverse}’s enfant prodige: {dplyr}
The best way to get started with the tidyverse packages is to get to know {dplyr}. {dplyr} provides
a lot of very useful functions that make it very easy to get descriptive statistics or add new
columns to your data.
4.3.1 A first taste of data manipulation with {dplyr}
This section will walk you through a typical analysis using {dplyr} functions. Just go with it; I
will give more details in the next sections.
First, let’s load dplyr and the included starwars dataset. Let’s also take a look at the first 6
lines of the dataset:
library(dplyr)
data(starwars)
head(starwars)
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
data(starwars) loads the example dataset called starwars that is included in the package dplyr.
As I said earlier, this is just an example; you could have loaded an external dataset, from a
.csv file for instance. This does not matter for what comes next.
R includes a lot of functions for descriptive statistics, such as mean(), sd(), cov(), and many
more. What dplyr brings to the table (among other niceties) is the possibility to apply these
functions to the dataset easily. For example, imagine you want the average height of everyone in
the dataset. Using the basic R functions, you could write this:
mean(starwars$height)
## [1] NA
starwars$height means that the user wants to access the column called height from the dataset
starwars. Remember that the $ symbol is how you access elements of a named list; as you can see,
this is the same for columns of datasets. The column is then given as an argument to the function
mean(). But what if the user wants the average height by species? Before dplyr, a solution to
this simple problem would have required more than a single command. Now this is as easy as:
starwars %>%
group_by(species) %>%
summarise(mean(height))
## # A tibble: 38 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> NA
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Clawdite 168
## 7 Droid NA
## 8 Dug 112
## 9 Ewok 88
## 10 Geonosian 183
## # … with 28 more rows
The usefulness of the %>% (pipe operator) becomes apparent now. Without it, one would write
instead:
summarise(group_by(starwars, species), mean(height))
## # A tibble: 38 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> NA
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Clawdite 168
## 7 Droid NA
## 8 Dug 112
## 9 Ewok 88
## 10 Geonosian 183
## # … with 28 more rows
As you can clearly see, this is much more difficult to read. Imagine now that I want the average
height by species, but only for males. Again, this is very easy using %>%:
starwars %>%
filter(gender == "male") %>%
group_by(species) %>%
summarise(mean(height))
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> 183
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Dug 112
## 7 Ewok 88
## 8 Geonosian 183
## 9 Gungan 209.
## 10 Human NA
## # … with 22 more rows
Again, the %>% makes the above lines of code very easy to read. Without it, one would need to write:
summarise(group_by(filter(starwars, gender == "male"), species), mean(height))
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> 183
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Dug 112
## 7 Ewok 88
## 8 Geonosian 183
## 9 Gungan 209.
## 10 Human NA
## # … with 22 more rows
I think you agree with me that this is not very readable. One way to make it more readable would be to save intermediary variables:
filtered_data <- filter(starwars, gender == "male")
grouped_data <- group_by(filtered_data, species)
summarise(grouped_data, mean(height))
## # A tibble: 32 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> 183
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Dug 112
## 7 Ewok 88
## 8 Geonosian 183
## 9 Gungan 209.
## 10 Human NA
## # … with 22 more rows
But this can get very tedious. Once you’re used to %>%, you won’t go back to not using it.
Before continuing, and to make things clearer: filter(), group_by() and summarise() are
functions that are included in dplyr. %>% is actually a function from magrittr, but this
package gets loaded on the fly when you load dplyr, so you do not need to worry about it.
mean() is a function native to R.
The results of all these operations that use dplyr functions are actually other datasets, or
tibbles. This means that you can save them in a variable and then work with them like any other
dataset.
mean_height <- starwars %>%
group_by(species) %>%
summarise(mean(height))
class(mean_height)
## [1] "tbl_df" "tbl" "data.frame"
head(mean_height)
## # A tibble: 6 x 2
## species `mean(height)`
## <chr> <dbl>
## 1 <NA> NA
## 2 Aleena 79
## 3 Besalisk 198
## 4 Cerean 198
## 5 Chagrian 196
## 6 Clawdite 168
You could then write this data to disk using rio::export() for instance. If you need more than the
mean of the height, you can keep adding as many functions as needed:
summary_table <- starwars %>%
group_by(species) %>%
summarise(ave_height = mean(height), var_height = var(height), n_obs = n())
print(summary_table)
## # A tibble: 38 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 <NA> NA NA 5
## 2 Aleena 79 NA 1
## 3 Besalisk 198 NA 1
## 4 Cerean 198 NA 1
## 5 Chagrian 196 NA 1
## 6 Clawdite 168 NA 1
## 7 Droid NA NA 5
## 8 Dug 112 NA 1
## 9 Ewok 88 NA 1
## 10 Geonosian 183 NA 1
## # … with 28 more rows
I’ve added more functions, namely var(), to get the variance of height, and n(), which
is a function from dplyr, not base R, to get the number of observations. This is quite useful,
because we see that for a lot of species we only have a single individual! Let’s focus on the
species for which we have more than one individual. Since we saved all the previous operations
(which produce a tibble) in a variable, we can keep going from there:
summary_table2 <- summary_table %>%
filter(n_obs > 1)
print(summary_table2)
## # A tibble: 9 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 <NA> NA NA 5
## 2 Droid NA NA 5
## 3 Gungan 209. 201. 3
## 4 Human NA NA 35
## 5 Kaminoan 221 128 2
## 6 Mirialan 168 8 2
## 7 Twi'lek 179 2 2
## 8 Wookiee 231 18 2
## 9 Zabrak 173 8 2
There are a lot of NAs; this is because by default, mean() and var() return NA if even one
single observation is NA. This is good, because it forces you to look at the data
to see what is going on. If you got a number even when there were NAs, you could very easily
miss these missing values. It is better for functions to fail early and often than the opposite.
mean() and var() have a na.rm option that the user can set to TRUE to get the result by
ignoring the NAs:
starwars %>%
group_by(species) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 9 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 <NA> 160 1826 5
## 2 Droid 140 2705. 5
## 3 Gungan 209. 201. 3
## 4 Human 177. 157. 35
## 5 Kaminoan 221 128 2
## 6 Mirialan 168 8 2
## 7 Twi'lek 179 2 2
## 8 Wookiee 231 18 2
## 9 Zabrak 173 8 2
In the code above, I have combined the two previous steps to get the result I’m interested in.
There’s a line in the final output that says NA for the species. Let’s go back to the raw data and
find these lines:
starwars %>%
filter(is.na(species))
## # A tibble: 5 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Ric … 183 NA brown fair blue NA male
## 2 Quar… 183 NA black dark brown 62 male
## 3 R4-P… 96 NA none silver, r… red, blue NA female
## 4 Sly … 178 48 none pale white NA female
## 5 Capt… NA NA unknown unknown unknown NA female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
To test for NA, one uses the function is.na(), not something like species == "NA" or anything
like that. !is.na() does the opposite:
starwars %>%
filter(!is.na(species))
## # A tibble: 82 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 72 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
The ! function negates a predicate function (a predicate function is a function that returns
TRUE or FALSE). We can then rerun our analysis from before:
starwars %>%
filter(!is.na(species)) %>%
group_by(species) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 8 x 4
## species ave_height var_height n_obs
## <chr> <dbl> <dbl> <int>
## 1 Droid 140 2705. 5
## 2 Gungan 209. 201. 3
## 3 Human 177. 157. 35
## 4 Kaminoan 221 128 2
## 5 Mirialan 168 8 2
## 6 Twi'lek 179 2 2
## 7 Wookiee 231 18 2
## 8 Zabrak 173 8 2
And why not compute the same table, but first add another stratifying variable?
starwars %>%
filter(!is.na(species)) %>%
group_by(species, gender) %>%
summarise(ave_height = mean(height, na.rm = TRUE), var_height = var(height, na.rm = TRUE), n_obs = n()) %>%
filter(n_obs > 1)
## # A tibble: 8 x 5
## # Groups: species [6]
## species gender ave_height var_height n_obs
## <chr> <chr> <dbl> <dbl> <int>
## 1 Droid <NA> 120 1657 3
## 2 Droid none 200 NA 2
## 3 Gungan male 209. 201. 3
## 4 Human female 160. 48.8 9
## 5 Human male 182. 67.1 26
## 6 Mirialan female 168 8 2
## 7 Wookiee male 231 18 2
## 8 Zabrak male 173 8 2
Ok, that’s it for a first taste. We have already discovered some very useful {dplyr} functions:
filter(), group_by() and summarise().
Now, let’s look at these functions in more detail.
4.3.2 Filter the rows of a dataset with filter()
We’re going to use the Gasoline dataset from the plm package, so install that first:
install.packages("plm")
Then load the required data:
data(Gasoline, package = "plm")
and load dplyr:
library(dplyr)
This dataset gives the consumption of gasoline for 18 countries from 1960 to 1978. When you load
the data like this, it is a standard data.frame. dplyr functions can be used on standard
data.frame objects, but also on tibbles. tibbles are just like data frames, but with a better
print method (and other niceties). I’ll discuss the {tibble} package later, but for now, let’s
convert the data to a tibble and change its name:
gasoline <- as_tibble(Gasoline)
filter() is pretty straightforward. What if you would like to subset the data to focus on the
year 1969? Simple:
filter(gasoline, year == 1969)
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
Let’s use %>%, since we’re familiar with it now:
gasoline %>% filter(year == 1969)
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
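filter() also accepts several conditions at once, separated by commas; they are combined with a logical AND, meaning only the rows satisfying all of them are kept. For instance, to get the French observation for 1969:

```r
# comma-separated conditions are combined with AND
gasoline %>%
  filter(year == 1969, country == "FRANCE")
```

This is equivalent to a single condition written with &: filter(gasoline, year == 1969 & country == "FRANCE").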
You can also filter more than just one year, by using the %in% operator:
gasoline %>% filter(year %in% seq(1969, 1973))
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # … with 80 more rows
It is also possible to use between(), a helper function:
gasoline %>% filter(between(year, 1969, 1973))
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # … with 80 more rows
To select non-consecutive years:
gasoline %>% filter(year %in% c(1969, 1973, 1977))
## # A tibble: 54 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 3 AUSTRIA 1977 3.93 -5.83 -0.422 -8.25
## 4 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 5 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## 6 BELGIUM 1977 3.85 -5.56 -0.432 -8.14
## 7 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 8 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 9 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 10 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## # … with 44 more rows
%in% tests if an object is part of a set.
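To get a feel for it, you can try %in% on its own; it is vectorised over its left-hand side and returns one TRUE or FALSE per element:

```r
1969 %in% c(1969, 1973, 1977)
## [1] TRUE
1980 %in% c(1969, 1973, 1977)
## [1] FALSE
c(1969, 1980) %in% c(1969, 1973, 1977)
## [1]  TRUE FALSE
```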
4.3.3 Select columns with select()
While filter() allows you to keep or discard rows of data, select() allows you to keep or discard entire columns. To keep columns:
gasoline %>% select(country, year, lrpmg)
## # A tibble: 342 x 3
## country year lrpmg
## <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # … with 332 more rows
To discard them:
gasoline %>% select(-country, -year, -lrpmg)
## # A tibble: 342 x 3
## lgaspcar lincomep lcarpcap
## <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -9.77
## 2 4.10 -6.43 -9.61
## 3 4.07 -6.41 -9.46
## 4 4.06 -6.37 -9.34
## 5 4.04 -6.32 -9.24
## 6 4.03 -6.29 -9.12
## 7 4.05 -6.25 -9.02
## 8 4.05 -6.23 -8.93
## 9 4.05 -6.21 -8.85
## 10 4.05 -6.15 -8.79
## # … with 332 more rows
To rename them:
gasoline %>% select(country, date = year, lrpmg)
## # A tibble: 342 x 3
## country date lrpmg
## <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # … with 332 more rows
There’s also rename():
gasoline %>% rename(date = year)
## # A tibble: 342 x 6
## country date lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
rename() does not do any kind of selection; it just renames. You can also use select() to re-order columns:
gasoline %>% select(year, country, lrpmg, everything())
## # A tibble: 342 x 6
## year country lrpmg lgaspcar lincomep lcarpcap
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1960 AUSTRIA -0.335 4.17 -6.47 -9.77
## 2 1961 AUSTRIA -0.351 4.10 -6.43 -9.61
## 3 1962 AUSTRIA -0.380 4.07 -6.41 -9.46
## 4 1963 AUSTRIA -0.414 4.06 -6.37 -9.34
## 5 1964 AUSTRIA -0.445 4.04 -6.32 -9.24
## 6 1965 AUSTRIA -0.497 4.03 -6.29 -9.12
## 7 1966 AUSTRIA -0.467 4.05 -6.25 -9.02
## 8 1967 AUSTRIA -0.506 4.05 -6.23 -8.93
## 9 1968 AUSTRIA -0.522 4.05 -6.21 -8.85
## 10 1969 AUSTRIA -0.559 4.05 -6.15 -8.79
## # … with 332 more rows
everything() is a helper function, and there are also starts_with() and ends_with(). For example, what if we are only interested in columns whose names start with “l”?
gasoline %>% select(starts_with("l"))
## # A tibble: 342 x 4
## lgaspcar lincomep lrpmg lcarpcap
## <dbl> <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -0.335 -9.77
## 2 4.10 -6.43 -0.351 -9.61
## 3 4.07 -6.41 -0.380 -9.46
## 4 4.06 -6.37 -0.414 -9.34
## 5 4.04 -6.32 -0.445 -9.24
## 6 4.03 -6.29 -0.497 -9.12
## 7 4.05 -6.25 -0.467 -9.02
## 8 4.05 -6.23 -0.506 -8.93
## 9 4.05 -6.21 -0.522 -8.85
## 10 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
ends_with() works in a similar fashion. There is also contains():
gasoline %>% select(country, year, contains("car"))
## # A tibble: 342 x 4
## country year lgaspcar lcarpcap
## <fct> <int> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -9.77
## 2 AUSTRIA 1961 4.10 -9.61
## 3 AUSTRIA 1962 4.07 -9.46
## 4 AUSTRIA 1963 4.06 -9.34
## 5 AUSTRIA 1964 4.04 -9.24
## 6 AUSTRIA 1965 4.03 -9.12
## 7 AUSTRIA 1966 4.05 -9.02
## 8 AUSTRIA 1967 4.05 -8.93
## 9 AUSTRIA 1968 4.05 -8.85
## 10 AUSTRIA 1969 4.05 -8.79
## # … with 332 more rows
Another verb, similar to select(), is pull(). Let’s compare the two:
gasoline %>% select(lrpmg)
## # A tibble: 342 x 1
## lrpmg
## <dbl>
## 1 -0.335
## 2 -0.351
## 3 -0.380
## 4 -0.414
## 5 -0.445
## 6 -0.497
## 7 -0.467
## 8 -0.506
## 9 -0.522
## 10 -0.559
## # … with 332 more rows
gasoline %>% pull(lrpmg)
## [1] -0.33454761 -0.35132761 -0.37951769 -0.41425139 -0.44533536
## [6] -0.49706066 -0.46683773 -0.50588340 -0.52241255 -0.55911051
## [11] -0.59656122 -0.65445914 -0.59633184 -0.59444681 -0.46602693
## [16] -0.45414221 -0.50008372 -0.42191563 -0.46960312 -0.16570961
## [21] -0.17173098 -0.22229138 -0.25046225 -0.27591057 -0.34493695
## [26] -0.23639770 -0.26699499 -0.31116076 -0.35480852 -0.37794044
## [31] -0.39922992 -0.31064584 -0.37309192 -0.36223563 -0.36430848
## [36] -0.37896584 -0.43164133 -0.59094964 -0.97210650 -0.97229024
## [41] -0.97860756 -1.01904791 -1.00285696 -1.01712549 -1.01694436
## [46] -1.02359713 -1.01984524 -1.03686389 -1.06733308 -1.05803676
## [51] -1.09966703 -1.13316142 -1.12379997 -1.18568427 -1.06179659
## [56] -1.07084448 -1.07495073 -0.19570260 -0.25361844 -0.21875400
## [61] -0.24800936 -0.30654923 -0.32701542 -0.39618846 -0.44257369
## [66] -0.35204752 -0.40687922 -0.44046082 -0.45473954 -0.49918863
## [71] -0.43257185 -0.42517720 -0.39395431 -0.35361534 -0.35690917
## [76] -0.29068135 -0.01959833 -0.02386000 -0.06892022 -0.13792900
## [81] -0.19784646 -0.23365325 -0.26427164 -0.29405795 -0.32316179
## [86] -0.31519087 -0.33384616 -0.37945667 -0.40781642 -0.47503429
## [91] -0.21698191 -0.25838174 -0.24651309 -0.22550681 -0.38075942
## [96] -0.18591078 -0.23095384 -0.34384171 -0.37464672 -0.39965256
## [101] -0.43987825 -0.54000197 -0.54998139 -0.43824222 -0.58923137
## [106] -0.63329520 -0.67176311 -0.71797458 -0.72587521 -0.56982876
## [111] -0.56482380 -0.62481298 -0.59761210 -0.62817279 -0.08354740
## [116] -0.10421997 -0.13320751 -0.15653576 -0.18051772 -0.07793999
## [121] -0.11491900 -0.13775849 -0.15375883 -0.17986997 -0.20252426
## [126] -0.06761078 -0.11973059 -0.05191029 0.31625351 0.20631574
## [131] 0.19319312 0.23502961 0.16896037 -0.07648118 -0.12040874
## [136] -0.14160039 -0.15232915 -0.24428212 -0.16899366 -0.21071901
## [141] -0.17383533 -0.21339314 -0.27162842 -0.32069023 -0.36041067
## [146] -0.42393131 -0.64567297 -0.55343875 -0.64126416 -0.66134256
## [151] -0.56011483 -0.66277808 0.16507708 -0.08559038 -0.18351291
## [156] -0.26541405 -0.42609643 -0.32712637 -0.24887418 -0.19160048
## [161] -0.20616656 -0.24756681 -0.23271512 -0.14822267 -0.21508857
## [166] -0.32508487 -0.22290860 -0.03270913 0.10292798 0.16418805
## [171] 0.03482212 -0.14532271 -0.14874940 -0.18731459 -0.19996473
## [176] -0.20386433 -0.23786571 -0.27411537 -0.33167240 -0.35126918
## [181] -0.41685019 -0.46203546 -0.43941354 -0.52100094 -0.46270739
## [186] -0.19090636 -0.15948473 -0.20726559 -0.21904447 -0.28707638
## [191] -0.20148480 -0.21599265 -0.25968008 -0.29718661 -0.36929389
## [196] -0.34197503 -0.34809007 -0.31232019 -0.44450431 -0.41694955
## [201] -0.39954544 -0.43393029 -0.31903240 -0.42728193 -0.35253685
## [206] -0.43426178 -0.42908393 -0.46474195 -0.55791459 -0.13968957
## [211] -0.15790514 -0.19908809 -0.23263318 -0.26374731 -0.31593124
## [216] -0.25011726 -0.26555763 -0.30036775 -0.33823045 -0.39072560
## [221] -0.30127223 -0.26023925 -0.33880765 -0.15100924 -0.32726757
## [226] -0.35308752 -0.38255762 -0.30765935 1.12531070 1.10956235
## [231] 1.05700394 0.97683534 0.91532254 0.81666055 0.75671751
## [236] 0.74130811 0.70386453 0.66948950 0.61217208 0.60699563
## [241] 0.53716844 0.43377166 0.52492096 0.62955545 0.68385409
## [246] 0.52627167 0.62141374 -2.52041588 -2.57148340 -2.53448158
## [251] -2.60511224 -2.65801626 -2.64476790 -2.63901460 -2.65609762
## [256] -2.67918662 -2.73190414 -2.73359211 -2.77884554 -2.77467537
## [261] -2.84142900 -2.79840677 -2.76731461 -2.82294480 -2.82005896
## [266] -2.89649671 -0.82321833 -0.86558473 -0.82218510 -0.86012004
## [271] -0.86767682 -0.90528668 -0.85956665 -0.90656671 -0.87232520
## [276] -0.91812162 -0.96344188 -1.03746081 -0.94015345 -0.86722756
## [281] -0.88692306 -0.88475790 -0.90736205 -0.91147285 -1.03208811
## [286] -0.25340821 -0.34252375 -0.40820484 -0.22499174 -0.25219448
## [291] -0.29347614 -0.35640491 -0.33515022 -0.36507386 -0.29845417
## [296] -0.39882648 -0.30461880 -0.54637424 -0.69162023 -0.33965308
## [301] -0.53794675 -0.75141027 -0.95552413 -0.35290961 -0.39108581
## [306] -0.45185308 -0.42287690 -0.46335147 -0.49577430 -0.42654915
## [311] -0.47068145 -0.44118786 -0.46245080 -0.38332457 -0.41899030
## [316] -0.46135978 -0.52777246 -0.56529718 -0.56641296 -0.20867428
## [321] -0.27354010 -0.50886285 -0.78652911 -1.12111489 -1.14624034
## [326] -1.16187449 -1.17991524 -1.20026222 -1.19428750 -1.19026054
## [331] -1.18991215 -1.20730059 -1.22314272 -1.25176347 -1.28131560
## [336] -1.33116930 -1.29066967 -1.23146686 -1.20037697 -1.15468197
## [341] -1.17590974 -1.21206183
pull(), unlike select(), does not return a tibble, but an atomic vector containing only the column you want.
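You can check this difference yourself with class(); a small sketch, assuming gasoline is loaded as above:

```r
# select() returns a one-column tibble (still a data frame)
gasoline %>% select(lrpmg) %>% class()
## [1] "tbl_df"     "tbl"        "data.frame"

# pull() returns an atomic vector
gasoline %>% pull(lrpmg) %>% class()
## [1] "numeric"
```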
4.3.4 Group the observations of your dataset with group_by()
group_by() is a very useful verb; as the name implies, it allows you to create groups and then compute descriptive statistics by group. For example, let’s group our data by country:
gasoline %>% group_by(country)
## # A tibble: 342 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
It looks like nothing much happened, but if you look at the second line of the output you can read the following:
## # Groups: country [18]
This means that the data is grouped, and every computation you do now will take these groups into account. It is also possible to group by more than one variable:
gasoline %>% group_by(country, year)
## # A tibble: 342 x 6
## # Groups: country, year [342]
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
and so on. You can then also ungroup:
gasoline %>% group_by(country, year) %>% ungroup()
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
Once your data is grouped, the operations that will follow will be executed inside each group.
4.3.5 Get summary statistics with summarise()
OK, now that we have learned the basic verbs, we can start to do more interesting things. For example, one might want to compute the average gasoline consumption in each country over the whole period:
gasoline %>%
group_by(country) %>%
summarise(mean(lgaspcar))
## # A tibble: 18 x 2
## country `mean(lgaspcar)`
## <fct> <dbl>
## 1 AUSTRIA 4.06
## 2 BELGIUM 3.92
## 3 CANADA 4.86
## 4 DENMARK 4.19
## 5 FRANCE 3.82
## 6 GERMANY 3.89
## 7 GREECE 4.88
## 8 IRELAND 4.23
## 9 ITALY 3.73
## 10 JAPAN 4.70
## 11 NETHERLA 4.08
## 12 NORWAY 4.11
## 13 SPAIN 4.06
## 14 SWEDEN 4.01
## 15 SWITZERL 4.24
## 16 TURKEY 5.77
## 17 U.K. 3.98
## 18 U.S.A. 4.82
mean() was given as an argument to summarise(), which is a dplyr verb. What we get is another tibble that contains the variable we used to group, as well as the average per country. We can also rename this column:
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar))
## # A tibble: 18 x 2
## country mean_gaspcar
## <fct> <dbl>
## 1 AUSTRIA 4.06
## 2 BELGIUM 3.92
## 3 CANADA 4.86
## 4 DENMARK 4.19
## 5 FRANCE 3.82
## 6 GERMANY 3.89
## 7 GREECE 4.88
## 8 IRELAND 4.23
## 9 ITALY 3.73
## 10 JAPAN 4.70
## 11 NETHERLA 4.08
## 12 NORWAY 4.11
## 13 SPAIN 4.06
## 14 SWEDEN 4.01
## 15 SWITZERL 4.24
## 16 TURKEY 5.77
## 17 U.K. 3.98
## 18 U.S.A. 4.82
and because the output is a tibble, we can continue to use dplyr verbs on it:
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar)) %>%
filter(country == "FRANCE")
## # A tibble: 1 x 2
## country mean_gaspcar
## <fct> <dbl>
## 1 FRANCE 3.82
summarise() is a very useful verb. For example, we can compute several descriptive statistics at once:
gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar),
sd_gaspcar = sd(lgaspcar),
max_gaspcar = max(lgaspcar),
min_gaspcar = min(lgaspcar))
## # A tibble: 18 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 0.0693 4.20 3.92
## 2 BELGIUM 3.92 0.103 4.16 3.82
## 3 CANADA 4.86 0.0262 4.90 4.81
## 4 DENMARK 4.19 0.158 4.50 4.00
## 5 FRANCE 3.82 0.0499 3.91 3.75
## 6 GERMANY 3.89 0.0239 3.93 3.85
## 7 GREECE 4.88 0.255 5.38 4.48
## 8 IRELAND 4.23 0.0437 4.33 4.16
## 9 ITALY 3.73 0.220 4.05 3.38
## 10 JAPAN 4.70 0.684 6.00 3.95
## 11 NETHERLA 4.08 0.286 4.65 3.71
## 12 NORWAY 4.11 0.123 4.44 3.96
## 13 SPAIN 4.06 0.317 4.75 3.62
## 14 SWEDEN 4.01 0.0364 4.07 3.91
## 15 SWITZERL 4.24 0.102 4.44 4.05
## 16 TURKEY 5.77 0.329 6.16 5.14
## 17 U.K. 3.98 0.0479 4.10 3.91
## 18 U.S.A. 4.82 0.0219 4.86 4.79
Because the output is a tibble, you can of course save it in a variable:
desc_gasoline <- gasoline %>%
group_by(country) %>%
summarise(mean_gaspcar = mean(lgaspcar),
sd_gaspcar = sd(lgaspcar),
max_gaspcar = max(lgaspcar),
min_gaspcar = min(lgaspcar))
And then you can answer questions such as: which country has the highest average gasoline consumption?
desc_gasoline %>%
filter(max(mean_gaspcar) == mean_gaspcar)
## # A tibble: 1 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 TURKEY 5.77 0.329 6.16 5.14
Turns out it’s Turkey. What about the minimum consumption?
desc_gasoline %>%
filter(min(mean_gaspcar) == mean_gaspcar)
## # A tibble: 1 x 5
## country mean_gaspcar sd_gaspcar max_gaspcar min_gaspcar
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 ITALY 3.73 0.220 4.05 3.38
Because the output of dplyr verbs is a tibble, it is possible to continue working with it. This is one shortcoming of the base summary() function: the object it returns is not easy to manipulate.
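To see why, here is a sketch of the base R equivalent, assuming gasoline is loaded as above:

```r
# summary() returns an object meant for printing,
# not a data frame
base_stats <- summary(gasoline$lgaspcar)
class(base_stats)
## [1] "summaryDefault" "table"

# computing the statistic by group needs tapply() or aggregate(),
# and the result is a named vector or a data frame with fixed
# columns, whereas the grouped summarise() output above is a
# tibble you can keep piping into filter(), arrange(), and so on
tapply(gasoline$lgaspcar, gasoline$country, mean)
```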
4.3.6 Adding columns with mutate() and transmute()
mutate() adds a column to the tibble, which can contain any transformation of any other variable:
gasoline %>%
group_by(country) %>%
mutate(n())
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap `n()`
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 19
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 19
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 19
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 19
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 19
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 19
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 19
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 19
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 19
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows
Using mutate() I’ve added a column that counts how many times the country appears in the tibble, using n(), another dplyr function. There are also count() and tally(), which we are going to see further down. It is also possible to name the column on the fly:
gasoline %>%
group_by(country) %>%
mutate(count = n())
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap count
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 19
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 19
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 19
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 19
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 19
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 19
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 19
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 19
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 19
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows
It is possible to do any arbitrary operation:
gasoline %>%
group_by(country) %>%
mutate(spam = exp(lgaspcar + lincomep))
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap spam
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77 0.100
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61 0.0978
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46 0.0969
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34 0.0991
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24 0.102
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12 0.104
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02 0.110
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93 0.113
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85 0.115
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79 0.122
## # … with 332 more rows
transmute() is the same as mutate(), but only returns the created variable:
gasoline %>%
group_by(country) %>%
transmute(spam = exp(lgaspcar + lincomep))
## # A tibble: 342 x 2
## # Groups: country [18]
## country spam
## <fct> <dbl>
## 1 AUSTRIA 0.100
## 2 AUSTRIA 0.0978
## 3 AUSTRIA 0.0969
## 4 AUSTRIA 0.0991
## 5 AUSTRIA 0.102
## 6 AUSTRIA 0.104
## 7 AUSTRIA 0.110
## 8 AUSTRIA 0.113
## 9 AUSTRIA 0.115
## 10 AUSTRIA 0.122
## # … with 332 more rows
4.3.7 Joining tibbles with full_join(), left_join(), right_join() and all the others
I will end this section on dplyr with some very useful verbs: the *_join() verbs. Let’s first load another dataset, SumHes, from the plm package, convert it to a tibble and rename it:
data(SumHes, package = "plm")
pwt <- SumHes %>%
as_tibble() %>%
mutate(country = tolower(country))
Let’s take a quick look at the data:
glimpse(pwt)
## Observations: 3,250
## Variables: 7
## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 19…
## $ country <chr> "algeria", "algeria", "algeria", "algeria", "algeria", "…
## $ opec <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ com <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ pop <int> 10800, 11016, 11236, 11460, 11690, 11923, 12267, 12622, …
## $ gdp <int> 1723, 1599, 1275, 1517, 1589, 1584, 1548, 1600, 1758, 18…
## $ sr <dbl> 19.9, 21.1, 15.0, 13.9, 10.6, 11.0, 8.3, 11.3, 15.1, 18.…
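One caveat before merging: the country names in gasoline are uppercase (“AUSTRIA”, “BELGIUM”, …) while we have just lowercased the ones in pwt, and the join functions match keys case-sensitively. To actually get matches on country, the case would likely need to be harmonised first, for instance:

```r
# lowercase gasoline's country names so they can match pwt's;
# *_join() comparisons on character keys are case-sensitive
gasoline <- gasoline %>%
  mutate(country = tolower(country))
```

If you skip this step, the joins below will not find any matching rows.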
We can merge both gasoline and pwt by country and year, as these two variables are common to both datasets. There are more countries and years in the pwt dataset, so when merging both, depending on which function you use, you will either have NAs for the variables where there is no match, or rows that will be dropped. Let’s start with full_join():
gas_pwt_full <- gasoline %>%
full_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
Let’s see which countries and years are included:
gas_pwt_full %>%
count(country, year)
## # A tibble: 3,592 x 3
## country year n
## <chr> <int> <int>
## 1 algeria 1960 1
## 2 algeria 1961 1
## 3 algeria 1962 1
## 4 algeria 1963 1
## 5 algeria 1964 1
## 6 algeria 1965 1
## 7 algeria 1966 1
## 8 algeria 1967 1
## 9 algeria 1968 1
## 10 algeria 1969 1
## # … with 3,582 more rows
As you can see, every country and year was included, but what happened for, say, the U.S.S.R.? This country is in pwt but not in gasoline at all:
gas_pwt_full %>%
filter(country == "u.s.s.r.")
## # A tibble: 26 x 11
## country year lgaspcar lincomep lrpmg lcarpcap opec com pop gdp
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <int> <int>
## 1 u.s.s.… 1960 NA NA NA NA no yes 214400 2397
## 2 u.s.s.… 1961 NA NA NA NA no yes 217896 2542
## 3 u.s.s.… 1962 NA NA NA NA no yes 221449 2656
## 4 u.s.s.… 1963 NA NA NA NA no yes 225060 2681
## 5 u.s.s.… 1964 NA NA NA NA no yes 227571 2854
## 6 u.s.s.… 1965 NA NA NA NA no yes 230109 3049
## 7 u.s.s.… 1966 NA NA NA NA no yes 232676 3247
## 8 u.s.s.… 1967 NA NA NA NA no yes 235272 3454
## 9 u.s.s.… 1968 NA NA NA NA no yes 237896 3730
## 10 u.s.s.… 1969 NA NA NA NA no yes 240550 3808
## # … with 16 more rows, and 1 more variable: sr <dbl>
As you probably guessed, for the rows that only exist in pwt, the variables from gasoline are filled with NAs. One could remove all these lines with filter(), keeping only the countries for which these variables are not NA everywhere, but there is a simpler solution:
gas_pwt_inner <- gasoline %>%
inner_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
Let’s use tabyl() from the janitor package, which is a very nice alternative to the table() function from base R:
library(janitor)
gas_pwt_inner %>%
tabyl(country)
## [1] country n percent
## <0 rows> (or 0-length row.names)
inner_join() returns only the rows with values in both datasets. Here the result is empty, because the country names in gasoline are uppercase while we lowercased the ones in pwt, so nothing matches. With harmonised names, almost every country from gasoline would be returned, apart from Germany (called “germany west” in pwt and “germany” in gasoline; I left it as is to provide an example of a country not in pwt). Let’s also look at the variables:
glimpse(gas_pwt_inner)
## Observations: 0
## Variables: 11
## $ country <chr>
## $ year <int>
## $ lgaspcar <dbl>
## $ lincomep <dbl>
## $ lrpmg <dbl>
## $ lcarpcap <dbl>
## $ opec <fct>
## $ com <fct>
## $ pop <int>
## $ gdp <int>
## $ sr <dbl>
The variables from both datasets are in the joined data. Contrast this to semi_join():
gas_pwt_semi <- gasoline %>%
semi_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
glimpse(gas_pwt_semi)
## Observations: 0
## Variables: 6
## $ country <fct>
## $ year <int>
## $ lgaspcar <dbl>
## $ lincomep <dbl>
## $ lrpmg <dbl>
## $ lcarpcap <dbl>
gas_pwt_semi %>%
tabyl(country)
## Warning: Factor `dat` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## country n percent
## AUSTRIA 0 NaN
## BELGIUM 0 NaN
## CANADA 0 NaN
## DENMARK 0 NaN
## FRANCE 0 NaN
## GERMANY 0 NaN
## GREECE 0 NaN
## IRELAND 0 NaN
## ITALY 0 NaN
## JAPAN 0 NaN
## NETHERLA 0 NaN
## NORWAY 0 NaN
## SPAIN 0 NaN
## SWEDEN 0 NaN
## SWITZERL 0 NaN
## TURKEY 0 NaN
## U.K. 0 NaN
## U.S.A. 0 NaN
Only the columns of gasoline are returned, and only the rows of gasoline that were matched with rows from pwt. semi_join() is not a commutative operation:
pwt_gas_semi <- pwt %>%
semi_join(gasoline, by = c("country", "year"))
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
glimpse(pwt_gas_semi)
## Observations: 0
## Variables: 7
## $ year <int>
## $ country <chr>
## $ opec <fct>
## $ com <fct>
## $ pop <int>
## $ gdp <int>
## $ sr <dbl>
gas_pwt_semi %>%
tabyl(country)
## Warning: Factor `dat` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## country n percent
## AUSTRIA 0 NaN
## BELGIUM 0 NaN
## CANADA 0 NaN
## DENMARK 0 NaN
## FRANCE 0 NaN
## GERMANY 0 NaN
## GREECE 0 NaN
## IRELAND 0 NaN
## ITALY 0 NaN
## JAPAN 0 NaN
## NETHERLA 0 NaN
## NORWAY 0 NaN
## SPAIN 0 NaN
## SWEDEN 0 NaN
## SWITZERL 0 NaN
## TURKEY 0 NaN
## U.K. 0 NaN
## U.S.A. 0 NaN
The rows are the same, but not the columns.
left_join() and right_join() return all the rows from either the dataset on the “left” (the first argument of the function) or on the “right” (the second argument of the function), but all the columns from both datasets. So depending on which countries you’re interested in, you’re going to use one or the other:
gas_pwt_left <- gasoline %>%
left_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
gas_pwt_left %>%
tabyl(country)
## country n percent
## AUSTRIA 19 0.05555556
## BELGIUM 19 0.05555556
## CANADA 19 0.05555556
## DENMARK 19 0.05555556
## FRANCE 19 0.05555556
## GERMANY 19 0.05555556
## GREECE 19 0.05555556
## IRELAND 19 0.05555556
## ITALY 19 0.05555556
## JAPAN 19 0.05555556
## NETHERLA 19 0.05555556
## NORWAY 19 0.05555556
## SPAIN 19 0.05555556
## SWEDEN 19 0.05555556
## SWITZERL 19 0.05555556
## TURKEY 19 0.05555556
## U.K. 19 0.05555556
## U.S.A. 19 0.05555556
gas_pwt_right <- gasoline %>%
right_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
gas_pwt_right %>%
tabyl(country)
## country n percent
## algeria 26 0.008
## angola 26 0.008
## argentina 26 0.008
## australia 26 0.008
## austria 26 0.008
## bangladesh 26 0.008
## barbados 26 0.008
## belgium 26 0.008
## benin 26 0.008
## bolivia 26 0.008
## botswana 26 0.008
## brazil 26 0.008
## burkina faso 26 0.008
## burundi 26 0.008
## cameroon 26 0.008
## canada 26 0.008
## cape verde is. 26 0.008
## central afr.r. 26 0.008
## chad 26 0.008
## chile 26 0.008
## china 26 0.008
## colombia 26 0.008
## comoros 26 0.008
## congo 26 0.008
## costa rica 26 0.008
## cyprus 26 0.008
## czechoslovakia 26 0.008
## denmark 26 0.008
## dominican rep. 26 0.008
## ecuador 26 0.008
## egypt 26 0.008
## el salvador 26 0.008
## ethiopia 26 0.008
## fiji 26 0.008
## finland 26 0.008
## france 26 0.008
## gabon 26 0.008
## gambia 26 0.008
## germany west 26 0.008
## ghana 26 0.008
## greece 26 0.008
## guatemala 26 0.008
## guinea 26 0.008
## guinea-biss 26 0.008
## guyana 26 0.008
## haiti 26 0.008
## honduras 26 0.008
## hong kong 26 0.008
## iceland 26 0.008
## india 26 0.008
## indonesia 26 0.008
## iran 26 0.008
## iraq 26 0.008
## ireland 26 0.008
## israel 26 0.008
## italy 26 0.008
## ivory coast 26 0.008
## jamaica 26 0.008
## japan 26 0.008
## jordan 26 0.008
## kenya 26 0.008
## korea 26 0.008
## lesotho 26 0.008
## liberia 26 0.008
## luxembourg 26 0.008
## madagascar 26 0.008
## malawi 26 0.008
## malaysia 26 0.008
## mali 26 0.008
## malta 26 0.008
## mauritania 26 0.008
## mauritius 26 0.008
## mexico 26 0.008
## morocco 26 0.008
## mozambique 26 0.008
## myanmar 26 0.008
## namibia 26 0.008
## nepal 26 0.008
## netherlands 26 0.008
## new zealand 26 0.008
## nicaragua 26 0.008
## niger 26 0.008
## nigeria 26 0.008
## norway 26 0.008
## pakistan 26 0.008
## panama 26 0.008
## papua n.guinea 26 0.008
## paraguay 26 0.008
## peru 26 0.008
## philippines 26 0.008
## portugal 26 0.008
## puerto rico 26 0.008
## reunion 26 0.008
## romania 26 0.008
## rwanda 26 0.008
## saudi arabia 26 0.008
## senegal 26 0.008
## seychelles 26 0.008
## singapore 26 0.008
## somalia 26 0.008
## south africa 26 0.008
## spain 26 0.008
## sri lanka 26 0.008
## suriname 26 0.008
## swaziland 26 0.008
## sweden 26 0.008
## switzerland 26 0.008
## syria 26 0.008
## taiwan 26 0.008
## tanzania 26 0.008
## thailand 26 0.008
## togo 26 0.008
## trinidad&tobago 26 0.008
## tunisia 26 0.008
## turkey 26 0.008
## u.k. 26 0.008
## u.s.a. 26 0.008
## u.s.s.r. 26 0.008
## uganda 26 0.008
## uruguay 26 0.008
## venezuela 26 0.008
## yugoslavia 26 0.008
## zaire 26 0.008
## zambia 26 0.008
## zimbabwe 26 0.008
The last merge function is anti_join():
gas_pwt_anti <- gasoline %>%
anti_join(pwt, by = c("country", "year"))
## Warning: Column `country` joining factor and character vector, coercing
## into character vector
glimpse(gas_pwt_anti)
## Observations: 342
## Variables: 6
## $ country <fct> AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, AUSTRIA, A…
## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1…
## $ lgaspcar <dbl> 4.173244, 4.100989, 4.073177, 4.059509, 4.037689, 4.033…
## $ lincomep <dbl> -6.474277, -6.426006, -6.407308, -6.370679, -6.322247, …
## $ lrpmg <dbl> -0.3345476, -0.3513276, -0.3795177, -0.4142514, -0.4453…
## $ lcarpcap <dbl> -9.766840, -9.608622, -9.457257, -9.343155, -9.237739, …
gas_pwt_anti %>%
tabyl(country)
## country n percent
## AUSTRIA 19 0.05555556
## BELGIUM 19 0.05555556
## CANADA 19 0.05555556
## DENMARK 19 0.05555556
## FRANCE 19 0.05555556
## GERMANY 19 0.05555556
## GREECE 19 0.05555556
## IRELAND 19 0.05555556
## ITALY 19 0.05555556
## JAPAN 19 0.05555556
## NETHERLA 19 0.05555556
## NORWAY 19 0.05555556
## SPAIN 19 0.05555556
## SWEDEN 19 0.05555556
## SWITZERL 19 0.05555556
## TURKEY 19 0.05555556
## U.K. 19 0.05555556
## U.S.A. 19 0.05555556
gas_pwt_anti has the columns of the gasoline dataset and contains the rows of gasoline with no match in pwt. Because of the case mismatch in the country names, that is all 342 rows here; with harmonised names, it would contain only “germany”, the one country from gasoline that is not in pwt.
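Since gasoline and pwt are fairly big, it can help to see the join verbs on toy data small enough to inspect in full; a sketch with two made-up mini tibbles:

```r
library(dplyr)

left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_join(left_tbl, right_tbl, by = "id") # ids 2 and 3; columns id, x, y
full_join(left_tbl, right_tbl, by = "id")  # ids 1 to 4; NA where there is no match
left_join(left_tbl, right_tbl, by = "id")  # ids 1, 2, 3; NA in y for id 1
semi_join(left_tbl, right_tbl, by = "id")  # ids 2 and 3; columns of left_tbl only
anti_join(left_tbl, right_tbl, by = "id")  # id 1 only; columns of left_tbl only
```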
That was it for the basic {dplyr} verbs. Next, we’re going to learn about {tidyr}.
4.4 Reshaping data with tidyr
Another important package from the tidyverse that goes hand in hand with dplyr is tidyr. tidyr is the package you need when it’s time to reshape data. As of March 2019, the development version of tidyr introduced two new functions that make reshaping data easier, pivot_long() and pivot_wide(). To install the development version of tidyr, use the following line:
devtools::install_github("tidyverse/tidyr")
The legacy functions, spread() and gather(), will remain in the package, but their use will be discouraged. I will start by presenting pivot_wide() and pivot_long(), and then, for reference, will show how to solve similar problems using gather() and spread().
4.4.1 pivot_wide() and pivot_long()
Let’s first create a fake dataset:
library(tidyr)
survey_data <- tribble(
~id, ~variable, ~value,
1, "var1", 1,
1, "var2", 0.2,
NA, "var3", 0.3,
2, "var1", 1.4,
2, "var2", 1.9,
2, "var3", 4.1,
3, "var1", 0.1,
3, "var2", 2.8,
3, "var3", 8.9,
4, "var1", 1.7,
NA, "var2", 1.9,
4, "var3", 7.6
)
head(survey_data)
## # A tibble: 6 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
I used the tribble() function from the {tibble} package to create this fake dataset. I’ll discuss this package later; for now, let’s focus on {tidyr}.
Let’s suppose that we need the data to be in the wide format, which means var1, var2 and var3 each need to be their own column. To do this, we need to use the pivot_wide() function. Why wide? Because the dataset will be wide, meaning it will have more columns than rows.
survey_data %>%
pivot_wide(keys = id, names_from = variable, values_from = value)
## # A tibble: 5 x 4
## id var1 var2 var3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.2 NA
## 2 NA NA 1.9 0.3
## 3 2 1.4 1.9 4.1
## 4 3 0.1 2.8 8.9
## 5 4 1.7 NA 7.6
Let’s go through pivot_wide()’s arguments: the first is keys =, which requires the variable that uniquely identifies the rows to be supplied. names_from = is where you input the variable that will generate the names of the new columns. In our case, the variable column has three values, var1, var2 and var3, and these are now the names of the new columns. Finally, values_from = is where you specify the column containing the values that will fill the data frame. I find the argument names names_from = and values_from = quite explicit.
As you can see, there are some missing values. Let’s suppose that we know that these missing values are true 0’s. pivot_wide() has an argument called values_fill = that makes it easy to replace them:
survey_data %>%
pivot_wide(keys = id, names_from = variable, values_from = value, values_fill = list(value = 0))
## # A tibble: 5 x 4
## id var1 var2 var3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.2 0
## 2 NA 0 1.9 0.3
## 3 2 1.4 1.9 4.1
## 4 3 0.1 2.8 8.9
## 5 4 1.7 0 7.6
A list of variables and their respective values to replace NA’s with must be supplied to values_fill
.
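If the data is already in the wide format and you only then decide to replace the NAs, {tidyr} also provides replace_na(), which takes the same kind of named list. Here is a quick sketch on a small made-up tibble (the column names are just for illustration):

```r
library(tidyr)
library(tibble)

# A small wide tibble with missing values
wide <- tribble(
  ~id, ~var1, ~var2,
  1,   1,     NA,
  2,   NA,    1.9
)

# replace_na() takes a named list: one replacement value per column
wide_filled <- replace_na(wide, list(var1 = 0, var2 = 0))
wide_filled
```

The difference with values_fill = is that replace_na() works on any data frame after the fact, not only while pivoting.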
Let’s now use another dataset, which you can get from
here
(downloaded from: http://www.statistiques.public.lu/stat/TableViewer/tableView.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91). This data set gives the unemployment rate for each Luxembourgish
canton from 2001 to 2015. We will come back to this data later on to learn how to plot it. For now,
let’s use it to learn more about {tidyr}
.
unemp_lux_data <- rio::import("datasets/unemployment/all/unemployment_lux_all.csv")
head(unemp_lux_data)
## division year active_population of_which_non_wage_earners
## 1 Beaufort 2001 688 85
## 2 Beaufort 2002 742 85
## 3 Beaufort 2003 773 85
## 4 Beaufort 2004 828 80
## 5 Beaufort 2005 866 96
## 6 Beaufort 2006 893 87
## of_which_wage_earners total_employed_population unemployed
## 1 568 653 35
## 2 631 716 26
## 3 648 733 40
## 4 706 786 42
## 5 719 815 51
## 6 746 833 60
## unemployment_rate_in_percent
## 1 5.09
## 2 3.50
## 3 5.17
## 4 5.07
## 5 5.89
## 6 6.72
Now, let’s suppose that for our purposes, it would make more sense to have the data in a wide format,
where columns are “division times year” and the value is the unemployment rate. This can be easily done
by providing more columns to names_from =.
unemp_lux_data2 <- unemp_lux_data %>%
filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>%
select(division, year, unemployment_rate_in_percent) %>%
rowid_to_column()
unemp_lux_data2 %>%
pivot_wide(names_from = c(division, year), values_from = unemployment_rate_in_percent)
## # A tibble: 48 x 49
## rowid Bertrange_2013 Bertrange_2014 Bertrange_2015 Differdange_2013
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.69 NA NA NA
## 2 2 NA 5.65 NA NA
## 3 3 NA NA 5.35 NA
## 4 4 NA NA NA 13.2
## 5 5 NA NA NA NA
## 6 6 NA NA NA NA
## 7 7 NA NA NA NA
## 8 8 NA NA NA NA
## 9 9 NA NA NA NA
## 10 10 NA NA NA NA
## # … with 38 more rows, and 44 more variables: Differdange_2014 <dbl>,
## # Differdange_2015 <dbl>, Dudelange_2013 <dbl>, Dudelange_2014 <dbl>,
## # Dudelange_2015 <dbl>, Frisange_2013 <dbl>, Frisange_2014 <dbl>,
## # Frisange_2015 <dbl>, Hesperange_2013 <dbl>, Hesperange_2014 <dbl>,
## # Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, Leudelange_2014 <dbl>,
## # Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, Mondercange_2014 <dbl>,
## # Mondercange_2015 <dbl>, Pétange_2013 <dbl>, Pétange_2014 <dbl>,
## # Pétange_2015 <dbl>, Rumelange_2013 <dbl>, Rumelange_2014 <dbl>,
## # Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, Schifflange_2014 <dbl>,
## # Schifflange_2015 <dbl>, Schuttrange_2013 <dbl>,
## # Schuttrange_2014 <dbl>, Schuttrange_2015 <dbl>, Tuntange_2013 <dbl>,
## # Tuntange_2014 <dbl>, Tuntange_2015 <dbl>, Useldange_2013 <dbl>,
## # Useldange_2014 <dbl>, Useldange_2015 <dbl>, Walferdange_2013 <dbl>,
## # Walferdange_2014 <dbl>, Walferdange_2015 <dbl>, Wincrange_2013 <dbl>,
## # Wincrange_2014 <dbl>, Wincrange_2015 <dbl>, Wormeldange_2013 <dbl>,
## # Wormeldange_2014 <dbl>, Wormeldange_2015 <dbl>
In the filter()
statement, I only kept data from 2013 to 2017, “division”s ending with the string “ange”
(“division” can be a canton or a commune, for example “Canton Redange”, a canton, or “Hesperange” a commune),
and removed the cantons as I’m only interested in communes.
I then only kept the columns I’m interested in and pivoted the data to a wide format.
Also, I needed to add a unique identifier to the data frame. For this, I used the rowid_to_column()
function, from the {tibble} package, which adds a new column to the data frame with an id, going from 1 to
the number of rows in the data frame. If I did not add this identifier, the statement would still work:
unemp_lux_data3 <- unemp_lux_data %>%
filter(year %in% seq(2013, 2017), str_detect(division, ".*ange$"), !str_detect(division, ".*Canton.*")) %>%
select(division, year, unemployment_rate_in_percent)
unemp_lux_data3 %>%
pivot_wide(names_from = c(division, year), values_from = unemployment_rate_in_percent)
## # A tibble: 1 x 48
## Bertrange_2013 Bertrange_2014 Bertrange_2015 Differdange_2013
## <dbl> <dbl> <dbl> <dbl>
## 1 5.69 5.65 5.35 13.2
## # … with 44 more variables: Differdange_2014 <dbl>,
## # Differdange_2015 <dbl>, Dudelange_2013 <dbl>, Dudelange_2014 <dbl>,
## # Dudelange_2015 <dbl>, Frisange_2013 <dbl>, Frisange_2014 <dbl>,
## # Frisange_2015 <dbl>, Hesperange_2013 <dbl>, Hesperange_2014 <dbl>,
## # Hesperange_2015 <dbl>, Leudelange_2013 <dbl>, Leudelange_2014 <dbl>,
## # Leudelange_2015 <dbl>, Mondercange_2013 <dbl>, Mondercange_2014 <dbl>,
## # Mondercange_2015 <dbl>, Pétange_2013 <dbl>, Pétange_2014 <dbl>,
## # Pétange_2015 <dbl>, Rumelange_2013 <dbl>, Rumelange_2014 <dbl>,
## # Rumelange_2015 <dbl>, Schifflange_2013 <dbl>, Schifflange_2014 <dbl>,
## # Schifflange_2015 <dbl>, Schuttrange_2013 <dbl>,
## # Schuttrange_2014 <dbl>, Schuttrange_2015 <dbl>, Tuntange_2013 <dbl>,
## # Tuntange_2014 <dbl>, Tuntange_2015 <dbl>, Useldange_2013 <dbl>,
## # Useldange_2014 <dbl>, Useldange_2015 <dbl>, Walferdange_2013 <dbl>,
## # Walferdange_2014 <dbl>, Walferdange_2015 <dbl>, Wincrange_2013 <dbl>,
## # Wincrange_2014 <dbl>, Wincrange_2015 <dbl>, Wormeldange_2013 <dbl>,
## # Wormeldange_2014 <dbl>, Wormeldange_2015 <dbl>
and the result actually looks even better, but only because there are no repeated values; there is only one unemployment rate for each “commune times year”. I will come back to this later on, with another example that might be clearer.
You might have noticed that because there is no data for
the years 2016 and 2017, these columns do not appear in the data. But suppose that we need to have
these columns, so that a colleague from another department can fill in the values. This is possible
by providing a data frame with the detailed specifications of the result data frame. This optional
data frame must have at least two columns: .name, which contains the column names you want, and
.value, which contains the name of the column that holds the values.
unemp_spec <- unemp_lux_data %>%
expand(division, year = c(year, 2016, 2017), .value = "unemployment_rate_in_percent") %>%
unite(".name", division, year, remove = FALSE)
unemp_spec
The spec has more rows than the data frame (80 vs 48), so this will cause issues if you try to use the specification directly:
unemp_lux_data %>%
pivot_wide(spec = unemp_spec)
Error in data.frame(row = row_id, col = col_id) : arguments imply differing number of rows: 80, 48
So to make it work, we first need to create a column that uniquely identifies each row:
unemp_lux_data4 <- unemp_lux_data %>%
select(division, year, unemployment_rate_in_percent) %>%
rowid_to_column() %>%
pivot_wide(spec = unemp_spec)
unemp_lux_data4
## # A tibble: 1,770 x 2,007
## rowid Beaufort_2001 Beaufort_2002 Beaufort_2003 Beaufort_2004
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.09 NA NA NA
## 2 2 NA 3.5 NA NA
## 3 3 NA NA 5.17 NA
## 4 4 NA NA NA 5.07
## 5 5 NA NA NA NA
## 6 6 NA NA NA NA
## 7 7 NA NA NA NA
## 8 8 NA NA NA NA
## 9 9 NA NA NA NA
## 10 10 NA NA NA NA
## # … with 1,760 more rows, and 2,002 more variables: Beaufort_2005 <dbl>,
## # Beaufort_2006 <dbl>, Beaufort_2007 <dbl>, Beaufort_2008 <dbl>,
## # Beaufort_2009 <dbl>, Beaufort_2010 <dbl>, Beaufort_2011 <dbl>,
## # Beaufort_2012 <dbl>, Beaufort_2013 <dbl>, Beaufort_2014 <dbl>,
## # Beaufort_2015 <dbl>, Beaufort_2016 <dbl>, Beaufort_2017 <dbl>,
## # Bech_2001 <dbl>, Bech_2002 <dbl>, Bech_2003 <dbl>, Bech_2004 <dbl>,
## # Bech_2005 <dbl>, Bech_2006 <dbl>, Bech_2007 <dbl>, Bech_2008 <dbl>,
## # Bech_2009 <dbl>, Bech_2010 <dbl>, Bech_2011 <dbl>, Bech_2012 <dbl>,
## # Bech_2013 <dbl>, Bech_2014 <dbl>, Bech_2015 <dbl>, Bech_2016 <dbl>,
## # Bech_2017 <dbl>, Beckerich_2001 <dbl>, Beckerich_2002 <dbl>,
## # Beckerich_2003 <dbl>, Beckerich_2004 <dbl>, Beckerich_2005 <dbl>,
## # Beckerich_2006 <dbl>, Beckerich_2007 <dbl>, Beckerich_2008 <dbl>,
## # Beckerich_2009 <dbl>, Beckerich_2010 <dbl>, Beckerich_2011 <dbl>,
## # Beckerich_2012 <dbl>, Beckerich_2013 <dbl>, Beckerich_2014 <dbl>,
## # Beckerich_2015 <dbl>, Beckerich_2016 <dbl>, Beckerich_2017 <dbl>,
## # Berdorf_2001 <dbl>, Berdorf_2002 <dbl>, Berdorf_2003 <dbl>,
## # Berdorf_2004 <dbl>, Berdorf_2005 <dbl>, Berdorf_2006 <dbl>,
## # Berdorf_2007 <dbl>, Berdorf_2008 <dbl>, Berdorf_2009 <dbl>,
## # Berdorf_2010 <dbl>, Berdorf_2011 <dbl>, Berdorf_2012 <dbl>,
## # Berdorf_2013 <dbl>, Berdorf_2014 <dbl>, Berdorf_2015 <dbl>,
## # Berdorf_2016 <dbl>, Berdorf_2017 <dbl>, Bertrange_2001 <dbl>,
## # Bertrange_2002 <dbl>, Bertrange_2003 <dbl>, Bertrange_2004 <dbl>,
## # Bertrange_2005 <dbl>, Bertrange_2006 <dbl>, Bertrange_2007 <dbl>,
## # Bertrange_2008 <dbl>, Bertrange_2009 <dbl>, Bertrange_2010 <dbl>,
## # Bertrange_2011 <dbl>, Bertrange_2012 <dbl>, Bertrange_2013 <dbl>,
## # Bertrange_2014 <dbl>, Bertrange_2015 <dbl>, Bertrange_2016 <dbl>,
## # Bertrange_2017 <dbl>, Bettembourg_2001 <dbl>, Bettembourg_2002 <dbl>,
## # Bettembourg_2003 <dbl>, Bettembourg_2004 <dbl>,
## # Bettembourg_2005 <dbl>, Bettembourg_2006 <dbl>,
## # Bettembourg_2007 <dbl>, Bettembourg_2008 <dbl>,
## # Bettembourg_2009 <dbl>, Bettembourg_2010 <dbl>,
## # Bettembourg_2011 <dbl>, Bettembourg_2012 <dbl>,
## # Bettembourg_2013 <dbl>, Bettembourg_2014 <dbl>,
## # Bettembourg_2015 <dbl>, Bettembourg_2016 <dbl>,
## # Bettembourg_2017 <dbl>, Bettendorf_2001 <dbl>, Bettendorf_2002 <dbl>,
## # …
Notice that we now have columns for 2016 and 2017 too. Let’s clean the data a little bit more:
unemp_lux_data4 %>%
select(-rowid) %>%
fill(matches(".*"), .direction = "down") %>%
slice(n())
## # A tibble: 1 x 2,006
## Beaufort_2001 Beaufort_2002 Beaufort_2003 Beaufort_2004 Beaufort_2005
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.09 3.5 5.17 5.07 5.89
## # … with 2,001 more variables: Beaufort_2006 <dbl>, Beaufort_2007 <dbl>,
## # Beaufort_2008 <dbl>, Beaufort_2009 <dbl>, Beaufort_2010 <dbl>,
## # Beaufort_2011 <dbl>, Beaufort_2012 <dbl>, Beaufort_2013 <dbl>,
## # Beaufort_2014 <dbl>, Beaufort_2015 <dbl>, Beaufort_2016 <dbl>,
## # Beaufort_2017 <dbl>, Bech_2001 <dbl>, Bech_2002 <dbl>,
## # Bech_2003 <dbl>, Bech_2004 <dbl>, Bech_2005 <dbl>, Bech_2006 <dbl>,
## # Bech_2007 <dbl>, Bech_2008 <dbl>, Bech_2009 <dbl>, Bech_2010 <dbl>,
## # Bech_2011 <dbl>, Bech_2012 <dbl>, Bech_2013 <dbl>, Bech_2014 <dbl>,
## # Bech_2015 <dbl>, Bech_2016 <dbl>, Bech_2017 <dbl>,
## # Beckerich_2001 <dbl>, Beckerich_2002 <dbl>, Beckerich_2003 <dbl>,
## # Beckerich_2004 <dbl>, Beckerich_2005 <dbl>, Beckerich_2006 <dbl>,
## # Beckerich_2007 <dbl>, Beckerich_2008 <dbl>, Beckerich_2009 <dbl>,
## # Beckerich_2010 <dbl>, Beckerich_2011 <dbl>, Beckerich_2012 <dbl>,
## # Beckerich_2013 <dbl>, Beckerich_2014 <dbl>, Beckerich_2015 <dbl>,
## # Beckerich_2016 <dbl>, Beckerich_2017 <dbl>, Berdorf_2001 <dbl>,
## # Berdorf_2002 <dbl>, Berdorf_2003 <dbl>, Berdorf_2004 <dbl>,
## # Berdorf_2005 <dbl>, Berdorf_2006 <dbl>, Berdorf_2007 <dbl>,
## # Berdorf_2008 <dbl>, Berdorf_2009 <dbl>, Berdorf_2010 <dbl>,
## # Berdorf_2011 <dbl>, Berdorf_2012 <dbl>, Berdorf_2013 <dbl>,
## # Berdorf_2014 <dbl>, Berdorf_2015 <dbl>, Berdorf_2016 <dbl>,
## # Berdorf_2017 <dbl>, Bertrange_2001 <dbl>, Bertrange_2002 <dbl>,
## # Bertrange_2003 <dbl>, Bertrange_2004 <dbl>, Bertrange_2005 <dbl>,
## # Bertrange_2006 <dbl>, Bertrange_2007 <dbl>, Bertrange_2008 <dbl>,
## # Bertrange_2009 <dbl>, Bertrange_2010 <dbl>, Bertrange_2011 <dbl>,
## # Bertrange_2012 <dbl>, Bertrange_2013 <dbl>, Bertrange_2014 <dbl>,
## # Bertrange_2015 <dbl>, Bertrange_2016 <dbl>, Bertrange_2017 <dbl>,
## # Bettembourg_2001 <dbl>, Bettembourg_2002 <dbl>,
## # Bettembourg_2003 <dbl>, Bettembourg_2004 <dbl>,
## # Bettembourg_2005 <dbl>, Bettembourg_2006 <dbl>,
## # Bettembourg_2007 <dbl>, Bettembourg_2008 <dbl>,
## # Bettembourg_2009 <dbl>, Bettembourg_2010 <dbl>,
## # Bettembourg_2011 <dbl>, Bettembourg_2012 <dbl>,
## # Bettembourg_2013 <dbl>, Bettembourg_2014 <dbl>,
## # Bettembourg_2015 <dbl>, Bettembourg_2016 <dbl>,
## # Bettembourg_2017 <dbl>, Bettendorf_2001 <dbl>, Bettendorf_2002 <dbl>,
## # Bettendorf_2003 <dbl>, …
We will learn about fill(), another {tidyr} function, a bit later in this chapter, but its basic
purpose is to fill rows with whatever value comes before or after the missing values. slice(n())
then only keeps the last row of the data frame, which is the row that contains all the values (except
for 2016 and 2017, which have missing values, as we wanted).
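The fill()-then-slice(n()) trick used above is easier to see on a toy tibble (made up for illustration): each column is filled downwards, so the last row collects the last non-missing value of every column.

```r
library(dplyr)
library(tidyr)
library(tibble)

toy <- tribble(
  ~x,  ~y,
  1,   NA,
  NA,  2,
  NA,  NA
)

last_row <- toy %>%
  fill(x, y, .direction = "down") %>% # carry the last observation forward
  slice(n())                          # keep only the last row

last_row
```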
Here is another example of the importance of having an identifier column when using a spec:
data(mtcars)
mtcars_spec <- mtcars %>%
expand(am, cyl, .value = "mpg") %>%
unite(".name", am, cyl, remove = FALSE)
mtcars_spec
We can now transform the data:
mtcars %>%
pivot_wide(spec = mtcars_spec)
## # A tibble: 32 x 14
## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA
## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA
## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8
## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA
## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA
## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA
## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA
## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA
## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA
## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA
## # … with 22 more rows, and 2 more variables: `1_6` <dbl>, `1_8` <dbl>
As you can see, there are several values of “mpg” for some combinations of “am” times “cyl”. If we remove the other columns, each row will not be uniquely identified anymore, and thus an error will be returned:
mtcars %>%
select(am, cyl, mpg) %>%
pivot_wide(spec = mtcars_spec)
Error in data.frame(row = row_id, col = col_id) : arguments imply differing number of rows: 6, 32
So you have to be careful with this.
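One way to avoid this kind of surprise is to check, before pivoting, whether the columns you keep still identify each row uniquely. A quick sketch with dplyr::count():

```r
library(dplyr)

data(mtcars)

# Count how many rows share each combination of am and cyl;
# any n > 1 means pivoting mpg by these columns needs an id column
duplicated_combos <- mtcars %>%
  count(am, cyl) %>%
  filter(n > 1)

duplicated_combos
```

Every combination of “am” and “cyl” appears more than once, which is exactly why the pivot above fails without an identifier.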
pivot_long() is used when you need to go from a wide to a long dataset, meaning, a dataset where
there are some columns that should not be columns, but rather, the levels of a factor variable.
Let’s suppose that the “am” column is split into two columns, 0 for
automatic and 1 for manual transmissions, and that the values filling these columns are miles per
gallon, “mpg”:
mtcars_wide_am <- mtcars %>%
pivot_wide(names_from = am, values_from = mpg)
mtcars_wide_am %>%
select(`0`, `1`, everything())
## # A tibble: 32 x 11
## `0` `1` cyl disp hp drat wt qsec vs gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 4 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 4 21.4 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 5 18.7 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 6 18.1 NA 6 225 105 2.76 3.46 20.2 1 3 1
## 7 14.3 NA 8 360 245 3.21 3.57 15.8 0 3 4
## 8 24.4 NA 4 147. 62 3.69 3.19 20 1 4 2
## 9 22.8 NA 4 141. 95 3.92 3.15 22.9 1 4 2
## 10 19.2 NA 6 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
As you can see, the “0” and “1” columns should not be their own columns (unless there is a very specific and good reason they should be); rather, they should be the levels of another column (in our case, “am”).
We can go back to a long dataset like so:
mtcars_wide_am %>%
pivot_long(cols = c(`1`, `0`), names_to = "am", values_to = "mpg") %>%
select(am, mpg, everything())
## # A tibble: 64 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4
## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4
## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1
## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## # … with 54 more rows
In the cols argument, you need to list all the variables that need to be transformed. Only 1 and
0 must be pivoted, so I list them. Just for illustration purposes, imagine that we would need
to pivot 50 columns. It would be faster to list the columns that do not need to be pivoted. This
can be achieved by listing the columns that must be excluded with - in front, and maybe using
matches() with a regular expression:
mtcars_wide_am %>%
pivot_long(cols = -matches("^[[:alpha:]]"), names_to = "am", values_to = "mpg") %>%
select(am, mpg, everything())
## # A tibble: 64 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4
## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4
## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1
## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## # … with 54 more rows
Every column whose name starts with a letter must stay as it is, so there is no need to pivot them;
the - in front of matches() excludes them from the pivot. I use matches() with a regular expression
so that I don’t have to type the names of all the columns. select() is used to re-order the columns,
only for viewing purposes.
names_to = takes a string as argument, which will be the name of the new column containing the
levels 0 and 1, and values_to = also takes a string as argument, which will be the name of
the column containing the values. Finally, you can see that there are a lot of NAs in the
output. These can be removed easily:
mtcars_wide_am %>%
pivot_long(cols = c(`1`, `0`), names_to = "am", values_to = "mpg", na.rm = TRUE) %>%
select(am, mpg, everything())
## # A tibble: 32 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 3 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 4 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 5 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## 6 0 18.1 6 225 105 2.76 3.46 20.2 1 3 1
## 7 0 14.3 8 360 245 3.21 3.57 15.8 0 3 4
## 8 0 24.4 4 147. 62 3.69 3.19 20 1 4 2
## 9 0 22.8 4 141. 95 3.92 3.15 22.9 1 4 2
## 10 0 19.2 6 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
Now for a more advanced example, let’s suppose that we are dealing with the following wide dataset:
mtcars_wide <- mtcars %>%
pivot_wide(spec = mtcars_spec)
mtcars_wide
## # A tibble: 32 x 14
## disp hp drat wt qsec vs gear carb `0_4` `0_6` `0_8` `1_4`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160 110 3.9 2.62 16.5 0 4 4 NA NA NA NA
## 2 160 110 3.9 2.88 17.0 0 4 4 NA NA NA NA
## 3 108 93 3.85 2.32 18.6 1 4 1 NA NA NA 22.8
## 4 258 110 3.08 3.22 19.4 1 3 1 NA 21.4 NA NA
## 5 360 175 3.15 3.44 17.0 0 3 2 NA NA 18.7 NA
## 6 225 105 2.76 3.46 20.2 1 3 1 NA 18.1 NA NA
## 7 360 245 3.21 3.57 15.8 0 3 4 NA NA 14.3 NA
## 8 147. 62 3.69 3.19 20 1 4 2 24.4 NA NA NA
## 9 141. 95 3.92 3.15 22.9 1 4 2 22.8 NA NA NA
## 10 168. 123 3.92 3.44 18.3 1 4 4 NA 19.2 NA NA
## # … with 22 more rows, and 2 more variables: `1_6` <dbl>, `1_8` <dbl>
The difficulty here is that we have columns with two levels of information. For instance, the
column “0_4” contains the miles per gallon values for automatic cars (0) with 4 cylinders.
The first step is to pivot these columns:
mtcars_wide %>%
pivot_long(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", na.rm = TRUE) %>%
select(am_cyl, mpg, everything())
## # A tibble: 32 x 10
## am_cyl mpg disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1_6 21 160 110 3.9 2.62 16.5 0 4 4
## 2 1_6 21 160 110 3.9 2.88 17.0 0 4 4
## 3 1_4 22.8 108 93 3.85 2.32 18.6 1 4 1
## 4 0_6 21.4 258 110 3.08 3.22 19.4 1 3 1
## 5 0_8 18.7 360 175 3.15 3.44 17.0 0 3 2
## 6 0_6 18.1 225 105 2.76 3.46 20.2 1 3 1
## 7 0_8 14.3 360 245 3.21 3.57 15.8 0 3 4
## 8 0_4 24.4 147. 62 3.69 3.19 20 1 4 2
## 9 0_4 22.8 141. 95 3.92 3.15 22.9 1 4 2
## 10 0_6 19.2 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
Now we only need to separate the “am_cyl” column into two new columns, “am” and “cyl”:
mtcars_wide %>%
pivot_long(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", na.rm = TRUE) %>%
separate(am_cyl, into = c("am", "cyl"), sep = "_") %>%
select(am, cyl, mpg, everything())
## # A tibble: 32 x 11
## am cyl mpg disp hp drat wt qsec vs gear carb
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4
## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4
## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1
## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1
## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2
## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1
## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4
## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2
## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2
## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
It is also possible to construct a specification data frame, just like for pivot_wide()
:
mtcars_spec_long <- mtcars_wide %>%
pivot_long_spec(matches("0|1"), values_to = "mpg") %>%
separate(name, c("am", "cyl"), sep = "_")
mtcars_spec_long
## # A tibble: 6 x 4
## .name .value am cyl
## <chr> <chr> <chr> <chr>
## 1 0_4 mpg 0 4
## 2 0_6 mpg 0 6
## 3 0_8 mpg 0 8
## 4 1_4 mpg 1 4
## 5 1_6 mpg 1 6
## 6 1_8 mpg 1 8
This spec can now be specified to pivot_long()
:
mtcars_wide %>%
pivot_long(spec = mtcars_spec_long, na.rm = TRUE) %>%
select(am, cyl, mpg, everything())
## # A tibble: 32 x 11
## am cyl mpg disp hp drat wt qsec vs gear carb
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 21 160 110 3.9 2.62 16.5 0 4 4
## 2 1 6 21 160 110 3.9 2.88 17.0 0 4 4
## 3 1 4 22.8 108 93 3.85 2.32 18.6 1 4 1
## 4 0 6 21.4 258 110 3.08 3.22 19.4 1 3 1
## 5 0 8 18.7 360 175 3.15 3.44 17.0 0 3 2
## 6 0 6 18.1 225 105 2.76 3.46 20.2 1 3 1
## 7 0 8 14.3 360 245 3.21 3.57 15.8 0 3 4
## 8 0 4 24.4 147. 62 3.69 3.19 20 1 4 2
## 9 0 4 22.8 141. 95 3.92 3.15 22.9 1 4 2
## 10 0 6 19.2 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
Defining specifications gives a lot of flexibility and, in some complicated cases, is the way to go.
4.4.2 spread()
and gather()
survey_data
is a long dataset. We can reshape it to be wide using the spread()
function:
wide_data <- survey_data %>%
spread(variable, value)
head(wide_data)
## # A tibble: 5 x 4
## id var1 var2 var3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.2 NA
## 2 2 1.4 1.9 4.1
## 3 3 0.1 2.8 8.9
## 4 4 1.7 NA 7.6
## 5 NA NA 1.9 0.3
This means that we spread the column called “variable”, which will produce one column per category of “variable”. Then we fill in the rows with the data contained in the column “value”.
To go from a wide dataset to a long one, we use gather()
:
long_data <- wide_data %>%
gather(variable, value, var1, var2)
print(long_data)
## # A tibble: 10 x 4
## id var3 variable value
## <dbl> <dbl> <chr> <dbl>
## 1 1 NA var1 1
## 2 2 4.1 var1 1.4
## 3 3 8.9 var1 0.1
## 4 4 7.6 var1 1.7
## 5 NA 0.3 var1 NA
## 6 1 NA var2 0.2
## 7 2 4.1 var2 1.9
## 8 3 8.9 var2 2.8
## 9 4 7.6 var2 NA
## 10 NA 0.3 var2 1.9
long_data is almost survey_data again, but not quite: because we only gathered var1 and var2,
the var3 column stayed as its own column, and the rows are in a different order.
In the wide_data tibble, we had 4 columns: id, var1, var2 and var3. We chose to stack var1 and
var2 in a new column, which we called “variable”. This is the “key”. For the values, we used
the values contained in var1 and var2. Sometimes using spread() or gather() requires
some trial and error. I advise you to play around with the examples above to really grasp how these
powerful functions work.
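Just like with pivot_long(), if most columns must be gathered, it is faster to list the ones to exclude with a - in front. A small sketch on a made-up tibble:

```r
library(dplyr)
library(tidyr)
library(tibble)

wide <- tribble(
  ~id, ~var1, ~var2,
  1,   1,     0.2,
  2,   1.4,   1.9
)

# Gather every column except id
long <- wide %>%
  gather(variable, value, -id)

long
```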
4.4.3 fill()
and full_seq()
fill()
is pretty useful to… fill in missing values. For instance, in survey_data
, some “id”s
are missing:
survey_data
## # A tibble: 12 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
It seems pretty obvious that the first NA is supposed to be 1 and the second missing value is
supposed to be 4. With fill(), this is pretty easy to achieve:
survey_data %>%
fill(id, .direction = "down")
full_seq() is similar in spirit: it generates the complete sequence between the first and last elements of a vector:
full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1)
## [1] "2018-08-01" "2018-08-02" "2018-08-03"
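full_seq() is not limited to dates; for a numeric vector it generates every value from the minimum to the maximum, stepping by the period you supply:

```r
library(tidyr)

# Complete a numeric sequence with a step of 1
full_seq(c(1, 5), period = 1)
# [1] 1 2 3 4 5

# The period can be any step, here 2
full_seq(c(2, 8), period = 2)
# [1] 2 4 6 8
```

The date example above works the same way, with a period of 1 day.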
We can add this as the date column to our survey data:
survey_data %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4))
## # A tibble: 12 x 4
## id variable value date
## <dbl> <chr> <dbl> <date>
## 1 1 var1 1 2018-08-01
## 2 1 var2 0.2 2018-08-02
## 3 NA var3 0.3 2018-08-03
## 4 2 var1 1.4 2018-08-01
## 5 2 var2 1.9 2018-08-02
## 6 2 var3 4.1 2018-08-03
## 7 3 var1 0.1 2018-08-01
## 8 3 var2 2.8 2018-08-02
## 9 3 var3 8.9 2018-08-03
## 10 4 var1 1.7 2018-08-01
## 11 NA var2 1.9 2018-08-02
## 12 4 var3 7.6 2018-08-03
I use the base rep() function to repeat the sequence of dates 4 times and then, using mutate(),
I add it to the data frame.
Putting all these operations together:
survey_data %>%
fill(id, .direction = "down") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
spread(variable, value)
As you can see, this creates a lot of explicit NA
values. The best would be to fill in the missing
values and add the date column, and then work with that format. For example, to get the average
of the values by date:
survey_data %>%
fill(id, .direction = "down") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
group_by(date) %>%
summarise(mean(value))
Or by date
and variable
:
survey_data %>%
fill(id, .direction = "down") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4)) %>%
group_by(date, variable) %>%
summarise(mean(value))
As you can see, you can chain any {tidyverse} verbs, whether they come from {dplyr} or {tidyr}.
4.4.4 Put order in your columns with separate()
, unite()
, and in your rows with separate_rows()
Sometimes, data can be in a format that makes working with it needlessly painful. For example, you get this:
survey_data_not_tidy
## # A tibble: 12 x 3
## id variable_date value
## <dbl> <chr> <dbl>
## 1 1 1/2018-08-01 1
## 2 1 1/2018-08-02 0.2
## 3 NA 1/2018-08-03 0.3
## 4 2 2/2018-08-01 1.4
## 5 2 2/2018-08-02 1.9
## 6 2 2/2018-08-03 4.1
## 7 3 3/2018-08-01 0.1
## 8 3 3/2018-08-02 2.8
## 9 3 3/2018-08-03 8.9
## 10 4 4/2018-08-01 1.7
## 11 NA 4/2018-08-02 1.9
## 12 4 4/2018-08-03 7.6
Dealing with this is simple, thanks to separate()
:
survey_data_not_tidy %>%
separate(variable_date, into = c("variable", "date"), sep = "/")
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 1 2018-08-01 1
## 2 1 1 2018-08-02 0.2
## 3 NA 1 2018-08-03 0.3
## 4 2 2 2018-08-01 1.4
## 5 2 2 2018-08-02 1.9
## 6 2 2 2018-08-03 4.1
## 7 3 3 2018-08-01 0.1
## 8 3 3 2018-08-02 2.8
## 9 3 3 2018-08-03 8.9
## 10 4 4 2018-08-01 1.7
## 11 NA 4 2018-08-02 1.9
## 12 4 4 2018-08-03 7.6
The variable_date
column gets separated into two columns, variable
and date
. One also needs
to specify the separator, in this case “/”.
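separate() has a few more arguments worth knowing; convert = TRUE, for instance, converts the new columns to an appropriate type instead of leaving everything as character. A quick sketch on a made-up tibble:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(measure = c("height_180", "weight_75"))

# convert = TRUE turns "180" and "75" into integers
df_sep <- df %>%
  separate(measure, into = c("what", "amount"), sep = "_", convert = TRUE)

df_sep
```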
unite() is the reverse operation, which can be useful when you are confronted with this situation:
survey_data2
## # A tibble: 12 x 6
## id variable year month day value
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 1 1 2018 08 01 1
## 2 1 1 2018 08 02 0.2
## 3 NA 1 2018 08 03 0.3
## 4 2 2 2018 08 01 1.4
## 5 2 2 2018 08 02 1.9
## 6 2 2 2018 08 03 4.1
## 7 3 3 2018 08 01 0.1
## 8 3 3 2018 08 02 2.8
## 9 3 3 2018 08 03 8.9
## 10 4 4 2018 08 01 1.7
## 11 NA 4 2018 08 02 1.9
## 12 4 4 2018 08 03 7.6
In some situations, it is better to have the date as a single column:
survey_data2 %>%
unite(date, year, month, day, sep = "-")
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 1 2018-08-01 1
## 2 1 1 2018-08-02 0.2
## 3 NA 1 2018-08-03 0.3
## 4 2 2 2018-08-01 1.4
## 5 2 2 2018-08-02 1.9
## 6 2 2 2018-08-03 4.1
## 7 3 3 2018-08-01 0.1
## 8 3 3 2018-08-02 2.8
## 9 3 3 2018-08-03 8.9
## 10 4 4 2018-08-01 1.7
## 11 NA 4 2018-08-02 1.9
## 12 4 4 2018-08-03 7.6
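By default, unite() drops the columns it combined. If you want to keep year, month and day next to the new date column, set remove = FALSE. A minimal sketch:

```r
library(dplyr)
library(tidyr)
library(tibble)

dates <- tribble(
  ~year,  ~month, ~day,
  "2018", "08",   "01",
  "2018", "08",   "02"
)

# remove = FALSE keeps the original columns alongside the united one
united <- dates %>%
  unite(date, year, month, day, sep = "-", remove = FALSE)

united
```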
Another awful situation is the following:
survey_data_from_hell
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1, var2, var3 1.4, 1.9, 4.1
## 5 3 var1, var2 0.1, 2.8
## 6 3 var3 8.9
## 7 4 var1 1.7
## 8 NA var2 1.9
## 9 4 var3 7.6
separate_rows()
saves the day:
survey_data_from_hell %>%
separate_rows(variable, value)
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
So to summarise… you can go from this:
survey_data_from_hell
## id variable value
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1, var2, var3 1.4, 1.9, 4.1
## 5 3 var1, var2 0.1, 2.8
## 6 3 var3 8.9
## 7 4 var1 1.7
## 8 NA var2 1.9
## 9 4 var3 7.6
to this:
survey_data_clean
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 1 2018-08-01 1
## 2 1 1 2018-08-02 0.2
## 3 NA 1 2018-08-03 0.3
## 4 2 2 2018-08-01 1.4
## 5 2 2 2018-08-02 1.9
## 6 2 2 2018-08-03 4.1
## 7 3 3 2018-08-01 0.1
## 8 3 3 2018-08-02 2.8
## 9 3 3 2018-08-03 8.9
## 10 4 4 2018-08-01 1.7
## 11 NA 4 2018-08-02 1.9
## 12 4 4 2018-08-03 7.6
quite easily:
survey_data_from_hell %>%
separate_rows(variable, value, convert = TRUE) %>%
fill(id, .direction = "down") %>%
mutate(date = rep(full_seq(c(as.Date("2018-08-01"), as.Date("2018-08-03")), 1), 4))
4.4.5 A sneak peek at Chapter 8: nest()
Let’s take a look at our clean survey data:
survey_data_clean
## # A tibble: 12 x 4
## id variable date value
## <dbl> <chr> <chr> <dbl>
## 1 1 1 2018-08-01 1
## 2 1 1 2018-08-02 0.2
## 3 NA 1 2018-08-03 0.3
## 4 2 2 2018-08-01 1.4
## 5 2 2 2018-08-02 1.9
## 6 2 2 2018-08-03 4.1
## 7 3 3 2018-08-01 0.1
## 8 3 3 2018-08-02 2.8
## 9 3 3 2018-08-03 8.9
## 10 4 4 2018-08-01 1.7
## 11 NA 4 2018-08-02 1.9
## 12 4 4 2018-08-03 7.6
{tidyr}
has another very useful function, called nest()
:
survey_data_clean %>%
nest(-id)
## # A tibble: 5 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [2 × 3]>
## 2 NA <tibble [2 × 3]>
## 3 2 <tibble [3 × 3]>
## 4 3 <tibble [3 × 3]>
## 5 4 <tibble [2 × 3]>
You can achieve the same result by using group_by()
first:
survey_data_clean %>%
group_by(id) %>%
nest()
## # A tibble: 5 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [2 × 3]>
## 2 NA <tibble [2 × 3]>
## 3 2 <tibble [3 × 3]>
## 4 3 <tibble [3 × 3]>
## 5 4 <tibble [2 × 3]>
This creates a new tibble object with two columns: the id column, and a new one called
data. Let’s take a look at the first element of this column:
nested_survey_data <- survey_data_clean %>%
group_by(id) %>%
nest()
nested_survey_data$data[1]
## [[1]]
## # A tibble: 2 x 3
## variable date value
## <chr> <chr> <dbl>
## 1 1 2018-08-01 1
## 2 1 2018-08-02 0.2
As you can see, the first element of the data column is also a tibble! You may wonder why this is useful;
I will give you a small taste of what’s waiting in Chapter 9. Imagine that you want to create a
barplot for each id. For instance, for the first id (we are going to learn about making plots
with {ggplot2} in the next chapter; for now, just follow along, even if you don’t understand
everything I write):
survey_data_id1 <- survey_data_clean %>%
filter(id == 1)
Now, let’s create the plot:
ggplot(data = survey_data_id1) +
geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
ggtitle("id 1")
Ok great. But now I want to create this plot for each id
… so I have to copy paste this 3 times.
But copy-pasting is error prone. So there are two alternatives; I either write a function that
takes as argument the data and the id
I want to plot and run it for each id
(we will learn to
do this in Chapter 8), or I can use tidyr::nest()
, combined with purrr::map()
. {purrr}
is
another very useful {tidyverse}
package, and we are going to learn about it in Chapter 9. Again,
just follow along for now:
my_plots <- nested_survey_data %>%
mutate(plot = map2(.x = id,
.y = data,
~ggplot(data = .y) +
geom_bar(aes(y = value, x = date, fill = variable), stat = "identity") +
ggtitle(paste0("id", .x))))
This is some very advanced stuff, and again, do not worry if you don’t understand everything now.
We are going to learn about this in detail in Chapter 9. Let’s go through each line.
In the first line, I have started from my clean data nested_survey_data
and then, using the %>%
and mutate()
I create a new column called plot
. Inside the mutate()
function, I called map2
.
map2
is a {purrr}
function that takes three inputs: .x
, .y
and a function. .x
is the id
column from my data, and .y
is the data
column from nested_survey_data
. The function is the
ggplot I created before. Think of map2()
as a loop over two lists, all while applying a function.
The following illustration, taken from Vaudor (2018), illustrates this perfectly:
Two inputs go in at the same time, the factory, your function, does what it has to do, and an
output comes out. Forget about the factory’s chimney, which represents walk2()
for now. We’ll
learn about it in due time. By the way, you really should read Lise’s blog, her posts are really
great and informative. It’s in French, but that’s not a problem, you know how to read R code, right?
Here’s a link to her blog.
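To get a feel for map2() outside of the nested-data setting, here is a minimal sketch on two simple atomic vectors (nothing here is specific to our survey data):

```r
library(purrr)

# map2() walks over .x and .y in parallel and applies
# the function to each pair of elements
map2(.x = c(1, 2, 3),
     .y = c(10, 20, 30),
     ~ .x + .y)
```

The result is a list containing 11, 22 and 33; if you want an atomic vector back instead of a list, map2_dbl() does the same thing but returns a double vector.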
Now, let’s take a look at the result:
my_plots
## # A tibble: 5 x 3
## id data plot
## <dbl> <list> <list>
## 1 1 <tibble [2 × 3]> <S3: gg>
## 2 NA <tibble [2 × 3]> <S3: gg>
## 3 2 <tibble [3 × 3]> <S3: gg>
## 4 3 <tibble [3 × 3]> <S3: gg>
## 5 4 <tibble [2 × 3]> <S3: gg>
my_plots
is a tibble with three columns: id
, data
and… plot
! plot
is a very interesting
column. It is a list, where each element is of type S3: gg
. Yes, you guessed it, each element of
that list-column is a ggplot! If you now want to take a look at the plots, it is enough to do this:
my_plots$plot
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
Ok, that was quite complicated, but again, this was only to introduce nest()
and give you a taste
of the power of the {tidyverse}
. By the end of Chapter 9, you’ll be used to this.
That was it for a first introduction to {tidyr}
. Now, it’s time to learn about scoped verbs.
4.5 Scoped {tidyverse}
verbs
Scoped verbs are special versions of the verbs you are now familiar with.
4.5.1 filter_*()
Let’s go back to the gasoline
data from the {plm}
package.
filter()
is not the only filtering verb there is. Suppose that we have a condition that we want
to apply to a lot of columns at once. For example, for every column that is of type
numeric
, keep only the lines where the condition value > -8 is satisfied. The next line does
that:
gasoline %>% filter_if( ~all(is.numeric(.)), all_vars(. > -8))
## # A tibble: 30 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CANADA 1972 4.89 -5.44 -1.10 -7.99
## 2 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 3 CANADA 1974 4.89 -5.42 -1.12 -7.90
## 4 CANADA 1975 4.89 -5.38 -1.19 -7.87
## 5 CANADA 1976 4.84 -5.36 -1.06 -7.81
## 6 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 7 CANADA 1978 4.86 -5.31 -1.07 -7.79
## 8 GERMANY 1978 3.88 -5.56 -0.628 -7.95
## 9 SWEDEN 1975 3.97 -7.68 -2.77 -7.99
## 10 SWEDEN 1976 3.98 -7.67 -2.82 -7.96
## # … with 20 more rows
This is a bit more complicated than before, but let’s go through it, step by step.
filter_if()
needs 3 arguments to work; the data, a
predicate function (a function that returns TRUE
, or FALSE
) which will select the columns we
want to work on, and then the condition. The condition can be applied to all the columns that
were selected by the predicate function (hence the all_vars()
) or only to at least one (you’d use
any_vars()
then). Try to change the condition, or the predicate function, to figure out how
filter_if()
works. The dot is a placeholder that stands for whatever columns were selected.
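For instance, replacing all_vars() with any_vars() keeps the lines where at least one of the selected columns satisfies the condition; here is a sketch you can compare with the output above:

```r
# keep lines where at least one numeric column is > -8,
# instead of requiring all of them to be
gasoline %>%
  filter_if(~all(is.numeric(.)), any_vars(. > -8))
```

This should return many more rows than the all_vars() version.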
Another scoped filter()
verb is filter_at()
; it allows the user to choose the columns to filter on, by name or by position:
gasoline %>% filter_at(vars(ends_with("p")), all_vars(. > -8))
## # A tibble: 30 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CANADA 1972 4.89 -5.44 -1.10 -7.99
## 2 CANADA 1973 4.90 -5.41 -1.13 -7.94
## 3 CANADA 1974 4.89 -5.42 -1.12 -7.90
## 4 CANADA 1975 4.89 -5.38 -1.19 -7.87
## 5 CANADA 1976 4.84 -5.36 -1.06 -7.81
## 6 CANADA 1977 4.81 -5.34 -1.07 -7.77
## 7 CANADA 1978 4.86 -5.31 -1.07 -7.79
## 8 GERMANY 1978 3.88 -5.56 -0.628 -7.95
## 9 SWEDEN 1975 3.97 -7.68 -2.77 -7.99
## 10 SWEDEN 1976 3.98 -7.67 -2.82 -7.96
## # … with 20 more rows
We already know about ends_with()
and starts_with()
. So the above line means “for the columns
whose names end with a ‘p’, only keep the lines where, for all the selected columns, the values are
strictly greater than -8
”. Again, this is not very easy the first time you deal with that, so
play around with it for a bit.
Finally, there is also filter_all()
which, as the name implies, uses all the variables for
the filtering step.
filter_if()
and filter_at()
are very useful when you have very large datasets with a lot of
variables and you want to apply a filtering function only to a subset of them. filter_all()
is
useful if, for example, you only want to keep the positive values for all the columns.
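Because filter_all() applies the condition to every column, it only makes sense when every column can be compared with it; here is a sketch where we first drop the country factor by keeping only the numeric columns:

```r
# country is a factor, so comparing it to -8 would not make
# sense; keep the numeric columns first with select_if()
gasoline %>%
  select_if(is.numeric) %>%
  filter_all(all_vars(. > -8))
```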
4.5.2 select_*() and rename_*()
Just as with filter()
, select()
also has scoped versions. select_if()
makes it easy to
select columns that satisfy a criterion:
gasoline %>% select_if(is.numeric)
## # A tibble: 342 x 5
## year lgaspcar lincomep lrpmg lcarpcap
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1960 4.17 -6.47 -0.335 -9.77
## 2 1961 4.10 -6.43 -0.351 -9.61
## 3 1962 4.07 -6.41 -0.380 -9.46
## 4 1963 4.06 -6.37 -0.414 -9.34
## 5 1964 4.04 -6.32 -0.445 -9.24
## 6 1965 4.03 -6.29 -0.497 -9.12
## 7 1966 4.05 -6.25 -0.467 -9.02
## 8 1967 4.05 -6.23 -0.506 -8.93
## 9 1968 4.05 -6.21 -0.522 -8.85
## 10 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
You can even pass a further function to select_if()
that will be applied to the selected columns:
gasoline %>% select_if(is.numeric, toupper)
## # A tibble: 342 x 5
## YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1960 4.17 -6.47 -0.335 -9.77
## 2 1961 4.10 -6.43 -0.351 -9.61
## 3 1962 4.07 -6.41 -0.380 -9.46
## 4 1963 4.06 -6.37 -0.414 -9.34
## 5 1964 4.04 -6.32 -0.445 -9.24
## 6 1965 4.03 -6.29 -0.497 -9.12
## 7 1966 4.05 -6.25 -0.467 -9.02
## 8 1967 4.05 -6.23 -0.506 -8.93
## 9 1968 4.05 -6.21 -0.522 -8.85
## 10 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
select_at()
makes it easy to select all the variables that start with “l” for example:
gasoline %>% select_at(vars(starts_with("l")))
## # A tibble: 342 x 4
## lgaspcar lincomep lrpmg lcarpcap
## <dbl> <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -0.335 -9.77
## 2 4.10 -6.43 -0.351 -9.61
## 3 4.07 -6.41 -0.380 -9.46
## 4 4.06 -6.37 -0.414 -9.34
## 5 4.04 -6.32 -0.445 -9.24
## 6 4.03 -6.29 -0.497 -9.12
## 7 4.05 -6.25 -0.467 -9.02
## 8 4.05 -6.23 -0.506 -8.93
## 9 4.05 -6.21 -0.522 -8.85
## 10 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
select_at()
also works if you specify the position of the columns you’re interested in:
gasoline %>% select_at(vars(c(1,2,5)))
## # A tibble: 342 x 3
## country year lrpmg
## <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # … with 332 more rows
Notice that I use a new helper function, vars()
, which allows me to specify which variables I want
to select. This is similar to the helper functions we have seen before, all_vars()
and any_vars()
. So the way to read the above line would be something like “Start with the
gasoline data, then select the variables at position 1, 2, 5”. Knowing how to read your code makes
it even easier to write.
There is also a select_all()
, and you might be wondering why it is useful:
gasoline %>% select_all()
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
This simply returns all the columns of the data frame… but wait, select_all()
allows you to do
something very useful, which is rename all the columns in bulk:
gasoline %>% select_all(toupper)
## # A tibble: 342 x 6
## COUNTRY YEAR LGASPCAR LINCOMEP LRPMG LCARPCAP
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
You can apply any arbitrary function:
gasoline %>% select_all(funs(paste0("hello_", .)))
## # A tibble: 342 x 6
## hello_country hello_year hello_lgaspcar hello_lincomep hello_lrpmg
## <fct> <int> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335
## 2 AUSTRIA 1961 4.10 -6.43 -0.351
## 3 AUSTRIA 1962 4.07 -6.41 -0.380
## 4 AUSTRIA 1963 4.06 -6.37 -0.414
## 5 AUSTRIA 1964 4.04 -6.32 -0.445
## 6 AUSTRIA 1965 4.03 -6.29 -0.497
## 7 AUSTRIA 1966 4.05 -6.25 -0.467
## 8 AUSTRIA 1967 4.05 -6.23 -0.506
## 9 AUSTRIA 1968 4.05 -6.21 -0.522
## 10 AUSTRIA 1969 4.05 -6.15 -0.559
## # … with 332 more rows, and 1 more variable: hello_lcarpcap <dbl>
I have passed the paste0()
function to the funs()
helper function of select_all()
. funs()
is
used a lot with scoped verbs, just like vars()
, which we have already seen.
The scoped versions of rename()
work just as you expect. But I’ll leave these for the exercises
down below.
4.5.3 group_by_*()
To illustrate group_by_*()
I have to first modify the gasoline
data a little bit. As you can
see below, the year column is of type integer:
gasoline
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # … with 332 more rows
Let’s change that to character:
gasoline <- gasoline %>%
mutate(year = as.character(year))
Now, this allows me to do the following:
gasoline %>%
group_by_if(is.character) %>%
summarise(mean(lincomep))
## # A tibble: 19 x 2
## year `mean(lincomep)`
## <chr> <dbl>
## 1 1960 -6.50
## 2 1961 -6.46
## 3 1962 -6.42
## 4 1963 -6.37
## 5 1964 -6.33
## 6 1965 -6.29
## 7 1966 -6.25
## 8 1967 -6.21
## 9 1968 -6.18
## 10 1969 -6.12
## 11 1970 -6.07
## 12 1971 -6.04
## 13 1972 -5.99
## 14 1973 -5.94
## 15 1974 -5.93
## 16 1975 -5.93
## 17 1976 -5.89
## 18 1977 -5.87
## 19 1978 -5.84
This is faster than having to write:
gasoline %>%
group_by(country, year) %>%
summarise(mean(lincomep))
## # A tibble: 342 x 3
## # Groups: country [18]
## country year `mean(lincomep)`
## <fct> <chr> <dbl>
## 1 AUSTRIA 1960 -6.47
## 2 AUSTRIA 1961 -6.43
## 3 AUSTRIA 1962 -6.41
## 4 AUSTRIA 1963 -6.37
## 5 AUSTRIA 1964 -6.32
## 6 AUSTRIA 1965 -6.29
## 7 AUSTRIA 1966 -6.25
## 8 AUSTRIA 1967 -6.23
## 9 AUSTRIA 1968 -6.21
## 10 AUSTRIA 1969 -6.15
## # … with 332 more rows
You may think that having to write the name of two variables is not a huge hassle, which is true.
But imagine that you have dozens of character columns that you want to group by. This is the kind
of situation where group_by_if()
is very useful. If you prefer, you can specify the position of
the columns instead of the name of the columns, with group_by_at()
:
gasoline %>%
group_by_at(vars(1, 2)) %>%
summarise(mean(lincomep))
## # A tibble: 342 x 3
## # Groups: country [18]
## country year `mean(lincomep)`
## <fct> <chr> <dbl>
## 1 AUSTRIA 1960 -6.47
## 2 AUSTRIA 1961 -6.43
## 3 AUSTRIA 1962 -6.41
## 4 AUSTRIA 1963 -6.37
## 5 AUSTRIA 1964 -6.32
## 6 AUSTRIA 1965 -6.29
## 7 AUSTRIA 1966 -6.25
## 8 AUSTRIA 1967 -6.23
## 9 AUSTRIA 1968 -6.21
## 10 AUSTRIA 1969 -6.15
## # … with 332 more rows
There is also a group_by_all()
, but I fail to see the use case…
4.5.4 summarise_*()
Just like for filter()
, select()
, and group_by()
, summarise() comes with scoped versions:
gasoline %>%
group_by(country) %>%
summarise_at(vars(starts_with("l")), mean)
## # A tibble: 18 x 5
## country lgaspcar lincomep lrpmg lcarpcap
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63
## 3 CANADA 4.86 -5.58 -1.05 -8.08
## 4 DENMARK 4.19 -5.76 -0.358 -8.58
## 5 FRANCE 3.82 -5.87 -0.253 -8.45
## 6 GERMANY 3.89 -5.85 -0.517 -8.51
## 7 GREECE 4.88 -6.61 -0.0339 -10.8
## 8 IRELAND 4.23 -6.44 -0.348 -9.04
## 9 ITALY 3.73 -6.35 -0.152 -8.83
## 10 JAPAN 4.70 -6.25 -0.287 -9.95
## 11 NETHERLA 4.08 -5.92 -0.370 -8.82
## 12 NORWAY 4.11 -5.75 -0.278 -8.77
## 13 SPAIN 4.06 -5.63 0.739 -9.90
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25
## 15 SWITZERL 4.24 -5.93 -0.902 -8.54
## 16 TURKEY 5.77 -7.34 -0.422 -12.5
## 17 U.K. 3.98 -6.02 -0.459 -8.55
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78
See how I managed to summarise every variable in one simple call to summarise_at()
? Simply by
using vars()
and specifying that I was interested in the ones that started with “l” and then I
specified the function I wanted. But what if I wanted to use more than one function to summarise
the data? Very easy:
gasoline %>%
group_by(country) %>%
summarise_at(vars(starts_with("l")), funs(mean, sd, max, min))
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # … with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_max <dbl>, lincomep_max <dbl>,
## # lrpmg_max <dbl>, lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## # lincomep_min <dbl>, lrpmg_min <dbl>, lcarpcap_min <dbl>
Just use vars()
to specify the variables you want to summarise, and then funs()
to list all the
functions you want to use on each of the columns.
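If you want to control the suffixes of the generated columns, you can name the functions inside funs(); this is a sketch, and the names avg and stdev are of course arbitrary:

```r
# named functions inside funs() become the column suffixes,
# e.g. lgaspcar_avg and lgaspcar_stdev
gasoline %>%
  group_by(country) %>%
  summarise_at(vars(starts_with("l")), funs(avg = mean, stdev = sd))
```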
But maybe you’re just interested in descriptive statistics for some variables, but not all those
that start with “l”? What if you want to use another pattern? Easy to do with the contains()
helper:
gasoline %>%
group_by(country) %>%
summarise_at(vars(dplyr::contains("car")), funs(mean, sd, max, min))
## # A tibble: 18 x 9
## country lgaspcar_mean lcarpcap_mean lgaspcar_sd lcarpcap_sd lgaspcar_max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -8.85 0.0693 0.473 4.20
## 2 BELGIUM 3.92 -8.63 0.103 0.417 4.16
## 3 CANADA 4.86 -8.08 0.0262 0.195 4.90
## 4 DENMARK 4.19 -8.58 0.158 0.349 4.50
## 5 FRANCE 3.82 -8.45 0.0499 0.344 3.91
## 6 GERMANY 3.89 -8.51 0.0239 0.406 3.93
## 7 GREECE 4.88 -10.8 0.255 0.839 5.38
## 8 IRELAND 4.23 -9.04 0.0437 0.345 4.33
## 9 ITALY 3.73 -8.83 0.220 0.639 4.05
## 10 JAPAN 4.70 -9.95 0.684 1.20 6.00
## 11 NETHER… 4.08 -8.82 0.286 0.617 4.65
## 12 NORWAY 4.11 -8.77 0.123 0.438 4.44
## 13 SPAIN 4.06 -9.90 0.317 0.960 4.75
## 14 SWEDEN 4.01 -8.25 0.0364 0.242 4.07
## 15 SWITZE… 4.24 -8.54 0.102 0.378 4.44
## 16 TURKEY 5.77 -12.5 0.329 0.751 6.16
## 17 U.K. 3.98 -8.55 0.0479 0.281 4.10
## 18 U.S.A. 4.82 -7.78 0.0219 0.162 4.86
## # … with 3 more variables: lcarpcap_max <dbl>, lgaspcar_min <dbl>,
## # lcarpcap_min <dbl>
I used dplyr::contains()
instead of simply contains()
because there’s also a
purrr::contains()
. If you load purrr
after dplyr
, contains()
will actually be
purrr::contains()
and not dplyr::contains()
which causes the above code to fail.
There’s also summarise_if()
:
gasoline %>%
group_by(country) %>%
summarise_if(is.double, funs(mean, sd, min, max))
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # … with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## # lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## # lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>
This allows you to summarise every column that contains real numbers. The difference between
is.double()
and is.numeric()
is that is.numeric()
returns TRUE
for integers too, whereas
is.double()
returns TRUE
for real numbers only (integers are real numbers too, but you know
what I mean). To go faster, you can also use summarise_all()
:
gasoline %>%
select(-year) %>%
group_by(country) %>%
summarise_all(funs(mean, sd, min, max))
## # A tibble: 18 x 17
## country lgaspcar_mean lincomep_mean lrpmg_mean lcarpcap_mean lgaspcar_sd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 4.06 -6.12 -0.486 -8.85 0.0693
## 2 BELGIUM 3.92 -5.85 -0.326 -8.63 0.103
## 3 CANADA 4.86 -5.58 -1.05 -8.08 0.0262
## 4 DENMARK 4.19 -5.76 -0.358 -8.58 0.158
## 5 FRANCE 3.82 -5.87 -0.253 -8.45 0.0499
## 6 GERMANY 3.89 -5.85 -0.517 -8.51 0.0239
## 7 GREECE 4.88 -6.61 -0.0339 -10.8 0.255
## 8 IRELAND 4.23 -6.44 -0.348 -9.04 0.0437
## 9 ITALY 3.73 -6.35 -0.152 -8.83 0.220
## 10 JAPAN 4.70 -6.25 -0.287 -9.95 0.684
## 11 NETHER… 4.08 -5.92 -0.370 -8.82 0.286
## 12 NORWAY 4.11 -5.75 -0.278 -8.77 0.123
## 13 SPAIN 4.06 -5.63 0.739 -9.90 0.317
## 14 SWEDEN 4.01 -7.82 -2.71 -8.25 0.0364
## 15 SWITZE… 4.24 -5.93 -0.902 -8.54 0.102
## 16 TURKEY 5.77 -7.34 -0.422 -12.5 0.329
## 17 U.K. 3.98 -6.02 -0.459 -8.55 0.0479
## 18 U.S.A. 4.82 -5.45 -1.21 -7.78 0.0219
## # … with 11 more variables: lincomep_sd <dbl>, lrpmg_sd <dbl>,
## # lcarpcap_sd <dbl>, lgaspcar_min <dbl>, lincomep_min <dbl>,
## # lrpmg_min <dbl>, lcarpcap_min <dbl>, lgaspcar_max <dbl>,
## # lincomep_max <dbl>, lrpmg_max <dbl>, lcarpcap_max <dbl>
I removed the year
variable because it’s not a variable for which we want to have descriptive
statistics.
4.5.5 mutate_*()
Of course, mutate()
and transmute()
also come with scoped versions:
gasoline %>%
mutate_if(is.double, exp)
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 64.9 0.00154 0.716 0.0000573
## 2 AUSTRIA 1961 60.4 0.00162 0.704 0.0000671
## 3 AUSTRIA 1962 58.7 0.00165 0.684 0.0000781
## 4 AUSTRIA 1963 57.9 0.00171 0.661 0.0000876
## 5 AUSTRIA 1964 56.7 0.00180 0.641 0.0000973
## 6 AUSTRIA 1965 56.5 0.00185 0.608 0.000109
## 7 AUSTRIA 1966 57.3 0.00193 0.627 0.000121
## 8 AUSTRIA 1967 57.6 0.00196 0.603 0.000132
## 9 AUSTRIA 1968 57.1 0.00202 0.593 0.000144
## 10 AUSTRIA 1969 57.2 0.00213 0.572 0.000152
## # … with 332 more rows
gasoline %>%
mutate_at(vars(starts_with("l")), exp)
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 64.9 0.00154 0.716 0.0000573
## 2 AUSTRIA 1961 60.4 0.00162 0.704 0.0000671
## 3 AUSTRIA 1962 58.7 0.00165 0.684 0.0000781
## 4 AUSTRIA 1963 57.9 0.00171 0.661 0.0000876
## 5 AUSTRIA 1964 56.7 0.00180 0.641 0.0000973
## 6 AUSTRIA 1965 56.5 0.00185 0.608 0.000109
## 7 AUSTRIA 1966 57.3 0.00193 0.627 0.000121
## 8 AUSTRIA 1967 57.6 0.00196 0.603 0.000132
## 9 AUSTRIA 1968 57.1 0.00202 0.593 0.000144
## 10 AUSTRIA 1969 57.2 0.00213 0.572 0.000152
## # … with 332 more rows
gasoline %>%
mutate_all(as.character)
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AUSTRIA 1960 4.173244195 -6.474277179 -0.334547613 -9.766839569
## 2 AUSTRIA 1961 4.1009891049 -6.426005835 -0.351327614 -9.608621845
## 3 AUSTRIA 1962 4.0731765511 -6.407308295 -0.379517692 -9.457256552
## 4 AUSTRIA 1963 4.0595091239 -6.370678539 -0.414251392 -9.343154947
## 5 AUSTRIA 1964 4.037688787 -6.322246805 -0.445335362 -9.237739346
## 6 AUSTRIA 1965 4.033983285 -6.294667914 -0.497060662 -9.123903477
## 7 AUSTRIA 1966 4.0475365589 -6.252545451 -0.466837731 -9.019822048
## 8 AUSTRIA 1967 4.0529106939 -6.234580709 -0.505883405 -8.934402537
## 9 AUSTRIA 1968 4.045507048 -6.206894403 -0.522412545 -8.847967407
## 10 AUSTRIA 1969 4.0463547891 -6.153139668 -0.559110514 -8.788686207
## # … with 332 more rows
I think that by now, you are able to understand the above lines quite easily. If not, try some other functions, or mutating other variables and you’ll see that it is not that complicated!
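For instance, if you would rather keep the original columns and add the transformed ones next to them, naming the function inside funs() works with mutate_at() just like it did with summarise_at(); a sketch:

```r
# a named function creates new columns (lgaspcar_exp, etc.)
# instead of overwriting the originals
gasoline %>%
  mutate_at(vars(starts_with("l")), funs(exp = exp))
```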
4.6 Other useful {tidyverse}
functions
4.6.1 if_else()
, case_when()
and recode()
Some other very useful {tidyverse}
functions are if_else()
and case_when
. These two
functions, combined with mutate()
make it easy to create a new variable whose values must
respect certain conditions. For instance, we might want to have a dummy that equals 1
if a country
is in the European Union (to simplify, say as of 2017) and 0
if not. First let’s create a list of
countries that are in the EU:
eu_countries <- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus",
"czech republic", "denmark", "estonia", "finland", "france", "germany",
"greece", "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg",
"malta", "netherla", "poland", "portugal", "romania", "slovakia", "slovenia",
"spain", "sweden", "u.k.")
I’ve had to change “netherlands” to “netherla” because that’s how the country is called in the
gasoline
data. Now let’s create a dummy variable that equals 1
for EU countries, and 0
for the others:
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(in_eu = if_else(country %in% eu_countries, 1, 0))
## # A tibble: 342 x 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 1
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 1
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 1
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 1
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 1
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 1
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 1
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 1
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 1
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 1
## # … with 332 more rows
Instead of 1
and 0
, we can of course use strings (I add filter(year == 1960)
at the end to
have a better view of what happened):
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(in_eu = if_else(country %in% eu_countries, "yes", "no")) %>%
filter(year == 1960)
## # A tibble: 18 x 7
## country year lgaspcar lincomep lrpmg lcarpcap in_eu
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 yes
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 yes
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 no
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 yes
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 yes
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 yes
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 yes
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 yes
## 9 italy 1960 4.05 -6.73 0.165 -10.1 yes
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 no
## 11 netherla 1960 4.65 -6.22 -0.201 -10.00 yes
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 no
## 13 spain 1960 4.75 -6.17 1.13 -11.6 yes
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 yes
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 no
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 no
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 yes
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 no
I think that if_else()
is fairly straightforward, especially if you know ifelse()
already. You
might be wondering what the difference between these two is. if_else()
is stricter than
ifelse()
and does not do type conversion. Compare the two next lines:
ifelse(1 == 1, "0", 1)
## [1] "0"
if_else(1 == 1, "0", 1)
Error: `false` must be type string, not double
Type conversion, especially without a warning, is very dangerous. if_else()
’s behaviour, which
consists in failing as soon as possible, avoids a lot of pain and suffering, especially when
programming non-interactively.
if_else()
also accepts an optional argument, that allows you to specify what should be returned
in case of NA
:
if_else(1 <= NA, 0, 1, 999)
## [1] 999
# Or
if_else(1 <= NA, 0, 1, NA_real_)
## [1] NA
case_when()
can be seen as a generalization of if_else()
. Whenever you want to use multiple
if_else()
s, that’s when you know you should use case_when()
(I’m adding the filter at the end
for the same reason as before, to see the output better):
gasoline %>%
mutate(country = tolower(country)) %>%
mutate(region = case_when(
country %in% c("france", "italy", "turkey", "greece", "spain") ~ "mediterranean",
country %in% c("germany", "austria", "switzerl", "belgium", "netherla") ~ "central europe",
country %in% c("canada", "u.s.a.", "u.k.", "ireland") ~ "anglosphere",
country %in% c("denmark", "norway", "sweden") ~ "nordic",
country %in% c("japan") ~ "asia")) %>%
filter(year == 1960)
## # A tibble: 18 x 7
## country year lgaspcar lincomep lrpmg lcarpcap region
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 central europe
## 2 belgium 1960 4.16 -6.22 -0.166 -9.41 central europe
## 3 canada 1960 4.86 -5.89 -0.972 -8.38 anglosphere
## 4 denmark 1960 4.50 -6.06 -0.196 -9.33 nordic
## 5 france 1960 3.91 -6.26 -0.0196 -9.15 mediterranean
## 6 germany 1960 3.92 -6.16 -0.186 -9.34 central europe
## 7 greece 1960 5.04 -7.16 -0.0835 -12.2 mediterranean
## 8 ireland 1960 4.27 -6.72 -0.0765 -9.70 anglosphere
## 9 italy 1960 4.05 -6.73 0.165 -10.1 mediterranean
## 10 japan 1960 6.00 -6.99 -0.145 -12.2 asia
## 11 netherla 1960 4.65 -6.22 -0.201 -10.00 central europe
## 12 norway 1960 4.44 -6.09 -0.140 -9.68 nordic
## 13 spain 1960 4.75 -6.17 1.13 -11.6 mediterranean
## 14 sweden 1960 4.06 -8.07 -2.52 -8.74 nordic
## 15 switzerl 1960 4.40 -6.16 -0.823 -9.26 central europe
## 16 turkey 1960 6.13 -7.80 -0.253 -13.5 mediterranean
## 17 u.k. 1960 4.10 -6.19 -0.391 -9.12 anglosphere
## 18 u.s.a. 1960 4.82 -5.70 -1.12 -8.02 anglosphere
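A handy trick with case_when() is to finish with TRUE ~ ... as a catch-all: the conditions are evaluated in order, so any row that matched none of the previous ones falls into the last. A sketch, reusing the idea from above:

```r
gasoline %>%
  mutate(country = tolower(country)) %>%
  mutate(region = case_when(
    country %in% c("denmark", "norway", "sweden") ~ "nordic",
    TRUE ~ "other")) %>%  # everything else gets "other"
  filter(year == 1960)
```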
If all you want is to recode values, you can use recode()
. For example, the Netherlands is
written as “NETHERLA” in the gasoline
data, which is quite ugly. Same for Switzerland:
gasoline <- gasoline %>%
mutate(country = tolower(country)) %>%
mutate(country = recode(country, "netherla" = "netherlands", "switzerl" = "switzerland"))
I saved the data with these changes as they will become useful in the future. Let’s take a look at the data:
gasoline %>%
filter(country %in% c("netherlands", "switzerland"), year == 1960)
## # A tibble: 2 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 netherlands 1960 4.65 -6.22 -0.201 -10.00
## 2 switzerland 1960 4.40 -6.16 -0.823 -9.26
4.6.2 lead()
and lag()
lead()
and lag()
are especially useful in econometrics. When I was doing my masters, in 4 B.d.
(Before dplyr) lagging variables in panel data was quite tricky. Now, with dplyr
it’s really
very easy:
gasoline %>%
group_by(country) %>%
mutate(lag_lgaspcar = lag(lgaspcar)) %>%
mutate(lead_lgaspcar = lead(lgaspcar)) %>%
filter(year %in% seq(1960, 1963))
## # A tibble: 72 x 8
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap lag_lgaspcar
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 NA
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 4.17
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 4.10
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 4.07
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41 NA
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30 4.16
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22 4.12
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11 4.08
## 9 canada 1960 4.86 -5.89 -0.972 -8.38 NA
## 10 canada 1961 4.83 -5.88 -0.972 -8.35 4.86
## # … with 62 more rows, and 1 more variable: lead_lgaspcar <dbl>
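lag() and lead() also take an n argument if you need to shift by more than one period; a sketch:

```r
# lag by two years instead of one
gasoline %>%
  group_by(country) %>%
  mutate(lag2_lgaspcar = lag(lgaspcar, n = 2)) %>%
  filter(year %in% seq(1960, 1963))
```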
To lag every variable, remember that you can use mutate_if()
:
gasoline %>%
group_by(country) %>%
mutate_if(is.double, lag) %>%
filter(year %in% seq(1960, 1963))
## `mutate_if()` ignored the following grouping variables:
## Column `country`
## # A tibble: 72 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1961 4.83 -5.88 -0.972 -8.35
## # … with 62 more rows
You can replace lag()
with lead()
, but just keep in mind that the columns get transformed in
place.
4.6.3 ntile()
The last helper function I will discuss is ntile()
. There are some others, so do read mutate()
’s
documentation with help(mutate)
!
If you need quantiles, you need ntile()
. Let’s see how it works:
gasoline %>%
mutate(quintile = ntile(lgaspcar, 5)) %>%
mutate(decile = ntile(lgaspcar, 10)) %>%
select(country, year, lgaspcar, quintile, decile)
## # A tibble: 342 x 5
## country year lgaspcar quintile decile
## <chr> <dbl> <dbl> <int> <int>
## 1 austria 1960 4.17 3 6
## 2 austria 1961 4.10 3 6
## 3 austria 1962 4.07 3 5
## 4 austria 1963 4.06 3 5
## 5 austria 1964 4.04 3 5
## 6 austria 1965 4.03 3 5
## 7 austria 1966 4.05 3 5
## 8 austria 1967 4.05 3 5
## 9 austria 1968 4.05 3 5
## 10 austria 1969 4.05 3 5
## # … with 332 more rows
quintile
and decile
do not hold the values but the quantile the value lies in. If you want to
have a column that contains the median for instance, you can use good ol’ quantile()
:
gasoline %>%
group_by(country) %>%
mutate(median = quantile(lgaspcar, 0.5)) %>% # quantile(x, 0.5) is equivalent to median(x)
filter(year == 1960) %>%
select(country, year, median)
## # A tibble: 18 x 3
## # Groups: country [18]
## country year median
## <chr> <dbl> <dbl>
## 1 austria 1960 4.05
## 2 belgium 1960 3.88
## 3 canada 1960 4.86
## 4 denmark 1960 4.16
## 5 france 1960 3.81
## 6 germany 1960 3.89
## 7 greece 1960 4.89
## 8 ireland 1960 4.22
## 9 italy 1960 3.74
## 10 japan 1960 4.52
## 11 netherlands 1960 3.99
## 12 norway 1960 4.08
## 13 spain 1960 3.99
## 14 sweden 1960 4.00
## 15 switzerland 1960 4.26
## 16 turkey 1960 5.72
## 17 u.k. 1960 3.98
## 18 u.s.a. 1960 4.81
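quantile() is not limited to the median; passing other probabilities gives you, for instance, the quartiles. A minimal sketch, assuming gasoline is loaded as before:

```r
# Add the first and third quartile of lgaspcar, computed by country.
gasoline %>%
  group_by(country) %>%
  mutate(q25 = quantile(lgaspcar, 0.25),
         q75 = quantile(lgaspcar, 0.75)) %>%
  filter(year == 1960) %>%
  select(country, q25, q75)
```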
4.6.4 arrange()
arrange() re-orders the whole tibble according to the values of the supplied variable:
gasoline %>%
arrange(lgaspcar)
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 italy 1977 3.38 -6.10 0.164 -8.15
## 2 italy 1978 3.39 -6.08 0.0348 -8.11
## 3 italy 1976 3.43 -6.12 0.103 -8.17
## 4 italy 1974 3.50 -6.13 -0.223 -8.26
## 5 italy 1975 3.52 -6.17 -0.0327 -8.22
## 6 spain 1978 3.62 -5.29 0.621 -8.63
## 7 italy 1972 3.63 -6.21 -0.215 -8.38
## 8 italy 1971 3.65 -6.22 -0.148 -8.47
## 9 spain 1977 3.65 -5.30 0.526 -8.73
## 10 italy 1973 3.65 -6.16 -0.325 -8.32
## # … with 332 more rows
If you want to re-order the tibble
in descending order of the variable:
gasoline %>%
arrange(desc(lgaspcar))
## # A tibble: 342 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 turkey 1966 6.16 -7.51 -0.356 -13.0
## 2 turkey 1960 6.13 -7.80 -0.253 -13.5
## 3 turkey 1961 6.11 -7.79 -0.343 -13.4
## 4 turkey 1962 6.08 -7.84 -0.408 -13.2
## 5 turkey 1968 6.08 -7.42 -0.365 -12.8
## 6 turkey 1963 6.08 -7.63 -0.225 -13.3
## 7 turkey 1964 6.06 -7.63 -0.252 -13.2
## 8 turkey 1967 6.04 -7.46 -0.335 -12.8
## 9 japan 1960 6.00 -6.99 -0.145 -12.2
## 10 turkey 1965 5.82 -7.62 -0.293 -12.9
## # … with 332 more rows
arrange()’s documentation alerts the user that re-ordering by group is only possible by explicitly specifying an option:
gasoline %>%
filter(year %in% seq(1960, 1963)) %>%
group_by(country) %>%
arrange(desc(lgaspcar), .by_group = TRUE)
## # A tibble: 72 x 6
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77
## 2 austria 1961 4.10 -6.43 -0.351 -9.61
## 3 austria 1962 4.07 -6.41 -0.380 -9.46
## 4 austria 1963 4.06 -6.37 -0.414 -9.34
## 5 belgium 1960 4.16 -6.22 -0.166 -9.41
## 6 belgium 1961 4.12 -6.18 -0.172 -9.30
## 7 belgium 1962 4.08 -6.13 -0.222 -9.22
## 8 belgium 1963 4.00 -6.09 -0.250 -9.11
## 9 canada 1960 4.86 -5.89 -0.972 -8.38
## 10 canada 1962 4.85 -5.84 -0.979 -8.32
## # … with 62 more rows
This is especially useful for plotting. We’ll see this in Chapter 6.
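arrange() also accepts several variables; ties in the first variable are broken by the second, and so on. A minimal sketch, assuming gasoline is loaded as before:

```r
# Sort by country alphabetically, and within each country by
# descending year.
gasoline %>%
  arrange(country, desc(year))
```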
4.6.5 tally()
and count()
tally()
and count()
count the number of observations in your data. I believe count()
is the
more useful of the two, as it counts the number of observations within a group that you can provide:
gasoline %>%
count(country)
## # A tibble: 18 x 2
## country n
## <chr> <int>
## 1 austria 19
## 2 belgium 19
## 3 canada 19
## 4 denmark 19
## 5 france 19
## 6 germany 19
## 7 greece 19
## 8 ireland 19
## 9 italy 19
## 10 japan 19
## 11 netherlands 19
## 12 norway 19
## 13 spain 19
## 14 sweden 19
## 15 switzerland 19
## 16 turkey 19
## 17 u.k. 19
## 18 u.s.a. 19
There’s also add_count()
which adds the column to the data:
gasoline %>%
add_count(country)
## # A tibble: 342 x 7
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows
add_count()
is a shortcut for the following code:
gasoline %>%
group_by(country) %>%
mutate(n = n())
## # A tibble: 342 x 7
## # Groups: country [18]
## country year lgaspcar lincomep lrpmg lcarpcap n
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 austria 1960 4.17 -6.47 -0.335 -9.77 19
## 2 austria 1961 4.10 -6.43 -0.351 -9.61 19
## 3 austria 1962 4.07 -6.41 -0.380 -9.46 19
## 4 austria 1963 4.06 -6.37 -0.414 -9.34 19
## 5 austria 1964 4.04 -6.32 -0.445 -9.24 19
## 6 austria 1965 4.03 -6.29 -0.497 -9.12 19
## 7 austria 1966 4.05 -6.25 -0.467 -9.02 19
## 8 austria 1967 4.05 -6.23 -0.506 -8.93 19
## 9 austria 1968 4.05 -6.21 -0.522 -8.85 19
## 10 austria 1969 4.05 -6.15 -0.559 -8.79 19
## # … with 332 more rows
where n()
is a dplyr
function that can only be used within summarise()
, mutate()
and
filter()
.
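One common use of n() inside filter() is to keep only groups of a certain size. A minimal sketch, assuming gasoline is loaded as before (here every country has 19 rows, so nothing is dropped; with unbalanced data, small groups would be removed):

```r
# Keep only countries with at least 19 observations.
gasoline %>%
  group_by(country) %>%
  filter(n() >= 19)
```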
4.7 Special packages for special kinds of data: {forcats}
, {lubridate}
, and {stringr}
4.7.1 🐈🐈🐈🐈
Factor variables are very useful but not very easy to manipulate. forcats
contains very useful
functions that make working on factor variables painless. In my opinion, the four following functions, fct_recode()
, fct_relevel()
, fct_reorder()
and fct_relabel()
, are the ones you must
know, so that’s what I’ll be showing.
Remember in Chapter 3 when I very quickly explained what factor variables were? In this section, we are going to work a little bit with this type of variable. factors are very useful, and the
forcats
package includes some handy functions to work with them. First, let’s load the forcats
package:
library(forcats)
As an example, we are going to work with the gss_cat
dataset that is included in forcats
. Let’s
load the data:
data(gss_cat)
head(gss_cat)
## # A tibble: 6 x 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never ma… 26 White $8000 to… Ind,near … Protest… Southe… 12
## 2 2000 Divorced 48 White $8000 to… Not str r… Protest… Baptis… NA
## 3 2000 Widowed 67 White Not appl… Independe… Protest… No den… 2
## 4 2000 Never ma… 39 White Not appl… Ind,near … Orthodo… Not ap… 4
## 5 2000 Divorced 25 White Not appl… Not str d… None Not ap… 1
## 6 2000 Married 25 White $20000 -… Strong de… Protest… Southe… NA
As you can see, marital
, race
, rincome
and partyid
are all factor variables. Let’s take a closer
look at marital
:
str(gss_cat$marital)
## Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
and let’s see rincome
:
str(gss_cat$rincome)
## Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
factor
variables have different levels and the forcats
package includes functions that allow
you to recode, collapse and do all sorts of things on these levels. For example, using
forcats::fct_recode()
you can recode levels:
gss_cat <- gss_cat %>%
mutate(marital = fct_recode(marital,
refuse = "No answer",
never_married = "Never married",
divorced = "Separated",
divorced = "Divorced",
widowed = "Widowed",
married = "Married"))
gss_cat %>%
tabyl(marital)
## marital n percent
## refuse 17 0.0007913234
## never_married 5416 0.2521063166
## divorced 4126 0.1920588372
## widowed 1807 0.0841130196
## married 10117 0.4709305032
Using fct_recode()
, I was able to recode the levels and collapse Separated
and Divorced
to
a single category called divorced
. As you can see, refuse and widowed each account for less than 10% of observations, so maybe you’d want to lump these categories together:
gss_cat <- gss_cat %>%
mutate(marital = fct_lump(marital, prop = 0.10, other_level = "other"))
gss_cat %>%
tabyl(marital)
## marital n percent
## never_married 5416 0.25210632
## divorced 4126 0.19205884
## married 10117 0.47093050
## other 1824 0.08490434
fct_reorder()
is especially useful for plotting. We will explore plotting in the next chapter,
but to show you why fct_reorder()
is so useful, I will create a barplot, first without
using fct_reorder()
to re-order the factors, then with reordering. Do not worry if you don’t
understand all the code for now:
gss_cat %>%
tabyl(marital) %>%
ggplot() +
geom_col(aes(y = n, x = marital)) +
coord_flip()
It would be much better if the categories were ordered by frequency. This is easy to do with
fct_reorder()
:
gss_cat %>%
tabyl(marital) %>%
mutate(marital = fct_reorder(marital, n, .desc = FALSE)) %>%
ggplot() +
geom_col(aes(y = n, x = marital)) +
coord_flip()
Much better! In Chapter 6, we are going to learn about {ggplot2}
.
{forcats}
contains other very useful functions, so I urge you to go through the documentation.
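One of the four functions I named above, fct_relevel(), was not shown; as a minimal sketch (assuming gss_cat was recoded as above), it moves levels by hand rather than by frequency, which is handy for setting a reference category in models:

```r
# Make "married" the first (reference) level of the factor.
gss_cat %>%
  mutate(marital = fct_relevel(marital, "married")) %>%
  pull(marital) %>%
  levels()
```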
4.7.2 Get your dates right with {lubridate}
{lubridate} is yet another tidyverse package that makes dealing with dates or durations (and intervals) as painless as possible. I do not use every function contained in the package daily, and as such will only focus on some of the functions. However, if you have to deal with dates often, you might want to explore the package thoroughly.
4.7.2.1 Defining dates, the tidy way
Let’s load a new dataset, called independence, from the datasets folder:
independence <- readRDS("datasets/independence.rds")
This dataset was scraped from the following Wikipedia page. It shows when African countries gained independence from which colonial powers. In Chapter 11, I will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents of the dataset:
independence
## # A tibble: 54 x 6
## country colonial_name colonial_power independence_da… first_head_of_s…
## <chr> <chr> <chr> <chr> <chr>
## 1 Liberia Liberia United States 26 July 1847 Joseph Jenkins …
## 2 South … Cape Colony … United Kingdom 31 May 1910 Louis Botha
## 3 Egypt Sultanate of… United Kingdom 28 February 1922 Fuad I
## 4 Eritrea Italian Erit… Italy 10 February 1947 Haile Selassie
## 5 Libya British Mili… United Kingdo… 24 December 1951 Idris
## 6 Sudan Anglo-Egypti… United Kingdo… 1 January 1956 Ismail al-Azhari
## 7 Tunisia French Prote… France 20 March 1956 Muhammad VIII a…
## 8 Morocco French Prote… France Spain 2 March 19567 A… Mohammed V
## 9 Ghana Gold Coast United Kingdom 6 March 1957 Kwame Nkrumah
## 10 Guinea French West … France 2 October 1958 Ahmed Sékou Tou…
## # … with 44 more rows, and 1 more variable: independence_won_through <chr>
As you can see, the date of independence is in a format that might make it difficult to answer questions such as Which African countries gained independence before 1960? for two reasons. First of all, the date uses the name of the month instead of its number, and second of all, the independence date column is of type character and not “date”. So our first task is to correctly define the column
as being of type date, while making sure that R understands that January is supposed to be “01”, and so
on. There are several helpful functions included in {lubridate}
to convert columns to dates. For instance
if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd()
,
for “year-month-day”. If, however, the column is “2012-21-11”, then you would use ydm(). There are a few of these helper functions, and they can handle a lot of different formats for dates. In our case,
having the name of the month instead of the number might seem quite problematic, but it turns out
that this is a case that {lubridate} handles painlessly:
library(lubridate)
independence <- independence %>%
mutate(independence_date = dmy(independence_date))
## Warning: 5 failed to parse.
Some dates failed to parse, for instance for Morocco. This is because these countries have several independence dates; this means that the string to convert looks like:
"2 March 1956
7 April 1956
10 April 1958
4 January 1969"
which obviously cannot be converted by {lubridate}
without further manipulation. I ignore these cases for
simplicity’s sake.
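To see the logic of these parsers on their own, here is a minimal sketch on toy strings (assuming {lubridate} is loaded as above); the function name simply follows the order of the components in the input:

```r
ymd("2012-11-21")        # year-month-day
dmy("21 November 2012")  # day-month-year; month names are handled
mdy("11/21/2012")        # month-day-year; separators don't matter
```

All three calls return the same Date object.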
4.7.2.2 Data manipulation with dates
Let’s take a look at the data now:
independence
## # A tibble: 54 x 6
## country colonial_name colonial_power independence_da… first_head_of_s…
## <chr> <chr> <chr> <date> <chr>
## 1 Liberia Liberia United States 1847-07-26 Joseph Jenkins …
## 2 South … Cape Colony … United Kingdom 1910-05-31 Louis Botha
## 3 Egypt Sultanate of… United Kingdom 1922-02-28 Fuad I
## 4 Eritrea Italian Erit… Italy 1947-02-10 Haile Selassie
## 5 Libya British Mili… United Kingdo… 1951-12-24 Idris
## 6 Sudan Anglo-Egypti… United Kingdo… 1956-01-01 Ismail al-Azhari
## 7 Tunisia French Prote… France 1956-03-20 Muhammad VIII a…
## 8 Morocco French Prote… France Spain NA Mohammed V
## 9 Ghana Gold Coast United Kingdom 1957-03-06 Kwame Nkrumah
## 10 Guinea French West … France 1958-10-02 Ahmed Sékou Tou…
## # … with 44 more rows, and 1 more variable: independence_won_through <chr>
As you can see, we now have a date column in the right format. We can now answer questions such as
Which countries gained independence before 1960? quite easily, by using the functions year()
,
month()
and day()
. Let’s see which countries gained independence before 1960:
independence %>%
filter(year(independence_date) <= 1960) %>%
pull(country)
## [1] "Liberia" "South Africa"
## [3] "Egypt" "Eritrea"
## [5] "Libya" "Sudan"
## [7] "Tunisia" "Ghana"
## [9] "Guinea" "Cameroon"
## [11] "Togo" "Mali"
## [13] "Madagascar" "Democratic Republic of the Congo"
## [15] "Benin" "Niger"
## [17] "Burkina Faso" "Ivory Coast"
## [19] "Chad" "Central African Republic"
## [21] "Republic of the Congo" "Gabon"
## [23] "Mauritania"
You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work
on it. This is the same for month()
or day()
. Let’s try to see if countries gained their independence on
Christmas Eve:
independence %>%
filter(month(independence_date) == 12,
day(independence_date) == 24) %>%
pull(country)
## [1] "Libya"
Seems like Libya was the only one! You can also operate on dates. For instance, let’s compute the difference between two dates, using the interval() function:
independence %>%
mutate(today = lubridate::today()) %>%
mutate(independent_since = interval(independence_date, today)) %>%
select(country, independent_since)
## # A tibble: 54 x 2
## country independent_since
## <chr> <S4: Interval>
## 1 Liberia 1847-07-26 UTC--2019-03-20 UTC
## 2 South Africa 1910-05-31 UTC--2019-03-20 UTC
## 3 Egypt 1922-02-28 UTC--2019-03-20 UTC
## 4 Eritrea 1947-02-10 UTC--2019-03-20 UTC
## 5 Libya 1951-12-24 UTC--2019-03-20 UTC
## 6 Sudan 1956-01-01 UTC--2019-03-20 UTC
## 7 Tunisia 1956-03-20 UTC--2019-03-20 UTC
## 8 Morocco NA--NA
## 9 Ghana 1957-03-06 UTC--2019-03-20 UTC
## 10 Guinea 1958-10-02 UTC--2019-03-20 UTC
## # … with 44 more rows
The independent_since
column now contains an interval object that we can convert to years:
independence %>%
mutate(today = lubridate::today()) %>%
mutate(independent_since = interval(independence_date, today)) %>%
select(country, independent_since) %>%
mutate(years_independent = as.numeric(independent_since, "years"))
## # A tibble: 54 x 3
## country independent_since years_independent
## <chr> <S4: Interval> <dbl>
## 1 Liberia 1847-07-26 UTC--2019-03-20 UTC 172.
## 2 South Africa 1910-05-31 UTC--2019-03-20 UTC 109.
## 3 Egypt 1922-02-28 UTC--2019-03-20 UTC 97.1
## 4 Eritrea 1947-02-10 UTC--2019-03-20 UTC 72.1
## 5 Libya 1951-12-24 UTC--2019-03-20 UTC 67.2
## 6 Sudan 1956-01-01 UTC--2019-03-20 UTC 63.2
## 7 Tunisia 1956-03-20 UTC--2019-03-20 UTC 63.0
## 8 Morocco NA--NA NA
## 9 Ghana 1957-03-06 UTC--2019-03-20 UTC 62.0
## 10 Guinea 1958-10-02 UTC--2019-03-20 UTC 60.5
## # … with 44 more rows
We can now see for how long the last country to gain independence has been independent. Because the data is not tidy (in some cases, an African country was colonized by two powers, see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:
independence %>%
filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
mutate(today = lubridate::today()) %>%
mutate(independent_since = interval(independence_date, today)) %>%
mutate(years_independent = as.numeric(independent_since, "years")) %>%
group_by(colonial_power) %>%
summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
## # A tibble: 4 x 2
## colonial_power last_colony_independent_for
## <chr> <dbl>
## 1 Belgium 56.7
## 2 France 41.7
## 3 Portugal 43.4
## 4 United Kingdom 42.7
4.7.2.3 Arithmetic with dates
Adding or subtracting days to dates is quite easy:
ymd("2018-12-31") + 16
## [1] "2019-01-16"
It is also possible to be more explicit and use days()
:
ymd("2018-12-31") + days(16)
## [1] "2019-01-16"
To add years, you can use years()
:
ymd("2018-12-31") + years(1)
## [1] "2019-12-31"
But you have to be careful with leap years:
ymd("2016-02-29") + years(1)
## [1] NA
Because 2017 is not a leap year, the above computation returns NA
. The same goes for months with
a different number of days:
ymd("2018-12-31") + months(2)
## [1] NA
The way to solve these issues is to use the special %m+%
infix operator:
ymd("2016-02-29") %m+% years(1)
## [1] "2017-02-28"
and for months:
ymd("2018-12-31") %m+% months(2)
## [1] "2019-02-28"
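There is also a companion operator for subtraction, %m-%, which rolls back to the last valid day instead of returning NA. A minimal sketch:

```r
# One month before March 31st: February has no 31st, so %m-%
# rolls back to the last day of February instead of returning NA.
ymd("2019-03-31") %m-% months(1)
```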
{lubridate}
contains many more functions. If you often work with dates, duration or interval
data, {lubridate}
is a package that you have to add to your toolbox.
4.7.3 Manipulate strings with {stringr}
{stringr}
contains functions to manipulate strings. In Chapter 10, I will teach you about regular
expressions, but the functions contained in {stringr}
allow you to already do a lot of work on
strings, without needing to be a regular expression expert.
I will discuss the most common string operations: detecting, locating, matching, searching and replacing, and extracting/removing strings.
To introduce these operations, let us use an ALTO file of an issue of The Winchester News from October 31, 1910, which you can find at this link (to see what the newspaper looked like, click here). I re-hosted the file on a public gist for archiving purposes. While working on the book, the original site went down several times…
ALTO is an XML schema for the description of text OCR and layout information of pages for digitized material, such as newspapers (source: ALTO Wikipedia page). For more details, you can read my blogpost on the matter, but for our current purposes, it is enough to know that the file contains the text of newspaper articles. The file looks like this:
<TextLine HEIGHT="138.0" WIDTH="2434.0" HPOS="4056.0" VPOS="5814.0">
<String STYLEREFS="ID7" HEIGHT="108.0" WIDTH="393.0" HPOS="4056.0" VPOS="5838.0" CONTENT="timore" WC="0.82539684">
<ALTERNATIVE>timole</ALTERNATIVE>
<ALTERNATIVE>tlnldre</ALTERNATIVE>
<ALTERNATIVE>timor</ALTERNATIVE>
<ALTERNATIVE>insole</ALTERNATIVE>
<ALTERNATIVE>landed</ALTERNATIVE>
</String>
<SP WIDTH="74.0" HPOS="4449.0" VPOS="5838.0"/>
<String STYLEREFS="ID7" HEIGHT="105.0" WIDTH="432.0" HPOS="4524.0" VPOS="5847.0" CONTENT="market" WC="0.95238096"/>
<SP WIDTH="116.0" HPOS="4956.0" VPOS="5847.0"/>
<String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="138.0" HPOS="5073.0" VPOS="5883.0" CONTENT="as" WC="0.96825397"/>
<SP WIDTH="74.0" HPOS="5211.0" VPOS="5883.0"/>
<String STYLEREFS="ID7" HEIGHT="69.0" WIDTH="285.0" HPOS="5286.0" VPOS="5877.0" CONTENT="were" WC="1.0">
<ALTERNATIVE>verc</ALTERNATIVE>
<ALTERNATIVE>veer</ALTERNATIVE>
</String>
<SP WIDTH="68.0" HPOS="5571.0" VPOS="5877.0"/>
<String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="147.0" HPOS="5640.0" VPOS="5838.0" CONTENT="all" WC="1.0"/>
<SP WIDTH="83.0" HPOS="5787.0" VPOS="5838.0"/>
<String STYLEREFS="ID7" HEIGHT="111.0" WIDTH="183.0" HPOS="5871.0" VPOS="5835.0" CONTENT="the" WC="0.95238096">
<ALTERNATIVE>tll</ALTERNATIVE>
<ALTERNATIVE>Cu</ALTERNATIVE>
<ALTERNATIVE>tall</ALTERNATIVE>
</String>
<SP WIDTH="75.0" HPOS="6054.0" VPOS="5835.0"/>
<String STYLEREFS="ID3" HEIGHT="132.0" WIDTH="351.0" HPOS="6129.0" VPOS="5814.0" CONTENT="cattle" WC="0.95238096"/>
</TextLine>
We are interested in the strings after CONTENT=
. We are going to use functions from the {stringr}
package to get the strings after CONTENT=
. In Chapter 10, we are going to explore this file
again, but using complex regular expressions to get all the content in one go.
4.7.3.1 Getting text data into Rstudio
First of all, let us read in the file:
winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt")
Even though the file is an XML file, I still read it in using read_lines()
and not read_xml()
from the {xml2}
package. This is for the purposes of the current exercise, and also because I
always have trouble with XML files, and prefer to treat them as simple text files, and use regular
expressions to get what I need.
Now that the ALTO file is read in and saved in the winchester
variable, you might want to print
the whole thing in the console. Before that, take a look at the structure:
str(winchester)
## chr [1:43] "" ...
So the winchester
variable is a character atomic vector with 43 elements. First, we need to understand what these elements are. Let’s start with the first one:
winchester[1]
## [1] ""
Ok, so it seems like the first element is part of the header of the file. What about the second one?
winchester[2]
## [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"><base href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f5f5f5;font:13px arial,sans-serif;text-align:left;color:#202020;border:0;margin:0;border-bottom:1px solid #cecece;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#609}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style><div id=\"bN015htcoyT__google-cache-hdr\"><div><span>This is Google's cache of <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml</a>.</span> <span>It is a snapshot of the page as it appeared on 21 Jan 2019 05:18:18 GMT.</span> <span>The <a href=\"https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml\">current page</a> could have changed in the meantime.</span> <a href=\"http://support.google.com/websearch/bin/answer.py?hl=en&p=cached&answer=1687222\"><span>Learn more</span>.</a></div><div><span style=\"display:inline-block;margin-top:8px;margin-right:104px;white-space:nowrap\"><span style=\"margin-right:28px\"><span style=\"font-weight:bold\">Full version</span></span><span style=\"margin-right:28px\"><a 
href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&hl=en&gl=lu&strip=1&vwsrc=0\"><span>Text-only version</span></a></span><span style=\"margin-right:28px\"><a href=\"http://webcache.googleusercontent.com/search?q=cache:2BVPV8QGj3oJ:https://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml&hl=en&gl=lu&strip=0&vwsrc=1\"><span>View source</span></a></span></span></div><span style=\"display:inline-block;margin-top:8px;color:#717171\"><span>Tip: To quickly find your search term on this page, press <b>Ctrl+F</b> or <b>⌘-F</b> (Mac) and use the find bar.</span></span></div><div style=\"position:relative;\"><?xml version=\"1.0\" encoding=\"UTF-8\"?>"
Same. So where is the content? The file is very large, so if you print it in the console, it will
take quite some time to print, and you will not really be able to make out anything. The best
way would be to try to detect the string CONTENT
and work from there.
4.7.3.2 Detecting, getting the position and locating strings
When confronted with an atomic vector of strings, you might want to know inside which elements you
can find certain strings. For example, to know which elements of winchester
contain the string
CONTENT
, use str_detect()
:
winchester %>%
str_detect("CONTENT")
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
This returns a boolean atomic vector of the same length as winchester
. If the string CONTENT
is
nowhere to be found, the result will equal FALSE
, if not it will equal TRUE
. Here it is easy to
see that the last element contains the string CONTENT
. But what if instead of having 43 elements,
the vector had 24192 elements? And hundreds would contain the string CONTENT
? It would be easier
to instead have the indices of the vector where one can find the word CONTENT
. This is possible
with str_which()
:
winchester %>%
str_which("CONTENT")
## [1] 43
Here, the result is 43, meaning that the 43rd element of winchester
contains the string CONTENT
somewhere. If we need more precision, we can use str_locate()
and str_locate_all()
. To explain
how both these functions work, let’s create a very small example:
ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius")
Now suppose I am interested in philosophers whose name ends in us
. Let us use str_locate()
first:
ancient_philosophers %>%
str_locate("us")
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] 8 9
## [4,] NA NA
## [5,] 7 8
## [6,] 5 6
You can interpret the result as follows: each row corresponds to an element of the vector, and the non-missing rows indicate where the string us is found. So the 3rd, 5th and 6th philosophers have us somewhere in their name.
The result also has two columns: start
and end
. These give the position of the string. So the
string us
can be found starting at position 8 of the 3rd element of the vector, and ends at position
9. Same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both
ending with us
. However, str_locate()
only shows the position of the us
in Marcus
.
To get both us
strings, you need to use str_locate_all()
:
ancient_philosophers %>%
str_locate_all("us")
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 8 9
##
## [[4]]
## start end
##
## [[5]]
## start end
## [1,] 7 8
##
## [[6]]
## start end
## [1,] 5 6
## [2,] 14 15
Now we get the position of the two us
in Marcus Aurelius. Doing this on the winchester
vector
will give use the position of the CONTENT
string, but this is not really important right now. What
matters is that you know how str_locate()
and str_locate_all()
work.
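The positions returned by str_locate() can be fed straight into str_sub() to extract the located substring. A minimal sketch, assuming ancient_philosophers is defined as above (here it simply recovers “us” where it was found, and NA elsewhere):

```r
# str_locate() returns a matrix with "start" and "end" columns,
# which str_sub() accepts as subsetting positions.
positions <- str_locate(ancient_philosophers, "us")
str_sub(ancient_philosophers,
        positions[, "start"],
        positions[, "end"])
```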
So now that we know what interests us in the 43rd element of winchester
, let’s take a closer
look at it:
winchester[43]
As you can see, it’s a mess:
<TextLine HEIGHT=\"126.0\" WIDTH=\"1731.0\" HPOS=\"17160.0\" VPOS=\"21252.0\"><String HEIGHT=\"114.0\" WIDTH=\"354.0\" HPOS=\"17160.0\" VPOS=\"21264.0\" CONTENT=\"0tV\" WC=\"0.8095238\"/><SP WIDTH=\"131.0\" HPOS=\"17514.0\" VPOS=\"21264.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"111.0\" WIDTH=\"474.0\" HPOS=\"17646.0\" VPOS=\"21258.0\" CONTENT=\"BATES\" WC=\"1.0\"/><SP WIDTH=\"140.0\" HPOS=\"18120.0\" VPOS=\"21258.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"114.0\" WIDTH=\"630.0\" HPOS=\"18261.0\" VPOS=\"21252.0\" CONTENT=\"President\" WC=\"1.0\"><ALTERNATIVE>Prcideht</ALTERNATIVE><ALTERNATIVE>Pride</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"153.0\" WIDTH=\"1689.0\" HPOS=\"17145.0\" VPOS=\"21417.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"258.0\" HPOS=\"17145.0\" VPOS=\"21439.0\" CONTENT=\"WM\" WC=\"0.82539684\"><TextLine HEIGHT=\"120.0\" WIDTH=\"2211.0\" HPOS=\"16788.0\" VPOS=\"21870.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"102.0\" HPOS=\"16788.0\" VPOS=\"21894.0\" CONTENT=\"It\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"16890.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"93.0\" HPOS=\"16962.0\" VPOS=\"21885.0\" CONTENT=\"is\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17055.0\" VPOS=\"21885.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"102.0\" WIDTH=\"417.0\" HPOS=\"17136.0\" VPOS=\"21879.0\" CONTENT=\"seldom\" WC=\"1.0\"/><SP WIDTH=\"80.0\" HPOS=\"17553.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"267.0\" HPOS=\"17634.0\" VPOS=\"21873.0\" CONTENT=\"hard\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"17901.0\" VPOS=\"21873.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"87.0\" WIDTH=\"111.0\" HPOS=\"17982.0\" VPOS=\"21879.0\" CONTENT=\"to\" WC=\"1.0\"/><SP WIDTH=\"81.0\" HPOS=\"18093.0\" VPOS=\"21879.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"96.0\" WIDTH=\"219.0\" HPOS=\"18174.0\" VPOS=\"21870.0\" CONTENT=\"find\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18393.0\" VPOS=\"21870.0\"/><String STYLEREFS=\"ID7\" 
HEIGHT=\"69.0\" WIDTH=\"66.0\" HPOS=\"18471.0\" VPOS=\"21894.0\" CONTENT=\"a\" WC=\"1.0\"/><SP WIDTH=\"77.0\" HPOS=\"18537.0\" VPOS=\"21894.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"78.0\" WIDTH=\"384.0\" HPOS=\"18615.0\" VPOS=\"21888.0\" CONTENT=\"succes\" WC=\"0.82539684\"><ALTERNATIVE>success</ALTERNATIVE></String></TextLine><TextLine HEIGHT=\"126.0\" WIDTH=\"2316.0\" HPOS=\"16662.0\" VPOS=\"22008.0\"><String STYLEREFS=\"ID7\" HEIGHT=\"75.0\" WIDTH=\"183.0\" HPOS=\"16662.0\" VPOS=\"22059.0\" CONTENT=\"sor\" WC=\"1.0\"><ALTERNATIVE>soar</ALTERNATIVE></String><SP WIDTH=\"72.0\" HPOS=\"16845.0\" VPOS=\"22059.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"90.0\" WIDTH=\"168.0\" HPOS=\"16917.0\" VPOS=\"22035.0\" CONTENT=\"for\" WC=\"1.0\"/><SP WIDTH=\"72.0\" HPOS=\"17085.0\" VPOS=\"22035.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"69.0\" WIDTH=\"267.0\" HPOS=\"17157.0\" VPOS=\"22050.0\" CONTENT=\"even\" WC=\"1.0\"><ALTERNATIVE>cen</ALTERNATIVE><ALTERNATIVE>cent</ALTERNATIVE></String><SP WIDTH=\"77.0\" HPOS=\"17434.0\" VPOS=\"22050.0\"/><String STYLEREFS=\"ID7\" HEIGHT=\"66.0\" WIDTH=\"63.0\" HPOS=\"17502.0\" VPOS=\"22044.0\"
The file was imported without any newlines. So we need to insert them ourselves, by splitting the string in a clever way.
4.7.3.3 Splitting strings
There are two functions included in {stringr}
to split strings, str_split()
and str_split_fixed()
.
Let’s go back to our ancient philosophers. Two of them, Seneca the Younger and Marcus Aurelius, have something else in common besides both being Roman Stoic philosophers: their names are composed of several words. If we want to split their names at the space character, we can use str_split()
like this:
ancient_philosophers %>%
str_split(" ")
## [[1]]
## [1] "aristotle"
##
## [[2]]
## [1] "plato"
##
## [[3]]
## [1] "epictetus"
##
## [[4]]
## [1] "seneca" "the" "younger"
##
## [[5]]
## [1] "epicurus"
##
## [[6]]
## [1] "marcus" "aurelius"
str_split()
also has a simplify = TRUE
option:
ancient_philosophers %>%
str_split(" ", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "aristotle" "" ""
## [2,] "plato" "" ""
## [3,] "epictetus" "" ""
## [4,] "seneca" "the" "younger"
## [5,] "epicurus" "" ""
## [6,] "marcus" "aurelius" ""
This time, the returned object is a matrix.
What about str_split_fixed()
? The difference is that here you can specify the number of pieces
to return. For example, you could consider the name “Aurelius” to be the middle name of Marcus Aurelius, and “the younger” to be the middle name of Seneca the Younger. This means that you would want
to split the name only at the first space character, and not at all of them. This is easily achieved
with str_split_fixed()
:
ancient_philosophers %>%
str_split_fixed(" ", 2)
## [,1] [,2]
## [1,] "aristotle" ""
## [2,] "plato" ""
## [3,] "epictetus" ""
## [4,] "seneca" "the younger"
## [5,] "epicurus" ""
## [6,] "marcus" "aurelius"
This gives the expected result.
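As a side note, str_split() itself also has an n argument that caps the number of pieces, much like str_split_fixed(), but it returns a list instead of a matrix. A minimal sketch:

```r
# At most 2 pieces per name; the remainder stays in the second piece.
ancient_philosophers %>%
  str_split(" ", n = 2)
```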
So how does this help in our case? Well, if you look at the ALTO file at the beginning of this section, you will notice that every tag ends with the “>” character. So let’s split at that character!
winchester_text <- winchester[43] %>%
str_split(">")
Let’s take a closer look at winchester_text
:
str(winchester_text)
## List of 1
## $ : chr [1:19706] "</processingStepSettings" "<processingSoftware" "<softwareCreator" "iArchives</softwareCreator" ...
So this is a list of length one, and the first, and only, element of that list is an atomic vector with 19706 elements. Since this is a list of only one element, we can simplify it by saving the atomic vector in a variable:
winchester_text <- winchester_text[[1]]
Let’s now look at some lines:
winchester_text[1232:1245]
## [1] "<SP WIDTH=\"66.0\" HPOS=\"5763.0\" VPOS=\"9696.0\"/"
## [2] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"612.0\" HPOS=\"5829.0\" VPOS=\"9693.0\" CONTENT=\"Louisville\" WC=\"1.0\""
## [3] "<ALTERNATIVE"
## [4] "Loniile</ALTERNATIVE"
## [5] "<ALTERNATIVE"
## [6] "Lenities</ALTERNATIVE"
## [7] "</String"
## [8] "</TextLine"
## [9] "<TextLine HEIGHT=\"150.0\" WIDTH=\"2520.0\" HPOS=\"4032.0\" VPOS=\"9849.0\""
## [10] "<String STYLEREFS=\"ID7\" HEIGHT=\"108.0\" WIDTH=\"510.0\" HPOS=\"4032.0\" VPOS=\"9861.0\" CONTENT=\"Tobacco\" WC=\"1.0\"/"
## [11] "<SP WIDTH=\"113.0\" HPOS=\"4542.0\" VPOS=\"9861.0\"/"
## [12] "<String STYLEREFS=\"ID7\" HEIGHT=\"105.0\" WIDTH=\"696.0\" HPOS=\"4656.0\" VPOS=\"9861.0\" CONTENT=\"Warehouse\" WC=\"1.0\""
## [13] "<ALTERNATIVE"
## [14] "WHrchons</ALTERNATIVE"
This now looks easier to handle. We can narrow it down to only the lines that contain the string we are interested in, “CONTENT”. First, let’s get the indices:
content_winchester_index <- winchester_text %>%
str_which("CONTENT")
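For comparison (an aside, with made-up sample lines): base R’s grep() returns the same kind of index vector as str_which():

```r
# grep() returns the indices of the elements matching the pattern,
# just like str_which(); the sample lines below are made up for illustration
lines <- c("<String CONTENT=\"Louisville\"", "<ALTERNATIVE", "<String CONTENT=\"Tobacco\"")
grep("CONTENT", lines, fixed = TRUE)  # 1 3
```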
How many lines contain the string “CONTENT”?
length(content_winchester_index)
## [1] 4462
As you can see, this reduces the amount of data we have to work with. Let’s save this in a new variable:
content_winchester <- winchester_text[content_winchester_index]
4.7.3.4 Matching strings
Matching strings is useful, but only in combination with regular expressions. As stated at the
beginning of this section, we are going to learn about regular expressions in Chapter 10, but in
order to make this section useful, we are going to learn the easiest, but perhaps the most useful
regular expression: .*
.
Let’s go back to our ancient philosophers, and use str_match()
and see what happens. Let’s match
the “us” string:
ancient_philosophers %>%
str_match("us")
## [,1]
## [1,] NA
## [2,] NA
## [3,] "us"
## [4,] NA
## [5,] "us"
## [6,] "us"
Not very useful, but what about the regular expression .*
? How could it help?
ancient_philosophers %>%
str_match(".*us")
## [,1]
## [1,] NA
## [2,] NA
## [3,] "epictetus"
## [4,] NA
## [5,] "epicurus"
## [6,] "marcus aurelius"
That’s already very interesting! So how does .*
work? To understand, let’s first start by using
.
alone:
ancient_philosophers %>%
str_match(".us")
## [,1]
## [1,] NA
## [2,] NA
## [3,] "tus"
## [4,] NA
## [5,] "rus"
## [6,] "cus"
This also matched whatever symbol comes just before the “u” from “us”. What if we use two .
instead?
ancient_philosophers %>%
str_match("..us")
## [,1]
## [1,] NA
## [2,] NA
## [3,] "etus"
## [4,] NA
## [5,] "urus"
## [6,] "rcus"
This time, we get the two symbols that immediately precede “us”. Instead of continuing like this,
we now use *
, which matches the preceding symbol zero or more times. So by combining .
and *
, we can match
any symbol repeatedly, until there is nothing more to match. Note that there is also +
, which works
similarly to *
, but matches the preceding symbol one or more times.
There is also a str_match_all()
:
ancient_philosophers %>%
str_match_all(".*us")
## [[1]]
## [,1]
##
## [[2]]
## [,1]
##
## [[3]]
## [,1]
## [1,] "epictetus"
##
## [[4]]
## [,1]
##
## [[5]]
## [,1]
## [1,] "epicurus"
##
## [[6]]
## [,1]
## [1,] "marcus aurelius"
In this particular case it does not change the end result, but keep it in mind for cases like this one:
c("haha", "huhu") %>%
str_match("ha")
## [,1]
## [1,] "ha"
## [2,] NA
and:
c("haha", "huhu") %>%
str_match_all("ha")
## [[1]]
## [,1]
## [1,] "ha"
## [2,] "ha"
##
## [[2]]
## [,1]
What if we want to match names containing the letter “t”? Easy:
ancient_philosophers %>%
str_match(".*t.*")
## [,1]
## [1,] "aristotle"
## [2,] "plato"
## [3,] "epictetus"
## [4,] "seneca the younger"
## [5,] NA
## [6,] NA
So how does this help us with our historical newspaper? Let’s try to get the strings that come after “CONTENT”:
winchester_content <- winchester_text %>%
str_match("CONTENT.*")
Let’s use our faithful str()
function to take a look:
winchester_content %>%
str
## chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
Hum, there are a lot of NA
values! This is because many lines of the file did not contain the
string “CONTENT”, so no match was possible. Let’s remove all these NA
s. Because the
result is a matrix, we cannot use the filter()
function from {dplyr}
. So we need to convert it
to a tibble first:
winchester_content <- winchester_content %>%
as_tibble() %>%
filter(!is.na(V1))
Because matrix columns do not have names, when a matrix gets converted into a tibble, the first column
gets automatically called V1
. This is why I filter on this column. Let’s take a look at the data:
head(winchester_content)
## # A tibble: 6 x 1
## V1
## <chr>
## 1 "CONTENT=\"J\" WC=\"0.8095238\"/"
## 2 "CONTENT=\"a\" WC=\"0.8095238\"/"
## 3 "CONTENT=\"Ira\" WC=\"0.95238096\"/"
## 4 "CONTENT=\"mj\" WC=\"0.8095238\"/"
## 5 "CONTENT=\"iI\" WC=\"0.8095238\"/"
## 6 "CONTENT=\"tE1r\" WC=\"0.8095238\"/"
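If you prefer to stay in base R (an aside, shown on a made-up one-column matrix), the NA rows can also be dropped with logical subsetting, without converting to a tibble:

```r
# Dropping NA entries from a one-column matrix, base R style
m <- matrix(c(NA, "CONTENT=\"J\"", NA, "CONTENT=\"a\""), ncol = 1)
m[!is.na(m[, 1]), 1]  # "CONTENT=\"J\"" "CONTENT=\"a\""
```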
4.7.3.5 Searching and replacing strings
We are getting close to the final result. We still need to do some cleaning however. Since our data is inside a nice tibble, we might as well stick with it. So let’s first rename the column and change all the strings to lowercase:
winchester_content <- winchester_content %>%
mutate(content = tolower(V1)) %>%
select(-V1)
Let’s take a look at the result:
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 "content=\"j\" wc=\"0.8095238\"/"
## 2 "content=\"a\" wc=\"0.8095238\"/"
## 3 "content=\"ira\" wc=\"0.95238096\"/"
## 4 "content=\"mj\" wc=\"0.8095238\"/"
## 5 "content=\"ii\" wc=\"0.8095238\"/"
## 6 "content=\"te1r\" wc=\"0.8095238\"/"
The second part of the string, “wc=….” is not really interesting. Let’s search and replace this
with an empty string, using str_replace()
:
winchester_content <- winchester_content %>%
mutate(content = str_replace(content, "wc.*", ""))
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 "content=\"j\" "
## 2 "content=\"a\" "
## 3 "content=\"ira\" "
## 4 "content=\"mj\" "
## 5 "content=\"ii\" "
## 6 "content=\"te1r\" "
We used the regular expression from before to replace “wc” and every character that follows it with an empty string. The same approach can be used to remove “content=”:
winchester_content <- winchester_content %>%
mutate(content = str_replace(content, "content=", ""))
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 "\"j\" "
## 2 "\"a\" "
## 3 "\"ira\" "
## 4 "\"mj\" "
## 5 "\"ii\" "
## 6 "\"te1r\" "
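The same two replacements can be done in base R with sub() (a sketch with made-up input, shown as an aside):

```r
# sub() replaces the first match of a pattern, like str_replace()
x <- c("content=\"j\" wc=\"0.8095238\"/", "content=\"ira\" wc=\"0.95238096\"/")
x <- sub("wc.*", "", x)       # drop "wc" and everything after it
x <- sub("content=", "", x)   # drop the "content=" prefix
x  # "\"j\" "  "\"ira\" "
```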
We are almost done, but some cleaning is still necessary.
4.7.3.6 Extracting or removing strings
Now, because I know the ALTO spec, I know how to find words that are split across two lines:
winchester_content %>%
filter(str_detect(content, "hyppart"))
## # A tibble: 64 x 1
## content
## <chr>
## 1 "\"aver\" subs_type=\"hyppart1\" subs_content=\"average\" "
## 2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "
## 3 "\"considera\" subs_type=\"hyppart1\" subs_content=\"consideration\" "
## 4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "
## 5 "\"re\" subs_type=\"hyppart1\" subs_content=\"resigned\" "
## 6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "
## 7 "\"install\" subs_type=\"hyppart1\" subs_content=\"installed\" "
## 8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "
## 9 "\"be\" subs_type=\"hyppart1\" subs_content=\"before\" "
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "
## # … with 54 more rows
For instance, the word “average” was split over two lines, the first part of the word, “aver” on the
first line, and the second part of the word, “age”, on the second line. We want to keep what comes
after “subs_content”. Let’s extract the word “average” using str_extract()
. However, because only
some words were split between two lines, we first need to detect where the string “hyppart1” is
located, and only then can we extract what comes after “subs_content”. Thus, we need to combine
str_detect()
to first detect the string, and then str_extract()
to extract what comes after
“subs_content”:
winchester_content <- winchester_content %>%
mutate(content = if_else(str_detect(content, "hyppart1"),
str_extract_all(content, "content=.*", simplify = TRUE),
content))
Let’s take a look at the result:
winchester_content %>%
filter(str_detect(content, "content"))
## # A tibble: 64 x 1
## content
## <chr>
## 1 "content=\"average\" "
## 2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "
## 3 "content=\"consideration\" "
## 4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "
## 5 "content=\"resigned\" "
## 6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "
## 7 "content=\"installed\" "
## 8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "
## 9 "content=\"before\" "
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "
## # … with 54 more rows
We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”, which are not needed now:
winchester_content <- winchester_content %>%
mutate(content = str_replace(content, "content=", "")) %>%
mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content))
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 "\"j\" "
## 2 "\"a\" "
## 3 "\"ira\" "
## 4 "\"mj\" "
## 5 "\"ii\" "
## 6 "\"te1r\" "
Almost done! We only need to remove the "
characters:
winchester_content <- winchester_content %>%
mutate(content = str_replace_all(content, "\"", ""))
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 "j "
## 2 "a "
## 3 "ira "
## 4 "mj "
## 5 "ii "
## 6 "te1r "
Let’s remove the trailing space characters with str_trim()
:
winchester_content <- winchester_content %>%
mutate(content = str_trim(content))
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 j
## 2 a
## 3 ira
## 4 mj
## 5 ii
## 6 te1r
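Base R has an equivalent of str_trim(), trimws() (a small aside):

```r
# trimws() strips leading and trailing whitespace, like str_trim()
trimws(c("j ", " ira ", "te1r "))  # "j" "ira" "te1r"
```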
To finish off this section, let’s remove stop words (words that do not add any meaning to a sentence,
such as “as”, “and”…) and words that are composed of 3 characters or fewer. You can find a dataset
with stopwords inside the {stopwords}
package:
library(stopwords)
data(data_stopwords_stopwordsiso)
eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en)
winchester_content <- winchester_content %>%
anti_join(eng_stopwords) %>%
filter(nchar(content) > 3)
## Joining, by = "content"
head(winchester_content)
## # A tibble: 6 x 1
## content
## <chr>
## 1 te1r
## 2 jilas
## 3 edition
## 4 winchester
## 5 news
## 6 injuries
That’s it for this section! You now know how to work with strings, but in Chapter 10 we are going one step further by learning about regular expressions, which offer much more power.
4.7.4 Tidy data frames with {tibble}
We have already seen and used several functions from the {tibble}
package. Let’s now go through
some more useful functions.
4.7.4.1 Creating tibbles
tribble()
makes it easy to create a tibble row by row, manually:
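The code example seems to be missing here; below is a minimal sketch of what such a tribble() call looks like (the column names and values are illustrative, not from the original):

```r
library(tibble)

# Column headers are given with a leading ~, then the values follow row by row
tribble(
  ~combustion, ~doors,
  "oil",            3,
  "diesel",         5
)
```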
It is also possible to create a tibble from a named list:
as_tibble(list("combustion" = c("oil", "diesel", "oil", "electric"),
"doors" = c(3, 5, 5, 5)))
## # A tibble: 4 x 2
## combustion doors
## <chr> <dbl>
## 1 oil 3
## 2 diesel 5
## 3 oil 5
## 4 electric 5
enframe(list("combustion" = c(1,2), "doors" = c(1,2,4), "cylinders" = c(1,8,9,10)))
## # A tibble: 3 x 2
## name value
## <chr> <list>
## 1 combustion <dbl [2]>
## 2 doors <dbl [3]>
## 3 cylinders <dbl [4]>
4.8 List-columns
To learn about list-columns, let’s first focus on a single character of the starwars
dataset:
data(starwars)
starwars %>%
filter(name == "Luke Skywalker") %>%
glimpse()
## Observations: 1
## Variables: 13
## $ name <chr> "Luke Skywalker"
## $ height <int> 172
## $ mass <dbl> 77
## $ hair_color <chr> "blond"
## $ skin_color <chr> "fair"
## $ eye_color <chr> "blue"
## $ birth_year <dbl> 19
## $ gender <chr> "male"
## $ homeworld <chr> "Tatooine"
## $ species <chr> "Human"
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "The …
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">]
## $ starships <list> [<"X-wing", "Imperial shuttle">]
We see that the columns films
, vehicles
and starships
are all lists, and in the case of
films
, it lists all the films in which Luke Skywalker has appeared. What if you want to take a closer look at this list?
starwars %>%
filter(name == "Luke Skywalker") %>%
pull(films)
## [[1]]
## [1] "Revenge of the Sith" "Return of the Jedi"
## [3] "The Empire Strikes Back" "A New Hope"
## [5] "The Force Awakens"
pull()
is a dplyr
function that extracts (pulls) the column you’re interested in. It is quite
useful when you want to inspect a column.
Suppose we want to create a categorical variable which counts the number of movies in which the
characters have appeared. For this we need to compute the length of the list, or count the number
of elements this list has. Let’s try with length()
a base R function:
starwars %>%
filter(name == "Luke Skywalker") %>%
pull(films) %>%
length()
## [1] 1
This might be surprising at first, because we know that Luke Skywalker has appeared in more than 1
movie… the problem here is that for each individual, films
is a list, whose single element is
a vector of characters. This means that length(films)
computes the length of the list, which is
one, and not the length of the vector contained in the list! How can we get the length of the
vector of characters contained in the list, for each character of the dataset? For this we need to use
dplyr::rowwise()
, remove the filter()
call, and use mutate()
to add this column to the
dataset:
starwars <- starwars %>%
rowwise() %>%
mutate(n_films = length(films))
dplyr::rowwise()
is useful when working with list-columns: columns that have lists as elements.
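For what it’s worth, base R’s lengths() computes the length of each element of a list in one call (an aside on a made-up list, not the book’s approach):

```r
# lengths() returns the length of every element of a list at once
films <- list(c("A New Hope", "The Force Awakens"), "The Force Awakens")
lengths(films)  # 2 1
```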
Let’s take a look at the characters and the number of films they have appeared in:
starwars %>%
select(name, n_films)
## Source: local data frame [87 x 2]
## Groups: <by row>
##
## # A tibble: 87 x 2
## name n_films
## <chr> <int>
## 1 Luke Skywalker 5
## 2 C-3PO 6
## 3 R2-D2 7
## 4 Darth Vader 4
## 5 Leia Organa 5
## 6 Owen Lars 3
## 7 Beru Whitesun lars 3
## 8 R5-D4 1
## 9 Biggs Darklighter 1
## 10 Obi-Wan Kenobi 6
## # … with 77 more rows
Now we can create a factor variable that groups characters by asking whether they appeared only in 1 movie, or more:
starwars <- starwars %>%
mutate(more_1 = case_when(n_films == 1 ~ "Exactly one movie",
n_films != 1 ~ "More than 1 movie"))
case_when()
is a dplyr
function that works similarly to the standard if..else..
construct of
many programming languages (R also has this, we are going to learn about it in later chapters).
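With only two outcomes, base R’s ifelse() would do the same job as this particular case_when() (a sketch with made-up values):

```r
# Vectorized two-branch alternative to the case_when() above
n_films <- c(1, 5, 1, 7)
ifelse(n_films == 1, "Exactly one movie", "More than 1 movie")
# "Exactly one movie" "More than 1 movie" "Exactly one movie" "More than 1 movie"
```

case_when() becomes more convenient than nested ifelse() calls once there are more than two branches.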
You can also create list columns with your own datasets, by using tidyr::nest()
. Remember the
fake survey_data
I created to illustrate spread()
and gather()
? Let’s go back to that dataset
again:
print(survey_data)
## # A tibble: 12 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 var1 1
## 2 1 var2 0.2
## 3 NA var3 0.3
## 4 2 var1 1.4
## 5 2 var2 1.9
## 6 2 var3 4.1
## 7 3 var1 0.1
## 8 3 var2 2.8
## 9 3 var3 8.9
## 10 4 var1 1.7
## 11 NA var2 1.9
## 12 4 var3 7.6
nested_data <- survey_data %>%
nest(variable, value)
print(nested_data)
## # A tibble: 5 x 2
## id data
## <dbl> <list>
## 1 1 <tibble [2 × 2]>
## 2 NA <tibble [2 × 2]>
## 3 2 <tibble [3 × 2]>
## 4 3 <tibble [3 × 2]>
## 5 4 <tibble [2 × 2]>
This creates a new tibble, with columns id
and data
. data
is a list-column that contains
tibbles; each tibble is the variable
and value
for each individual:
nested_data %>%
filter(id == "1") %>%
pull(data)
## [[1]]
## # A tibble: 2 x 2
## variable value
## <chr> <dbl>
## 1 var1 1
## 2 var2 0.2
As you can see, for individual 1, the column data contains a 2x2 tibble with columns variable
and
value
. You might be wondering why this is useful, because this seems to introduce an unnecessary
layer of complexity. The usefulness of list-columns will become apparent in the next chapters,
where we are going to learn how to repeat actions over, say, individuals.
4.9 Going beyond descriptive statistics and data manipulation
The {tidyverse}
collection of packages can do much more than simply data manipulation and
descriptive statistics. You can use the principles we have covered and the functions you now know
to do much more. For instance, you can use a few {tidyverse}
functions to do Monte Carlo simulations,
for example to estimate \(\pi\).
Draw the unit circle inside the unit square; the ratio of the area of the circle to the area of the square is \(\pi/4\). Now shoot K arrows at the square; roughly \(K*\pi/4\) of them should fall inside the circle. So if you shoot N arrows at the square and M fall inside the circle, you have the relationship \(M \approx N*\pi/4\). You can thus estimate \(\pi\) like so: \(\pi \approx 4*M/N\).
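As a quick numeric sanity check (the counts below are made up for illustration): if M = 3927 of N = 5000 points fell inside the circle, the estimate would be:

```r
M <- 3927  # hypothetical number of points inside the circle
N <- 5000  # total number of points
4 * M / N  # 3.1416
```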
The more arrows N you throw at the square, the better approximation of \(\pi\) you’ll have. Let’s try to do this with a tidy Monte Carlo simulation. First, let’s randomly pick some points inside the unit square:
library(tidyverse)
n <- 5000
set.seed(2019)
points <- tibble("x" = runif(n), "y" = runif(n))
Now, to know if a point is inside the unit circle, we need to check whether \(x^2 + y^2 < 1\). Let’s
add a new column to the points
tibble, called inside
equal to 1 if the point is inside the
unit circle and 0 if not:
points <- points %>%
mutate(inside = map2_dbl(.x = x, .y = y, ~ifelse(.x**2 + .y**2 < 1, 1, 0))) %>%
rowid_to_column("N")
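As an aside, since ^ and < are vectorized in R, the map2_dbl() call could also be written without mapping at all (a sketch on made-up points):

```r
# Vectorized alternative: no row-by-row mapping needed
x <- c(0.1, 0.9)
y <- c(0.2, 0.9)
as.numeric(x^2 + y^2 < 1)  # 1 0
```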
Let’s take a look at points
:
points
## # A tibble: 5,000 x 4
## N x y inside
## <int> <dbl> <dbl> <dbl>
## 1 1 0.770 0.984 0
## 2 2 0.713 0.0107 1
## 3 3 0.303 0.133 1
## 4 4 0.618 0.0378 1
## 5 5 0.0505 0.677 1
## 6 6 0.0432 0.0846 1
## 7 7 0.820 0.727 0
## 8 8 0.00961 0.0758 1
## 9 9 0.102 0.373 1
## 10 10 0.609 0.676 1
## # … with 4,990 more rows
Now, I can compute the estimate
of \(\pi\) at each row, by computing the cumulative sum of the 1’s in the inside
column and dividing
that by the current value of the N
column:
points <- points %>%
mutate(estimate = 4*cumsum(inside)/N)
cumsum(inside)
is the M
from the formula. Now, we can finish by plotting the result:
ggplot(points) +
geom_line(aes(y = estimate, x = N)) +
geom_hline(yintercept = pi)
In the next chapter, we are going to learn all about {ggplot2}
, the package I used in the lines
above to create this plot.
As the number of tries grows, the estimation of \(\pi\) gets better.
Using a data frame as a structure to hold our simulated points and the results makes it very easy to avoid loops, and thus to write code that is more concise and easier to follow. If you studied a quantitative field at university, you might have done a similar exercise back then, very likely by defining a matrix to hold your points and an empty vector to hold whether a particular point was inside the unit circle. Then you would have written a loop to compute whether each point was inside the unit circle, saved the result in the previously defined vector, and finally computed the estimate of \(\pi\). Again, I take this opportunity to stress that there is nothing wrong with this approach per se, but R is better suited for a workflow where lists or data frames are the central objects and where the analyst operates over them with functional programming techniques.
4.10 Exercises
Exercise 1
Load the LaborSupply
dataset from the Ecdat
package and answer the following questions:
- Compute the average annual hours worked by year (plus standard deviation)
- What age group worked the most hours in the year 1982?
- Create a variable, n_years
, that equals the number of years an individual stays in the panel. Is the panel balanced?
- Which are the individuals that do not have any kids during the whole period? Create a variable, no_kids
, that flags these individuals (1 = no kids, 0 = kids).
- Using the no_kids
variable from before, compute the average wage, standard deviation and number of observations in each group for the year 1980 (no kids group vs kids group).
- Create the lagged logarithm of hours worked and wages. Remember that this is a panel.
Exercise 2
- What does the following code do? Copy and paste it in an R interpreter to find out!
LaborSupply %>%
group_by(id) %>%
mutate_at(vars(starts_with("l")), funs(lag, lead))
mutate_at()
is a scoped version of mutate()
which allows you to specify a number of columns and
functions in one go. This also exists for summarise()
.
- Using summarise_at()
, compute the mean, standard deviation and number of individuals of lnhr
and lnwg
for each individual.
Exercise 3
In the dataset folder you downloaded at the beginning of the chapter, there is a folder called “unemployment”. I used the data in the section about working with lists of datasets.
- Using rio::import_list()
, read the 4 datasets into R.
- Using map()
, map the janitor::clean_names()
function to each dataset (just like in the example in the section on working with lists of datasets). Then, still with map()
and mutate()
, convert all commune names in the commune
column with the function tolower()
, in a new column called lcommune
.
This is not an easy exercise; so here are some hints:
- Remember that all_datasets
is a list of datasets. Which function do you use when you want to map a function to each element of a list?
- Each element of all_datasets
is a data.frame
object. Which function do you use to add a column to a data.frame
?
- What symbol can you use to access a column of a data.frame
?