Chapter 8 Functional programming
By now, you are pretty familiar with functions in R. Now, let’s introduce the functional programming paradigm, but first, let’s go over some formal definitions.
8.1 Functions definition
You should be familiar with function definitions in R. For example, suppose you want to compute the square root of a number and want to do so using Newton’s algorithm:
sqrt_newton <- function(a, init, eps = 0.01){
while(abs(init**2 - a) > eps){
init <- 1/2 *(init + a/init)
}
return(init)
}
You can then use this function to get the square root of a number:
sqrt_newton(16, 2)
## [1] 4.00122
We are using a while
loop inside the body. The body of a function are the instructions that
define the function. You can get the body of a function with body(some_func)
of the function.
In pure functional programming languages, like Haskell, you don’t have loops. How can you
program without loops, you may ask? In functional programming, loops are replaced by recursion,
which we already discussed in the previous chapter. Let’s rewrite our little example above
with recursion:
sqrt_newton_recur <- function(a, init, eps = 0.01){
if(abs(init**2 - a) < eps){
result <- init
} else {
init <- 1/2 * (init + a/init)
result <- sqrt_newton_recur(a, init, eps)
}
return(result)
}
sqrt_newton_recur(16, 2)
## [1] 4.00122
R is not a pure functional programming language though, so we can still use loops (be it while
or
for
loops) in the bodies of our functions. As discussed in the previous chapter, it is actually
better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized.
I won’t got into the details of what tail-call optimization is but just remember that if
performance is important a loop will be faster. However, sometimes, it is easier to write a
function using recursion. I personally tend to avoid loops if performance is not important,
because I find that code that avoids loops is easier to read and debug. However, knowing that
you can use loops is reassuring. In the coming sections I will show you some built-in functions
that make it possible to avoid writing loops and that don’t rely on recursion, so performance
won’t be penalized.
8.2 Properties of functions
Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way.
For example, the following function:
increment <- function(x){
return(x + 1)
}
Is a referential transparent function. We always get the same result for any x
that we give to
this function.
This:
increment(10)
## [1] 11
will always produce 11
.
However, this one:
increment_opaque <- function(x){
return(x + spam)
}
is not a referential transparent function, because its value depends on the global variable spam
.
spam <- 1
increment_opaque(10)
## [1] 11
will only produce 11
if spam = 1
. But what if spam = 19
?
spam <- 19
increment_opaque(10)
## [1] 29
To make increment_opaque()
a referential transparent function, it is enough to make spam
an
argument:
increment_not_opaque <- function(x, spam){
return(x + spam)
}
Now even if there is a global variable called spam
, this will not influence our function:
spam <- 19
increment_not_opaque(10, 34)
## [1] 44
This is because the variable spam
defined in the body of the function is a local variable. It
could have been called anything else, really. Avoiding opaque functions makes our life easier.
Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program.
For example this function (which is not a referential transparent function):
count_iter <- 0
sqrt_newton_side_effect <- function(a, init, eps = 0.01){
while(abs(init**2 - a) > eps){
init <- 1/2 *(init + a/init)
count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the
} # RHS value in a variable in the global environment
return(init)
}
If you look in the environment pane, you will see that count_iter
equals 0. Now call this
function with the following arguments:
sqrt_newton_side_effect(16000, 2)
## [1] 126.4911
print(count_iter)
## [1] 9
If you check the value of count_iter
now, you will see that it increased! This is a side effect,
because the function changed something outside of its scope. It changed a value in the global
environment. In general, it is good practice to avoid side-effects. For example, we could make the
above function not have any side effects like this:
sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
while(abs(init**2 - a) > eps){
init <- 1/2 *(init + a/init)
count_iter <- count_iter + 1
}
return(c(init, count_iter))
}
Now, this function returns a list with two elements, the result, and the number of iterations it took to get the result:
sqrt_newton_count(16000, 2)
## [1] 126.4911 9.0000
Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided since you want to write to disk. Just remember: try to avoid having functions changing variables in the global environment unless you have a very good reason of doing so.
Finally, another property of mathematical functions, is that they do one single thing. Functional programming purists also program their functions to do one single task. This has benefits, but can complicate things. The function we wrote previously does two things: it computes the square root of a number and also returns the number of iterations it took to compute the result. However, this is not a bad thing; the function is doing two tasks, but these tasks are related to each other and it makes sense to have them together. My piece of advice: avoid having functions that do many unrelated things. This makes debugging harder.
In conclusion: you should strive for referential transparency, try to avoid side effects unless you have a good reason to have them and try to keep your functions short and do as little tasks as possible. This makes testing and debugging easier, as you will see.
8.3 Functional programming with {purrr}
Hadley Wickham developed a package called purrr
which contains a lot of very useful functions.
8.3.1 The map*()
family of functions
In the previous section we saw how to map a function to each element of a list. Each version of an
*apply()
function has a different purpose, but it is not very easy to remember which one returns
a list, which other one returns an atomic vector and so on. If you’re working on data frames you
can use apply()
to sum (for example) over columns or rows, because you can specify which MARGIN
you want to sum over. But you do not get a data frame back. In the purrr
package, each of the
functions that do mapping have a similar name. The first part of these functions’ names all start
with map_
and the second part tells you what this function is going to output. For example, if
you want double
s out, you would use map_dbl()
. If you are working on data frames want a data
frame back, you would use map_df()
. These are much more intuitive and easier to remember and we’re
going to learn how to use them in the chapter about The Tidyverse. For now, let’s just focus on
the basic functions, map()
and reduce()
(and some variants of reduce()
). To map a function
to every element of a list, simply use map()
:
library("purrr")
numbers <- c(7, 8, 19, 64)
map(numbers, sqrt_newton, init = 1)
## [[1]]
## [1] 2.645767
##
## [[2]]
## [1] 2.828469
##
## [[3]]
## [1] 4.358902
##
## [[4]]
## [1] 8.000002
If you want print “hello” using a function from purrr
you would need to use rerun()
:
rerun(10, "hello")
## [[1]]
## [1] "hello"
##
## [[2]]
## [1] "hello"
##
## [[3]]
## [1] "hello"
##
## [[4]]
## [1] "hello"
##
## [[5]]
## [1] "hello"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] "hello"
##
## [[8]]
## [1] "hello"
##
## [[9]]
## [1] "hello"
##
## [[10]]
## [1] "hello"
rerun()
simply runs an expression (which can be arbitrarily complex) n
times, whereas map()
maps a function to a list of inputs, so to achieve the same with map()
, you need to map the print()
function to a vector of characters:
map(rep("hello", 10), print)
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [[1]]
## [1] "hello"
##
## [[2]]
## [1] "hello"
##
## [[3]]
## [1] "hello"
##
## [[4]]
## [1] "hello"
##
## [[5]]
## [1] "hello"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] "hello"
##
## [[8]]
## [1] "hello"
##
## [[9]]
## [1] "hello"
##
## [[10]]
## [1] "hello"
rep()
is a function that creates a vector by repeating something, in this case the string “hello”,
as many times as needed, here 10. The output here is a bit different that before though, because first
you will see “hello” printed 10 times, but map()
always returns a list, this means that you
will also get a list where each element is the string “hello”.
We know the standard map()
function, which returns a list, but there are a number of
variants of this function. map_dbl()
returns an atomic vector of doubles:
map_dbl(numbers, sqrt_newton, init = 1)
## [1] 2.645767 2.828469 4.358902 8.000002
map_chr()
returns an atomic vector of strings:
map_chr(numbers, sqrt_newton, init = 1)
## [1] "2.645767" "2.828469" "4.358902" "8.000002"
map_lgl()
returns an atomic vector of TRUE
or FALSE
:
divisible <- function(x, y){
if_else(x %% y == 0, TRUE, FALSE)
}
map_lgl(seq(1:100), divisible, 3)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [12] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [23] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## [34] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [45] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [56] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## [67] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [78] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [89] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## [100] FALSE
There are also other interesting variants, such as map_if()
:
a <- seq(1,10)
map_if(a, (function(x) divisible(x, 2)), sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 7
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 9
##
## [[10]]
## [1] 3.162278
I used map_if()
to take the square root of only those numbers in vector a
that are divisble by 2,
by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()
).
map_at()
is similar to map_if()
but maps the function at a position specified by the user:
map_at(numbers, c(1, 3), sqrt)
## [[1]]
## [1] 2.645751
##
## [[2]]
## [1] 8
##
## [[3]]
## [1] 4.358899
##
## [[4]]
## [1] 64
or if you have a named list:
recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)
map_at(recipe, "bacon", `*`, 2)
## $spam
## [1] 1
##
## $eggs
## [1] 3
##
## $bacon
## [1] 20
I used map_at()
to double the quantity of bacon in the recipe (by using the *
function, and specifying
its second argument, 2
. Try the following in the command prompt: `*`(3, 4)
).
map2()
is the equivalent of mapply()
and pmap()
is the generalisation of map2()
for more
than 2 arguments:
print(a)
## [1] 1 2 3 4 5 6 7 8 9 10
b <- seq(1, 2, length.out = 10)
print(b)
## [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667
## [8] 1.777778 1.888889 2.000000
map2(a, b, `*`)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2.222222
##
## [[3]]
## [1] 3.666667
##
## [[4]]
## [1] 5.333333
##
## [[5]]
## [1] 7.222222
##
## [[6]]
## [1] 9.333333
##
## [[7]]
## [1] 11.66667
##
## [[8]]
## [1] 14.22222
##
## [[9]]
## [1] 17
##
## [[10]]
## [1] 20
n <- seq(1:10)
pmap(list(a, b, n), rnorm)
## [[1]]
## [1] -0.3390433
##
## [[2]]
## [1] 2.9112815 0.6336281
##
## [[3]]
## [1] -0.3468364 -5.0780178 4.8137809
##
## [[4]]
## [1] 2.8895890 -0.6554886 -3.3664595 1.7246887
##
## [[5]]
## [1] 13.510076 11.967669 5.364994 -1.626470 2.279362
##
## [[6]]
## [1] 1.9960271 -5.9273471 10.2248510 6.4238841 -0.9296016 -0.6230772
##
## [[7]]
## [1] -3.56441591 9.36087823 -2.18931874 11.29486286 -7.70610720 12.14668245
## [7] 0.05655167
##
## [[8]]
## [1] -9.0300679 -15.5087408 -0.5029011 -6.7685115 -0.8930143 8.6610726
## [7] 4.3739079 -1.5743869
##
## [[9]]
## [1] -2.953657 9.719432 -5.994068 -3.860638 -4.197553 4.104475 3.926871
## [8] 11.128557 3.842179
##
## [[10]]
## [1] 5.099272 -5.734222 -5.000404 8.345451 9.945886 20.915801
## [7] 13.787797 -13.270184 15.504936 16.635813
8.3.2 Reducing with purrr
In the purrr
package, you can find two more functions for folding: reduce()
and
reduce_right()
. The difference between reduce()
and reduce_right()
is pretty obvious:
reduce_right()
starts from the right!
For operations that are not commutative, this makes a difference. Other interesting folding
functions are accumulate()
and accumulate_right()
:
a <- seq(1, 10)
accumulate(a, `-`)
accumulate_right(a, `-`)
These two functions keep the intermediary results.
In the previous chapter, we wrote a loop to compute the sum of the 100 first integers. We can do
the same with purrr::reduce()
:
result = reduce(seq(1,100), `+`)
print(result)
## [1] 5050
You certainly agree with me that is simpler to understand. You can even see what happens in more
detail using accumulate
:
accumulate(seq(1, 100), `+`)
8.3.3 safely()
and possibly()
safely()
and possibly()
are very useful functions. Consider the following situation:
a <- list("a", 4, 5)
sqrt(a)
Error in sqrt(a) : non-numeric argument to mathematical function
Using map()
or Map()
will result in a similar error. safely()
is an higher-order function that
takes one function as an argument and executes it… safely, meaning the execution of the function
will not stop if there is an error. The error message gets captured alongside valid results.
a <- list("a", 4, 5)
safe_sqrt <- safely(sqrt)
map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## NULL
##
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
##
##
## [[2]]
## [[2]]$result
## [1] 2
##
## [[2]]$error
## NULL
##
##
## [[3]]
## [[3]]$result
## [1] 2.236068
##
## [[3]]$error
## NULL
possibly()
works similarly, but also allows you to specify a return value in case of an error:
possible_sqrt <- possibly(sqrt, otherwise = NA_real_)
map(a, possible_sqrt)
## [[1]]
## [1] NA
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 2.236068
Of course, in this particular example, the same effect could be obtained way more easily:
sqrt(as.numeric(a))
## Warning: NAs introduced by coercion
## [1] NA 2.000000 2.236068
However, in some situations, this trick does not work as intended (or at all), so possibly()
and
safely()
are the way to go.
8.3.4 is_*()
and as_*()
functions
Remember in Chapter 3, when I introduced is.*()
and as.*()
functions? I told you then that we
were going to learn about is_*()
and as_*()
in Chapter 9. This is it!
8.3.5 «Transposing lists»
Another interesting function is transpose()
. It is not an alternative to the function t()
from
base
but, has a similar effect. transpose()
works on lists. Let’s take a look at the example
from before:
safe_sqrt <- safely(sqrt, otherwise = NA_real_)
map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] NA
##
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
##
##
## [[2]]
## [[2]]$result
## [1] 2
##
## [[2]]$error
## NULL
##
##
## [[3]]
## [[3]]$result
## [1] 2.236068
##
## [[3]]$error
## NULL
The output is a list with the first element being a list with a result and an error message. One
might want to have all the results in a single list, and all the error messages in another list.
This is possible with transpose()
:
purrr::transpose(map(a, safe_sqrt))
## $result
## $result[[1]]
## [1] NA
##
## $result[[2]]
## [1] 2
##
## $result[[3]]
## [1] 2.236068
##
##
## $error
## $error[[1]]
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
##
## $error[[2]]
## NULL
##
## $error[[3]]
## NULL
I explicitely call purrr::transpose()
because there is also a data.table::transpose()
, which
is not the same function. You have to be careful about that sort of thing, because it can cause
errors in your programs and debuging this type of error is a nightmare.
8.4 Using your functions inside mutate()
Once you wrote a function, you can easily use it inside a pipe workflow:
double_number <- function(x){
x+x
}
mtcars %>%
mutate(double_mpg = double_number(mpg))
## mpg cyl disp hp drat wt qsec vs am gear carb double_mpg
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 38.4
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 35.6
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 32.8
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 34.6
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 30.4
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 20.8
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 20.8
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 29.4
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 64.8
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 60.8
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 67.8
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 43.0
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 31.0
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 30.4
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 26.6
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 38.4
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 54.6
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 52.0
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 60.8
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 31.6
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 39.4
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 30.0
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 42.8
Granted this example is stupid, but it shows you, again, that functions you define are nothing special. You can use them just as any other.
You can also avoid to define a function altogether, especially if you need an operation only once,
by using the .
like this:
mtcars %>%
mutate(double_mpg = .$mpg + .$mpg)
## mpg cyl disp hp drat wt qsec vs am gear carb double_mpg
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 38.4
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 35.6
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 32.8
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 34.6
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 30.4
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 20.8
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 20.8
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 29.4
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 64.8
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 60.8
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 67.8
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 43.0
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 31.0
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 30.4
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 26.6
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 38.4
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 54.6
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 52.0
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 60.8
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 31.6
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 39.4
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 30.0
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 42.8
8.5 Working with a list of datasets
8.5.1 Using map()
to work on lists of datasets
This is our first encouter with a typical functional programming function, map()
.
Let’s read the list of datasets from the previous chapter:
paths <- Sys.glob("datasets/unemployment/*.csv")
all_datasets <- import_list(paths)
str(all_datasets)
## List of 4
## $ unemp_2013:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ unemp_2014:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ unemp_2015:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ unemp_2016:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
For working with lists, another package from the tidyverse
that is very useful, is called
purrr
. purrr
will be presented in-depth in Chapter 9.
The first thing we are going to do is use a function to clean the names of the datasets. These
names are not very easy to work with; there are spaces, and it would be better if the names of the
columns would be all lowercase. For this we are going to use the function clean_names()
from the
janitor
package. For a single dataset, I would write this:
library(janitor)
one_dataset = one_dataset %>%
clean_names()
and I would get a dataset with column names in lowercase and spaces replaced by _
(and other
corrections). How can I apply, or map, this function to each dataset in the list? To do this I need
to use purrr::map()
:
library(purrr)
all_datasets <- all_datasets %>%
map(clean_names)
all_datasets %>%
glimpse()
## List of 4
## $ unemp_2013:'data.frame': 118 obs. of 8 variables:
## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ total_employed_population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## ..$ of_which_wage_earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## ..$ of_which_non_wage_earners : int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
## ..$ unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
## ..$ active_population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## ..$ unemployment_rate_in_percent: num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
## ..$ year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ unemp_2014:'data.frame': 118 obs. of 8 variables:
## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ total_employed_population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of_which_wage_earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
## ..$ of_which_non_wage_earners : int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
## ..$ unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
## ..$ active_population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
## ..$ unemployment_rate_in_percent: num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
## ..$ year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ unemp_2015:'data.frame': 118 obs. of 8 variables:
## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ total_employed_population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
## ..$ of_which_wage_earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
## ..$ of_which_non_wage_earners : int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
## ..$ unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
## ..$ active_population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
## ..$ unemployment_rate_in_percent: num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
## ..$ year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ unemp_2016:'data.frame': 118 obs. of 8 variables:
## ..$ commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ total_employed_population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
## ..$ of_which_wage_earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
## ..$ of_which_non_wage_earners : int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
## ..$ unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
## ..$ active_population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
## ..$ unemployment_rate_in_percent: num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
## ..$ year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this:
one_dataset %>%
filter(unemployment_rate_in_percent < 3)
But for a list of datasets, map()
is needed (and as you will see, that is not all that is needed):
all_datasets %>%
map(~filter(., unemployment_rate_in_percent < 3))
## $unemp_2013
## commune total_employed_population of_which_wage_earners
## 1 Garnich 844 750
## 2 Leudelange 1064 937
## 3 Bech 526 463
## of_which_non_wage_earners unemployed active_population
## 1 94 25 869
## 2 127 32 1096
## 3 63 16 542
## unemployment_rate_in_percent year
## 1 2.876870 2013
## 2 2.919708 2013
## 3 2.952030 2013
##
## $unemp_2014
## commune total_employed_population of_which_wage_earners
## 1 Garnich 845 757
## 2 Leudelange 1102 965
## 3 Bech 543 476
## 4 Flaxweiler 879 789
## of_which_non_wage_earners unemployed active_population
## 1 88 19 864
## 2 137 34 1136
## 3 67 15 558
## 4 90 27 906
## unemployment_rate_in_percent year
## 1 2.199074 2014
## 2 2.992958 2014
## 3 2.688172 2014
## 4 2.980132 2014
##
## $unemp_2015
## commune total_employed_population of_which_wage_earners
## 1 Bech 520 450
## 2 Bous 750 680
## of_which_non_wage_earners unemployed active_population
## 1 70 14 534
## 2 70 22 772
## unemployment_rate_in_percent year
## 1 2.621723 2015
## 2 2.849741 2015
##
## $unemp_2016
## commune total_employed_population of_which_wage_earners
## 1 Reckange-sur-Mess 980 850
## 2 Bech 520 450
## 3 Betzdorf 1500 1350
## 4 Flaxweiler 910 820
## of_which_non_wage_earners unemployed active_population
## 1 130 30 1010
## 2 70 11 531
## 3 150 45 1545
## 4 90 24 934
## unemployment_rate_in_percent year
## 1 2.970297 2016
## 2 2.071563 2016
## 3 2.912621 2016
## 4 2.569593 2016
I know what you’re thinking… what the hell?. Let me explain: map()
needs a function to map to
each element of the list. all_datasets
is the list to which I want to map the function. But what
function? filter()
is the function I need, so why doesn’t:
all_datasets %>%
map(filter(unemployment_rate_in_percent < 3))
work? This is a bit complicated, and has to do with what is called environments. If you try to run the code above, you will get this error message:
Error in filter(unemployment_rate_in_percent < 3) :
object 'unemployment_rate_in_percent' not found
I won’t go into details, but by writing ~filter(., unemployment_rate_in_percent < 3)
, which is a
formula (~
is the symbol to define formulas, more on this in the later chapters), map()
converts it to a function that it can use. If you want to know more about this, you can read it in
Advanced R by Hadley Wickham, but it
is an advanced topic.
8.6 Mapping your homebrewed functions to lists of datasets
Before merging these datasets together, we would need them to have a year
column indicating the
year. It would also be helpful if gave names to these datasets. For this task, we can use
purrr::set_names()
:
all_datasets = set_names(all_datasets, as.character(seq(2013, 2016)))
Let’s take a look at the list now:
str(all_datasets)
As you can see, each data.frame
object contained in the list has been renamed. You can thus
access them with the $
operator:
8.6.1 Data frames and reduce
Using map()
we now know how to apply a function to each dataset of a list. But maybe it would be
easier to merge all the datasets first, and then manipulate them? Before that though, I am going to
teach you how to use purrr::reduce()
, another very powerful function that works on lists. This is
a function that you can find in other programming languages, but sometimes it is called fold.
I think that the following example illustrates the power of reduce()
well:
numbers = seq(1, 5) # Create a vector with the numbers 1 to 5
reduce(numbers, `+`, .init = 0)
## [1] 15
reduce()
takes a function as an argument, here the function +
7 and then does the following
computation:
0 + numbers[1] + numbers[2] + numbers[3]...
It applies the user supplied function successively but has to start with something, so we give it
the argument init
also. This argument is actually optional, but I show it here because in some
cases it might be useful to start the computations at another value than 0
.reduce()
generalizes functions that only take two arguments. If you were to write a function that returns
the minimum between two numbers:
my_min = function(a, b){
if(a < b){
return(a)
} else {
return(b)
}
}
You could use reduce()
to get the minimum of a list of numbers:
numbers2 = c(3, 1, -8, 9)
reduce(numbers2, my_min)
## [1] -8
As long as you provide a function and a list of elements to reduce()
, you will get a single
output. So how could reduce()
help us with merging all the datasets that are in the list? dplyr
comes with a lot of function to merge two datasets. Remember that I said before that reduce()
allows you to generalize a function of two arguments? Let’s try it with our list of datasets:
unemp_lux = reduce(all_datasets, full_join)
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
glimpse(unemp_lux)
## Observations: 472
## Variables: 8
## $ commune <chr> "Grand-Duche de Luxembourg", "Canto…
## $ total_employed_population <int> 223407, 17802, 1703, 844, 1431, 409…
## $ of_which_wage_earners <int> 203535, 15993, 1535, 750, 1315, 380…
## $ of_which_non_wage_earners <int> 19872, 1809, 168, 94, 116, 294, 272…
## $ unemployed <int> 19287, 1071, 114, 25, 74, 261, 98, …
## $ active_population <int> 242694, 18873, 1817, 869, 1505, 435…
## $ unemployment_rate_in_percent <dbl> 7.947044, 5.674773, 6.274078, 2.876…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013,…
full_join()
is one of the dplyr
function that merges data. There are others that might be
useful depending on the kind of join operation you need. Let’s write this data to disk as we’re
going to keep using it for the next chapters:
export(unemp_lux, "datasets/unemp_lux.csv")
8.7 Functional programming and plotting
In this section, we are going to learn how to use the possibilities offered by the purrr
package
and how it can work together with ggplot2
to generate many plots. This is a more advanced topic,
but what comes next is also what makes R, and the functional programming paradigm so powerful.
For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot, per commune:
unemp_lux_data %>%
filter(division == "Luxembourg") %>%
ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
theme_minimal() +
labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") +
geom_line()
and then you would write the same for “Esch-sur-Alzette” and also for “Wiltz”. If you only have to make to make these 3 plots, copy and pasting the above lines is no big deal:
unemp_lux_data %>%
filter(division == "Esch-sur-Alzette") %>%
ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
theme_minimal() +
labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
geom_line()
unemp_lux_data %>%
filter(division == "Wiltz") %>%
ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
theme_minimal() +
labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
geom_line()
Put copy and pasting is error prone. Can you spot the copy-paste mistake I made? And what if you
have to create the above plots for all 108 Luxembourguish communes? That’s a lot of copy pasting.
What if, once you are done copy pasting, you have to change something, for example, the theme? You
could use the search and replace function of RStudio, true, but sometimes search and replace can
also introduce bugs and typos. You can avoid all these issues by using purrr::map()
. What do you
need to map over? The commune names. So let’s create a vector of commune names:
communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz")
Now we can create the graphs using map()
, or map2()
to be exact:
plots_tibble = unemp_lux_data %>%
filter(division %in% communes) %>%
group_by(division) %>%
nest() %>%
mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) +
theme_minimal() +
geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
labs(title = paste("Unemployment in", .y))))
Let’s study this line by line: the first line is easy, we simply use filter()
to keep only the
communes we are interested in. Then we group by division
and use tidyr::nest()
. As a refresher,
let’s take a look at what this does:
unemp_lux_data %>%
filter(division %in% communes) %>%
group_by(division) %>%
nest()
## # A tibble: 3 x 2
## division data
## <chr> <list>
## 1 Esch-sur-Alzette <tibble [15 × 7]>
## 2 Luxembourg <tibble [15 × 7]>
## 3 Wiltz <tibble [15 × 7]>
This creates a tibble with two columns, division
and data
, where each individual (or
commune in this case) is another tibble with all the original variables. This is very useful,
because now we can pass these tibbles to map2()
, to generate the plots. But why map2()
and
what’s the difference with map()
? map2()
works the same way as map()
, but maps over two
inputs:
numbers1 <- list(1, 2, 3, 4, 5)
numbers2 <- list(9, 8, 7, 6, 5)
map2(numbers1, numbers2, `*`)
## [[1]]
## [1] 9
##
## [[2]]
## [1] 16
##
## [[3]]
## [1] 21
##
## [[4]]
## [1] 24
##
## [[5]]
## [1] 25
In our example with the graphs, the two inputs are the data, and the names of the communes. This is
useful to create the title with labs(title = paste("Unemployment in", .y))))
where .y
is the
second input of map2()
, the commune names contained in variable division
.
So what happened? We now have a tibble called plots_tibble
that looks like this:
print(plots_tibble)
## # A tibble: 3 x 3
## division data plot
## <chr> <list> <list>
## 1 Esch-sur-Alzette <tibble [15 × 7]> <S3: gg>
## 2 Luxembourg <tibble [15 × 7]> <S3: gg>
## 3 Wiltz <tibble [15 × 7]> <S3: gg>
This tibble contains three columns, division
, data
and now a new one called plot
, that we
created before using the last line mutate(plot = ...)
(remember that mutate()
adds columns to
tibbles). plot
is a list-column, with elements… being plots! Yes you read that right, the
elements of the column plot
are literally plots. This is what I meant with list columns.
Let’s see what is inside the data
and the plot
columns exactly:
plots_tibble %>%
pull(data)
## [[1]]
## # A tibble: 15 x 7
## year active_populati… of_which_non_wa… of_which_wage_e…
## <int> <dbl> <dbl> <dbl>
## 1 2001 11.3 665 10.1
## 2 2002 11.7 677 10.3
## 3 2003 11.7 674 10.2
## 4 2004 12.2 659 10.6
## 5 2005 11.9 654 10.3
## 6 2006 12.2 657 10.5
## 7 2007 12.6 634 10.9
## 8 2008 12.9 638 11.0
## 9 2009 13.2 652 11.0
## 10 2010 13.6 638 11.2
## 11 2011 13.9 630 11.5
## 12 2012 14.3 684 11.8
## 13 2013 14.8 694 12.0
## 14 2014 15.2 703 12.5
## 15 2015 15.3 710 12.6
## # … with 3 more variables: total_employed_population <dbl>,
## # unemployed <dbl>, unemployment_rate_in_percent <dbl>
##
## [[2]]
## # A tibble: 15 x 7
## year active_populati… of_which_non_wa… of_which_wage_e…
## <int> <dbl> <dbl> <dbl>
## 1 2001 34.4 2.89 30.4
## 2 2002 34.8 2.94 30.3
## 3 2003 35.2 3.03 30.1
## 4 2004 35.6 3.06 30.1
## 5 2005 35.6 3.13 29.8
## 6 2006 35.5 3.12 30.3
## 7 2007 36.1 3.25 31.1
## 8 2008 37.5 3.39 31.9
## 9 2009 37.9 3.49 31.6
## 10 2010 38.6 3.54 32.1
## 11 2011 40.3 3.66 33.6
## 12 2012 41.8 3.81 34.6
## 13 2013 43.4 3.98 35.5
## 14 2014 44.6 4.11 36.7
## 15 2015 45.2 4.14 37.5
## # … with 3 more variables: total_employed_population <dbl>,
## # unemployed <dbl>, unemployment_rate_in_percent <dbl>
##
## [[3]]
## # A tibble: 15 x 7
## year active_populati… of_which_non_wa… of_which_wage_e…
## <int> <dbl> <dbl> <dbl>
## 1 2001 2.13 223 1.79
## 2 2002 2.14 220 1.78
## 3 2003 2.18 223 1.79
## 4 2004 2.24 227 1.85
## 5 2005 2.26 229 1.85
## 6 2006 2.20 206 1.82
## 7 2007 2.27 198 1.88
## 8 2008 2.30 200 1.90
## 9 2009 2.36 201 1.94
## 10 2010 2.42 195 1.97
## 11 2011 2.48 190 2.02
## 12 2012 2.59 188 2.10
## 13 2013 2.66 195 2.15
## 14 2014 2.69 185 2.19
## 15 2015 2.77 180 2.27
## # … with 3 more variables: total_employed_population <dbl>,
## # unemployed <dbl>, unemployment_rate_in_percent <dbl>
each element of data is a tibble for the specific country with columns year
, active_population
,
etc, the original columns. But obviously, there is no division
column. So to plot the data, and
join all the dots together, we need to add group = 1
in the call to ggplot2()
(whereas if you
plot multiple lines in the same graph, you need to write group = division
).
But more interestingly, how can you actually see the plots? If you want to simply look at them, it
is enough to use pull()
:
plots_tibble %>%
pull(plot)
## [[1]]
##
## [[2]]
##
## [[3]]
And if we want to save these plots, we can do so using map2()
:
map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)
Saving 7 x 5 in image
Saving 6.01 x 3.94 in image
Saving 6.01 x 3.94 in image
This was probably the most advanced topic we have studied yet; but you probably agree with me that
it is among the most useful ones. This section is a perfect illustration of the power of functional
programming; you can mix and match functions as long as you give them the correct arguments.
You can pass data to functions that use data and then pass these functions to other functions that
use functions as arguments, such as map()
.8 map()
does not care if the functions you pass to it produces tables,
graphs or even another function. map()
will simply map this function to a list of inputs, and as
long as these inputs are correct arguments to the function, map()
will do its magic. If you
combine this with list-columns, you can even use map()
alongside dplyr
functions and map your
function by first grouping, filtering, etc…
8.8 Exercises
Suppose you have an Excel workbook that contains data on three sheets. Create a function that reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one sheet (download the example Excel workbook,
example_workbook.xlsx
, from theassets
folder on the books github).Use one of the
map()
functions to combine two lists into one. Consider the following two lists:
mediterranean <- list("starters" = list("humous", "lasagna"), "dishes" = list("sardines", "olives"))
continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut"))
The result we’d like to have would look like this:
$starters
$starters[[1]]
[1] "humous"
$starters[[2]]
[1] "olives"
$starters[[3]]
[1] "pea soup"
$starters[[4]]
[1] "terrine"
$dishes
$dishes[[1]]
[1] "sardines"
$dishes[[2]]
[1] "lasagna"
$dishes[[3]]
[1] "frikadelle"
$dishes[[4]]
[1] "sauerkraut"