Chapter 8 Functional programming

By now, you are pretty familiar with functions in R. Now, let’s introduce the functional programming paradigm, but first, let’s go over some formal definitions.

8.1 Functions definition

You should be familiar with function definitions in R. For example, suppose you want to compute the square root of a number and want to do so using Newton’s algorithm:

sqrt_newton <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
    }
    return(init)
}

You can then use this function to get the square root of a number:

sqrt_newton(16, 2)
## [1] 4.00122

We are using a while loop inside the body. The body of a function are the instructions that define the function. You can get the body of a function with body(some_func) of the function. In pure functional programming languages, like Haskell, you don’t have loops. How can you program without loops, you may ask? In functional programming, loops are replaced by recursion, which we already discussed in the previous chapter. Let’s rewrite our little example above with recursion:

sqrt_newton_recur <- function(a, init, eps = 0.01){
    if(abs(init**2 - a) < eps){
        result <- init
    } else {
        init <- 1/2 * (init + a/init)
        result <- sqrt_newton_recur(a, init, eps)
    }
    return(result)
}
sqrt_newton_recur(16, 2)
## [1] 4.00122

R is not a pure functional programming language though, so we can still use loops (be it while or for loops) in the bodies of our functions. As discussed in the previous chapter, it is actually better, performance-wise, to use loops instead of recursion, because R is not tail-call optimized. I won’t got into the details of what tail-call optimization is but just remember that if performance is important a loop will be faster. However, sometimes, it is easier to write a function using recursion. I personally tend to avoid loops if performance is not important, because I find that code that avoids loops is easier to read and debug. However, knowing that you can use loops is reassuring. In the coming sections I will show you some built-in functions that make it possible to avoid writing loops and that don’t rely on recursion, so performance won’t be penalized.

8.2 Properties of functions

Mathematical functions have a nice property: we always get the same output for a given input. This is called referential transparency and we should aim to write our R functions in such a way.

For example, the following function:

increment <- function(x){
    return(x + 1)
}

Is a referential transparent function. We always get the same result for any x that we give to this function.

This:

increment(10)
## [1] 11

will always produce 11.

However, this one:

increment_opaque <- function(x){
    return(x + spam)
}

is not a referential transparent function, because its value depends on the global variable spam.

spam <- 1

increment_opaque(10)
## [1] 11

will only produce 11 if spam = 1. But what if spam = 19?

spam <- 19

increment_opaque(10)
## [1] 29

To make increment_opaque() a referential transparent function, it is enough to make spam an argument:

increment_not_opaque <- function(x, spam){
    return(x + spam)
}

Now even if there is a global variable called spam, this will not influence our function:

spam <- 19

increment_not_opaque(10, 34)
## [1] 44

This is because the variable spam defined in the body of the function is a local variable. It could have been called anything else, really. Avoiding opaque functions makes our life easier.

Another property that adepts of functional programming value is that functions should have no, or very limited, side-effects. This means that functions should not change the state of your program.

For example this function (which is not a referential transparent function):

count_iter <- 0

sqrt_newton_side_effect <- function(a, init, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <<- count_iter + 1 # The "<<-" symbol means that we assign the
    }                                 # RHS value in a variable in the global environment
    return(init)
}

If you look in the environment pane, you will see that count_iter equals 0. Now call this function with the following arguments:

sqrt_newton_side_effect(16000, 2)
## [1] 126.4911
print(count_iter)
## [1] 9

If you check the value of count_iter now, you will see that it increased! This is a side effect, because the function changed something outside of its scope. It changed a value in the global environment. In general, it is good practice to avoid side-effects. For example, we could make the above function not have any side effects like this:

sqrt_newton_count <- function(a, init, count_iter = 0, eps = 0.01){
    while(abs(init**2 - a) > eps){
        init <- 1/2 *(init + a/init)
        count_iter <- count_iter + 1
    }
    return(c(init, count_iter))
}

Now, this function returns a list with two elements, the result, and the number of iterations it took to get the result:

sqrt_newton_count(16000, 2)
## [1] 126.4911   9.0000

Writing to disk is also considered a side effect, because the function changes something (a file) outside its scope. But this cannot be avoided since you want to write to disk. Just remember: try to avoid having functions changing variables in the global environment unless you have a very good reason of doing so.

Finally, another property of mathematical functions, is that they do one single thing. Functional programming purists also program their functions to do one single task. This has benefits, but can complicate things. The function we wrote previously does two things: it computes the square root of a number and also returns the number of iterations it took to compute the result. However, this is not a bad thing; the function is doing two tasks, but these tasks are related to each other and it makes sense to have them together. My piece of advice: avoid having functions that do many unrelated things. This makes debugging harder.

In conclusion: you should strive for referential transparency, try to avoid side effects unless you have a good reason to have them and try to keep your functions short and do as little tasks as possible. This makes testing and debugging easier, as you will see.

8.3 Functional programming with {purrr}

Hadley Wickham developed a package called purrr which contains a lot of very useful functions.

8.3.1 The map*() family of functions

In the previous section we saw how to map a function to each element of a list. Each version of an *apply() function has a different purpose, but it is not very easy to remember which one returns a list, which other one returns an atomic vector and so on. If you’re working on data frames you can use apply() to sum (for example) over columns or rows, because you can specify which MARGIN you want to sum over. But you do not get a data frame back. In the purrr package, each of the functions that do mapping have a similar name. The first part of these functions’ names all start with map_ and the second part tells you what this function is going to output. For example, if you want doubles out, you would use map_dbl(). If you are working on data frames want a data frame back, you would use map_df(). These are much more intuitive and easier to remember and we’re going to learn how to use them in the chapter about The Tidyverse. For now, let’s just focus on the basic functions, map() and reduce() (and some variants of reduce()). To map a function to every element of a list, simply use map():

library("purrr")
numbers <- c(7, 8, 19, 64)

map(numbers, sqrt_newton, init = 1)
## [[1]]
## [1] 2.645767
## 
## [[2]]
## [1] 2.828469
## 
## [[3]]
## [1] 4.358902
## 
## [[4]]
## [1] 8.000002

If you want print “hello” using a function from purrr you would need to use rerun():

rerun(10, "hello")
## [[1]]
## [1] "hello"
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] "hello"
## 
## [[4]]
## [1] "hello"
## 
## [[5]]
## [1] "hello"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] "hello"
## 
## [[8]]
## [1] "hello"
## 
## [[9]]
## [1] "hello"
## 
## [[10]]
## [1] "hello"

rerun() simply runs an expression (which can be arbitrarily complex) n times, whereas map() maps a function to a list of inputs, so to achieve the same with map(), you need to map the print() function to a vector of characters:

map(rep("hello", 10), print)
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [[1]]
## [1] "hello"
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] "hello"
## 
## [[4]]
## [1] "hello"
## 
## [[5]]
## [1] "hello"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] "hello"
## 
## [[8]]
## [1] "hello"
## 
## [[9]]
## [1] "hello"
## 
## [[10]]
## [1] "hello"

rep() is a function that creates a vector by repeating something, in this case the string “hello”, as many times as needed, here 10. The output here is a bit different that before though, because first you will see “hello” printed 10 times, but map() always returns a list, this means that you will also get a list where each element is the string “hello”.

We know the standard map() function, which returns a list, but there are a number of variants of this function. map_dbl() returns an atomic vector of doubles:

map_dbl(numbers, sqrt_newton, init = 1)
## [1] 2.645767 2.828469 4.358902 8.000002

map_chr() returns an atomic vector of strings:

map_chr(numbers, sqrt_newton, init = 1)
## [1] "2.645767" "2.828469" "4.358902" "8.000002"

map_lgl() returns an atomic vector of TRUE or FALSE:

divisible <- function(x, y){
  if_else(x %% y == 0, TRUE, FALSE)
}

map_lgl(seq(1:100), divisible, 3)
##   [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
##  [12]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
##  [23] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [34] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
##  [45]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
##  [56] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [67] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
##  [78]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
##  [89] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
## [100] FALSE

There are also other interesting variants, such as map_if():

a <- seq(1,10)

map_if(a, (function(x) divisible(x, 2)), sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 3.162278

I used map_if() to take the square root of only those numbers in vector a that are divisble by 2, by using an anonymous function that checks if a number is divisible by 2 (by wrapping divisible()).

map_at() is similar to map_if() but maps the function at a position specified by the user:

map_at(numbers, c(1, 3), sqrt)
## [[1]]
## [1] 2.645751
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 4.358899
## 
## [[4]]
## [1] 64

or if you have a named list:

recipe <- list("spam" = 1, "eggs" = 3, "bacon" = 10)

map_at(recipe, "bacon", `*`, 2)
## $spam
## [1] 1
## 
## $eggs
## [1] 3
## 
## $bacon
## [1] 20

I used map_at() to double the quantity of bacon in the recipe (by using the * function, and specifying its second argument, 2. Try the following in the command prompt: `*`(3, 4)).

map2() is the equivalent of mapply() and pmap() is the generalisation of map2() for more than 2 arguments:

print(a)
##  [1]  1  2  3  4  5  6  7  8  9 10
b <- seq(1, 2, length.out = 10)

print(b)
##  [1] 1.000000 1.111111 1.222222 1.333333 1.444444 1.555556 1.666667
##  [8] 1.777778 1.888889 2.000000
map2(a, b, `*`)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2.222222
## 
## [[3]]
## [1] 3.666667
## 
## [[4]]
## [1] 5.333333
## 
## [[5]]
## [1] 7.222222
## 
## [[6]]
## [1] 9.333333
## 
## [[7]]
## [1] 11.66667
## 
## [[8]]
## [1] 14.22222
## 
## [[9]]
## [1] 17
## 
## [[10]]
## [1] 20
n <- seq(1:10)

pmap(list(a, b, n), rnorm)
## [[1]]
## [1] -0.3390433
## 
## [[2]]
## [1] 2.9112815 0.6336281
## 
## [[3]]
## [1] -0.3468364 -5.0780178  4.8137809
## 
## [[4]]
## [1]  2.8895890 -0.6554886 -3.3664595  1.7246887
## 
## [[5]]
## [1] 13.510076 11.967669  5.364994 -1.626470  2.279362
## 
## [[6]]
## [1]  1.9960271 -5.9273471 10.2248510  6.4238841 -0.9296016 -0.6230772
## 
## [[7]]
## [1] -3.56441591  9.36087823 -2.18931874 11.29486286 -7.70610720 12.14668245
## [7]  0.05655167
## 
## [[8]]
## [1]  -9.0300679 -15.5087408  -0.5029011  -6.7685115  -0.8930143   8.6610726
## [7]   4.3739079  -1.5743869
## 
## [[9]]
## [1] -2.953657  9.719432 -5.994068 -3.860638 -4.197553  4.104475  3.926871
## [8] 11.128557  3.842179
## 
## [[10]]
##  [1]   5.099272  -5.734222  -5.000404   8.345451   9.945886  20.915801
##  [7]  13.787797 -13.270184  15.504936  16.635813

8.3.2 Reducing with purrr

In the purrr package, you can find two more functions for folding: reduce() and reduce_right(). The difference between reduce() and reduce_right() is pretty obvious: reduce_right() starts from the right!

For operations that are not commutative, this makes a difference. Other interesting folding functions are accumulate() and accumulate_right():

a <- seq(1, 10)

accumulate(a, `-`)
accumulate_right(a, `-`)

These two functions keep the intermediary results.

In the previous chapter, we wrote a loop to compute the sum of the 100 first integers. We can do the same with purrr::reduce():

result = reduce(seq(1,100), `+`)

print(result)
## [1] 5050

You certainly agree with me that is simpler to understand. You can even see what happens in more detail using accumulate:

accumulate(seq(1, 100), `+`)

8.3.3 safely() and possibly()

safely() and possibly() are very useful functions. Consider the following situation:

a <- list("a", 4, 5)

sqrt(a)
Error in sqrt(a) : non-numeric argument to mathematical function

Using map() or Map() will result in a similar error. safely() is an higher-order function that takes one function as an argument and executes it… safely, meaning the execution of the function will not stop if there is an error. The error message gets captured alongside valid results.

a <- list("a", 4, 5)

safe_sqrt <- safely(sqrt)

map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## NULL
## 
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## 
## [[2]]
## [[2]]$result
## [1] 2
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 2.236068
## 
## [[3]]$error
## NULL

possibly() works similarly, but also allows you to specify a return value in case of an error:

possible_sqrt <- possibly(sqrt, otherwise = NA_real_)

map(a, possible_sqrt)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 2.236068

Of course, in this particular example, the same effect could be obtained way more easily:

sqrt(as.numeric(a))
## Warning: NAs introduced by coercion
## [1]       NA 2.000000 2.236068

However, in some situations, this trick does not work as intended (or at all), so possibly() and safely() are the way to go.

8.3.4 is_*() and as_*() functions

Remember in Chapter 3, when I introduced is.*() and as.*() functions? I told you then that we were going to learn about is_*() and as_*() in Chapter 9. This is it!

8.3.5 «Transposing lists»

Another interesting function is transpose(). It is not an alternative to the function t() from base but, has a similar effect. transpose() works on lists. Let’s take a look at the example from before:

safe_sqrt <- safely(sqrt, otherwise = NA_real_)

map(a, safe_sqrt)
## [[1]]
## [[1]]$result
## [1] NA
## 
## [[1]]$error
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## 
## [[2]]
## [[2]]$result
## [1] 2
## 
## [[2]]$error
## NULL
## 
## 
## [[3]]
## [[3]]$result
## [1] 2.236068
## 
## [[3]]$error
## NULL

The output is a list with the first element being a list with a result and an error message. One might want to have all the results in a single list, and all the error messages in another list. This is possible with transpose():

purrr::transpose(map(a, safe_sqrt))
## $result
## $result[[1]]
## [1] NA
## 
## $result[[2]]
## [1] 2
## 
## $result[[3]]
## [1] 2.236068
## 
## 
## $error
## $error[[1]]
## <simpleError in .Primitive("sqrt")(x): non-numeric argument to mathematical function>
## 
## $error[[2]]
## NULL
## 
## $error[[3]]
## NULL

I explicitely call purrr::transpose() because there is also a data.table::transpose(), which is not the same function. You have to be careful about that sort of thing, because it can cause errors in your programs and debuging this type of error is a nightmare.

8.4 Using your functions inside mutate()

Once you wrote a function, you can easily use it inside a pipe workflow:

double_number <- function(x){
  x+x
}
mtcars %>%
  mutate(double_mpg = double_number(mpg))
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb double_mpg
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       42.0
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       42.0
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       45.6
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1       42.8
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2       37.4
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1       36.2
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4       28.6
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       48.8
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       45.6
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       38.4
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       35.6
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       32.8
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       34.6
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       30.4
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4       20.8
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4       20.8
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4       29.4
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       64.8
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2       60.8
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1       67.8
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       43.0
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       31.0
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       30.4
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4       26.6
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2       38.4
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1       54.6
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       52.0
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       60.8
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4       31.6
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6       39.4
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       30.0
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       42.8

Granted this example is stupid, but it shows you, again, that functions you define are nothing special. You can use them just as any other.

You can also avoid to define a function altogether, especially if you need an operation only once, by using the . like this:

mtcars %>%
  mutate(double_mpg = .$mpg + .$mpg)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb double_mpg
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       42.0
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       42.0
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       45.6
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1       42.8
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2       37.4
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1       36.2
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4       28.6
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       48.8
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       45.6
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       38.4
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       35.6
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       32.8
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       34.6
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       30.4
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4       20.8
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4       20.8
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4       29.4
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       64.8
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2       60.8
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1       67.8
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       43.0
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       31.0
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       30.4
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4       26.6
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2       38.4
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1       54.6
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       52.0
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       60.8
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4       31.6
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6       39.4
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       30.0
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       42.8

8.5 Working with a list of datasets

8.5.1 Using map() to work on lists of datasets

This is our first encouter with a typical functional programming function, map().

Let’s read the list of datasets from the previous chapter:

paths <- Sys.glob("datasets/unemployment/*.csv")

all_datasets <- import_list(paths)

str(all_datasets)
## List of 4
##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
##   ..$ of which: Wage-earners    : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
##   ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
##   ..$ Unemployed                : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
##   ..$ Active population         : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
##   ..$ Year                      : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ unemp_2014:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
##   ..$ of which: Wage-earners    : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
##   ..$ Unemployed                : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
##   ..$ Active population         : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
##   ..$ Year                      : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ unemp_2015:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
##   ..$ of which: Wage-earners    : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
##   ..$ Unemployed                : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
##   ..$ Active population         : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
##   ..$ Year                      : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ unemp_2016:'data.frame':  118 obs. of  8 variables:
##   ..$ Commune                   : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
##   ..$ of which: Wage-earners    : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
##   ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
##   ..$ Unemployed                : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
##   ..$ Active population         : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
##   ..$ Unemployment rate (in %)  : num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
##   ..$ Year                      : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...

For working with lists, another package from the tidyverse that is very useful, is called purrr. purrr will be presented in-depth in Chapter 9.

The first thing we are going to do is use a function to clean the names of the datasets. These names are not very easy to work with; there are spaces, and it would be better if the names of the columns would be all lowercase. For this we are going to use the function clean_names() from the janitor package. For a single dataset, I would write this:

library(janitor)

one_dataset = one_dataset %>%
  clean_names()

and I would get a dataset with column names in lowercase and spaces replaced by _ (and other corrections). How can I apply, or map, this function to each dataset in the list? To do this I need to use purrr::map():

library(purrr)

all_datasets <- all_datasets %>%
  map(clean_names)

all_datasets %>%
  glimpse()
## List of 4
##  $ unemp_2013:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
##   ..$ of_which_wage_earners       : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
##   ..$ unemployed                  : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
##   ..$ active_population           : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
##   ..$ year                        : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ unemp_2014:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
##   ..$ of_which_wage_earners       : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
##   ..$ unemployed                  : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
##   ..$ active_population           : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
##   ..$ year                        : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ unemp_2015:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
##   ..$ of_which_wage_earners       : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
##   ..$ unemployed                  : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
##   ..$ active_population           : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
##   ..$ year                        : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ unemp_2016:'data.frame':  118 obs. of  8 variables:
##   ..$ commune                     : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
##   ..$ total_employed_population   : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
##   ..$ of_which_wage_earners       : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
##   ..$ of_which_non_wage_earners   : int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
##   ..$ unemployed                  : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
##   ..$ active_population           : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
##   ..$ unemployment_rate_in_percent: num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
##   ..$ year                        : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...

So now, what if I want to know, for each dataset, which communes have an unemployment rate that is less than, say, 3%? For a single dataset I would do something like this:

one_dataset %>%
  filter(unemployment_rate_in_percent < 3)

But for a list of datasets, map() is needed (and as you will see, that is not all that is needed):

all_datasets %>%
  map(~filter(., unemployment_rate_in_percent < 3))
## $unemp_2013
##      commune total_employed_population of_which_wage_earners
## 1    Garnich                       844                   750
## 2 Leudelange                      1064                   937
## 3       Bech                       526                   463
##   of_which_non_wage_earners unemployed active_population
## 1                        94         25               869
## 2                       127         32              1096
## 3                        63         16               542
##   unemployment_rate_in_percent year
## 1                     2.876870 2013
## 2                     2.919708 2013
## 3                     2.952030 2013
## 
## $unemp_2014
##      commune total_employed_population of_which_wage_earners
## 1    Garnich                       845                   757
## 2 Leudelange                      1102                   965
## 3       Bech                       543                   476
## 4 Flaxweiler                       879                   789
##   of_which_non_wage_earners unemployed active_population
## 1                        88         19               864
## 2                       137         34              1136
## 3                        67         15               558
## 4                        90         27               906
##   unemployment_rate_in_percent year
## 1                     2.199074 2014
## 2                     2.992958 2014
## 3                     2.688172 2014
## 4                     2.980132 2014
## 
## $unemp_2015
##   commune total_employed_population of_which_wage_earners
## 1    Bech                       520                   450
## 2    Bous                       750                   680
##   of_which_non_wage_earners unemployed active_population
## 1                        70         14               534
## 2                        70         22               772
##   unemployment_rate_in_percent year
## 1                     2.621723 2015
## 2                     2.849741 2015
## 
## $unemp_2016
##             commune total_employed_population of_which_wage_earners
## 1 Reckange-sur-Mess                       980                   850
## 2              Bech                       520                   450
## 3          Betzdorf                      1500                  1350
## 4        Flaxweiler                       910                   820
##   of_which_non_wage_earners unemployed active_population
## 1                       130         30              1010
## 2                        70         11               531
## 3                       150         45              1545
## 4                        90         24               934
##   unemployment_rate_in_percent year
## 1                     2.970297 2016
## 2                     2.071563 2016
## 3                     2.912621 2016
## 4                     2.569593 2016

I know what you’re thinking… what the hell?. Let me explain: map() needs a function to map to each element of the list. all_datasets is the list to which I want to map the function. But what function? filter() is the function I need, so why doesn’t:

all_datasets %>%
  map(filter(unemployment_rate_in_percent < 3))

work? This is a bit complicated, and has to do with what is called environments. If you try to run the code above, you will get this error message:

Error in filter(unemployment_rate_in_percent < 3) :
  object 'unemployment_rate_in_percent' not found

I won’t go into details, but by writing ~filter(., unemployment_rate_in_percent < 3), which is a formula (~ is the symbol to define formulas, more on this in the later chapters), map() converts it to a function that it can use. If you want to know more about this, you can read it in Advanced R by Hadley Wickham, but it is an advanced topic.

8.6 Mapping your homebrewed functions to lists of datasets

Before merging these datasets together, we would need them to have a year column indicating the year. It would also be helpful if gave names to these datasets. For this task, we can use purrr::set_names():

all_datasets = set_names(all_datasets, as.character(seq(2013, 2016)))

Let’s take a look at the list now:

str(all_datasets)

As you can see, each data.frame object contained in the list has been renamed. You can thus access them with the $ operator:

8.6.1 Data frames and reduce

Using map() we now know how to apply a function to each dataset of a list. But maybe it would be easier to merge all the datasets first, and then manipulate them? Before that though, I am going to teach you how to use purrr::reduce(), another very powerful function that works on lists. This is a function that you can find in other programming languages, but sometimes it is called fold.

I think that the following example illustrates the power of reduce() well:

numbers = seq(1, 5) # Create a vector with the numbers 1 to 5

reduce(numbers, `+`, .init = 0)
## [1] 15

reduce() takes a function as an argument, here the function +7 and then does the following computation:

0 + numbers[1] + numbers[2] + numbers[3]...

It applies the user supplied function successively but has to start with something, so we give it the argument init also. This argument is actually optional, but I show it here because in some cases it might be useful to start the computations at another value than 0.reduce() generalizes functions that only take two arguments. If you were to write a function that returns the minimum between two numbers:

my_min = function(a, b){
    if(a < b){
        return(a)
    } else {
        return(b)
    }
}

You could use reduce() to get the minimum of a list of numbers:

numbers2 = c(3, 1, -8, 9)

reduce(numbers2, my_min)
## [1] -8

As long as you provide a function and a list of elements to reduce(), you will get a single output. So how could reduce() help us with merging all the datasets that are in the list? dplyr comes with a lot of function to merge two datasets. Remember that I said before that reduce() allows you to generalize a function of two arguments? Let’s try it with our list of datasets:

unemp_lux = reduce(all_datasets, full_join)
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
## Joining, by = c("commune", "total_employed_population", "of_which_wage_earners", "of_which_non_wage_earners", "unemployed", "active_population", "unemployment_rate_in_percent", "year")
glimpse(unemp_lux)
## Observations: 472
## Variables: 8
## $ commune                      <chr> "Grand-Duche de Luxembourg", "Canto…
## $ total_employed_population    <int> 223407, 17802, 1703, 844, 1431, 409…
## $ of_which_wage_earners        <int> 203535, 15993, 1535, 750, 1315, 380…
## $ of_which_non_wage_earners    <int> 19872, 1809, 168, 94, 116, 294, 272…
## $ unemployed                   <int> 19287, 1071, 114, 25, 74, 261, 98, …
## $ active_population            <int> 242694, 18873, 1817, 869, 1505, 435…
## $ unemployment_rate_in_percent <dbl> 7.947044, 5.674773, 6.274078, 2.876…
## $ year                         <int> 2013, 2013, 2013, 2013, 2013, 2013,…

full_join() is one of the dplyr function that merges data. There are others that might be useful depending on the kind of join operation you need. Let’s write this data to disk as we’re going to keep using it for the next chapters:

export(unemp_lux, "datasets/unemp_lux.csv")

8.7 Functional programming and plotting

In this section, we are going to learn how to use the possibilities offered by the purrr package and how it can work together with ggplot2 to generate many plots. This is a more advanced topic, but what comes next is also what makes R, and the functional programming paradigm so powerful.

For example, suppose that instead of wanting a single plot with the unemployment rate of each commune, you need one unemployment plot, per commune:

unemp_lux_data %>%
  filter(division == "Luxembourg") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Luxembourg", x = "Year", y = "Rate") +
  geom_line()

and then you would write the same for “Esch-sur-Alzette” and also for “Wiltz”. If you only have to make to make these 3 plots, copy and pasting the above lines is no big deal:

unemp_lux_data %>%
  filter(division == "Esch-sur-Alzette") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

unemp_lux_data %>%
  filter(division == "Wiltz") %>%
  ggplot(aes(year, unemployment_rate_in_percent, group = division)) +
  theme_minimal() +
  labs(title = "Unemployment in Esch-sur-Alzette", x = "Year", y = "Rate") +
  geom_line()

Put copy and pasting is error prone. Can you spot the copy-paste mistake I made? And what if you have to create the above plots for all 108 Luxembourguish communes? That’s a lot of copy pasting. What if, once you are done copy pasting, you have to change something, for example, the theme? You could use the search and replace function of RStudio, true, but sometimes search and replace can also introduce bugs and typos. You can avoid all these issues by using purrr::map(). What do you need to map over? The commune names. So let’s create a vector of commune names:

communes <- list("Luxembourg", "Esch-sur-Alzette", "Wiltz")

Now we can create the graphs using map(), or map2() to be exact:

plots_tibble = unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest() %>%
  mutate(plot = map2(.x = data, .y = division, ~ggplot(data = .x) +
       theme_minimal() +
       geom_line(aes(year, unemployment_rate_in_percent, group = 1)) +
       labs(title = paste("Unemployment in", .y))))

Let’s study this line by line: the first line is easy, we simply use filter() to keep only the communes we are interested in. Then we group by division and use tidyr::nest(). As a refresher, let’s take a look at what this does:

unemp_lux_data %>%
  filter(division %in% communes) %>%
  group_by(division) %>%
  nest()
## # A tibble: 3 x 2
##   division         data             
##   <chr>            <list>           
## 1 Esch-sur-Alzette <tibble [15 × 7]>
## 2 Luxembourg       <tibble [15 × 7]>
## 3 Wiltz            <tibble [15 × 7]>

This creates a tibble with two columns, division and data, where each individual (or commune in this case) is another tibble with all the original variables. This is very useful, because now we can pass these tibbles to map2(), to generate the plots. But why map2() and what’s the difference with map()? map2() works the same way as map(), but maps over two inputs:

numbers1 <- list(1, 2, 3, 4, 5)

numbers2 <- list(9, 8, 7, 6, 5)

map2(numbers1, numbers2, `*`)
## [[1]]
## [1] 9
## 
## [[2]]
## [1] 16
## 
## [[3]]
## [1] 21
## 
## [[4]]
## [1] 24
## 
## [[5]]
## [1] 25

In our example with the graphs, the two inputs are the data, and the names of the communes. This is useful to create the title with labs(title = paste("Unemployment in", .y)))) where .y is the second input of map2(), the commune names contained in variable division.

So what happened? We now have a tibble called plots_tibble that looks like this:

print(plots_tibble)
## # A tibble: 3 x 3
##   division         data              plot    
##   <chr>            <list>            <list>  
## 1 Esch-sur-Alzette <tibble [15 × 7]> <S3: gg>
## 2 Luxembourg       <tibble [15 × 7]> <S3: gg>
## 3 Wiltz            <tibble [15 × 7]> <S3: gg>

This tibble contains three columns, division, data and now a new one called plot, that we created before using the last line mutate(plot = ...) (remember that mutate() adds columns to tibbles). plot is a list-column, with elements… being plots! Yes you read that right, the elements of the column plot are literally plots. This is what I meant with list columns. Let’s see what is inside the data and the plot columns exactly:

plots_tibble %>%
  pull(data)
## [[1]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e…
##    <int>            <dbl>            <dbl>            <dbl>
##  1  2001             11.3              665             10.1
##  2  2002             11.7              677             10.3
##  3  2003             11.7              674             10.2
##  4  2004             12.2              659             10.6
##  5  2005             11.9              654             10.3
##  6  2006             12.2              657             10.5
##  7  2007             12.6              634             10.9
##  8  2008             12.9              638             11.0
##  9  2009             13.2              652             11.0
## 10  2010             13.6              638             11.2
## 11  2011             13.9              630             11.5
## 12  2012             14.3              684             11.8
## 13  2013             14.8              694             12.0
## 14  2014             15.2              703             12.5
## 15  2015             15.3              710             12.6
## # … with 3 more variables: total_employed_population <dbl>,
## #   unemployed <dbl>, unemployment_rate_in_percent <dbl>
## 
## [[2]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e…
##    <int>            <dbl>            <dbl>            <dbl>
##  1  2001             34.4             2.89             30.4
##  2  2002             34.8             2.94             30.3
##  3  2003             35.2             3.03             30.1
##  4  2004             35.6             3.06             30.1
##  5  2005             35.6             3.13             29.8
##  6  2006             35.5             3.12             30.3
##  7  2007             36.1             3.25             31.1
##  8  2008             37.5             3.39             31.9
##  9  2009             37.9             3.49             31.6
## 10  2010             38.6             3.54             32.1
## 11  2011             40.3             3.66             33.6
## 12  2012             41.8             3.81             34.6
## 13  2013             43.4             3.98             35.5
## 14  2014             44.6             4.11             36.7
## 15  2015             45.2             4.14             37.5
## # … with 3 more variables: total_employed_population <dbl>,
## #   unemployed <dbl>, unemployment_rate_in_percent <dbl>
## 
## [[3]]
## # A tibble: 15 x 7
##     year active_populati… of_which_non_wa… of_which_wage_e…
##    <int>            <dbl>            <dbl>            <dbl>
##  1  2001             2.13              223             1.79
##  2  2002             2.14              220             1.78
##  3  2003             2.18              223             1.79
##  4  2004             2.24              227             1.85
##  5  2005             2.26              229             1.85
##  6  2006             2.20              206             1.82
##  7  2007             2.27              198             1.88
##  8  2008             2.30              200             1.90
##  9  2009             2.36              201             1.94
## 10  2010             2.42              195             1.97
## 11  2011             2.48              190             2.02
## 12  2012             2.59              188             2.10
## 13  2013             2.66              195             2.15
## 14  2014             2.69              185             2.19
## 15  2015             2.77              180             2.27
## # … with 3 more variables: total_employed_population <dbl>,
## #   unemployed <dbl>, unemployment_rate_in_percent <dbl>

each element of data is a tibble for the specific country with columns year, active_population, etc, the original columns. But obviously, there is no division column. So to plot the data, and join all the dots together, we need to add group = 1 in the call to ggplot2() (whereas if you plot multiple lines in the same graph, you need to write group = division).

But more interestingly, how can you actually see the plots? If you want to simply look at them, it is enough to use pull():

plots_tibble %>%
  pull(plot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

And if we want to save these plots, we can do so using map2():

map2(paste0(plots_tibble$division, ".pdf"), plots_tibble$plot, ggsave)
Saving 7 x 5 in image
Saving 6.01 x 3.94 in image
Saving 6.01 x 3.94 in image

This was probably the most advanced topic we have studied yet; but you probably agree with me that it is among the most useful ones. This section is a perfect illustration of the power of functional programming; you can mix and match functions as long as you give them the correct arguments. You can pass data to functions that use data and then pass these functions to other functions that use functions as arguments, such as map().8 map() does not care if the functions you pass to it produces tables, graphs or even another function. map() will simply map this function to a list of inputs, and as long as these inputs are correct arguments to the function, map() will do its magic. If you combine this with list-columns, you can even use map() alongside dplyr functions and map your function by first grouping, filtering, etc…

8.8 Exercises

  1. Suppose you have an Excel workbook that contains data on three sheets. Create a function that reads entire workbooks, and that returns a list of tibbles, where each tibble is the data of one sheet (download the example Excel workbook, example_workbook.xlsx, from the assets folder on the books github).

  2. Use one of the map() functions to combine two lists into one. Consider the following two lists:

mediterranean <- list("starters" = list("humous", "lasagna"), "dishes" = list("sardines", "olives"))

continental <- list("starters" = list("pea soup", "terrine"), "dishes" = list("frikadelle", "sauerkraut"))

The result we’d like to have would look like this:

$starters
$starters[[1]]
[1] "humous"

$starters[[2]]
[1] "olives"

$starters[[3]]
[1] "pea soup"

$starters[[4]]
[1] "terrine"


$dishes
$dishes[[1]]
[1] "sardines"

$dishes[[2]]
[1] "lasagna"

$dishes[[3]]
[1] "frikadelle"

$dishes[[4]]
[1] "sauerkraut"

  1. This is simply the + operator you’re used to. Try this out: `+`(1, 5) and you’ll see + is a function like any other. You just have to write backticks around the plus symbol to make it work.

  2. Functions that have other functions as input are called higher order functions