Chapter 4 Reading and writing data
In this chapter, we are going to import example datasets that are available in R, mtcars
and
iris
. I have converted these datasets into several formats. Download those datasets
here if you want to follow the
examples below. R can import some formats without the need of external packages, such as the .csv
format. However, for other formats, you will need to use different packages. Because there are a
lot of different formats available I suggest you use the {rio}
package.
{rio}
is a wrapper around different packages that import/export data in different formats.
This package is nice because you don’t need to remember which package to use to import, say,
STATA datasets and then you need to remember which one for SAS datasets, and so on. Read {rio}
’s
vignette for more details. Below
I show some of {rio}
’s functions presented in the vignette. It is also possible to import data from
other, less “traditional” sources, such as your clipboard. Also note that it is possible to import
more than one dataset at once. There are two ways of doing that, either by importing all the
datasets, binding their rows together and add a new variable with the name of the data, or import
all the datasets into a list, where each element of that list is a data frame. We are going to
explore this second option later.
4.1 The swiss army knife of data import and export: {rio}
To import data with {rio}
, import()
is all you need:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
import()
needs the path to the data, and you can specify additional options if needed. On a
Windows computer, you have to pay attention to the path; you cannot simply copy and paste it, because
paths in Windows use the \
symbol whereas R uses /
(just like on Linux or macOS).
Importing a STATA or a SAS file is done just the same:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
It is also possible to import Excel files where each sheet is a single table, but you will need
import_list()
for that. The file multi.xlsx
has two sheets, each with a table in it:
## List of 2
## $ mtcars:'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ iris :'data.frame': 150 obs. of 5 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## ..$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
As you can see multi
is a list of datasets. Told you lists were very flexible! It is also possible
to import all the datasets in a single directory at once. For this, you first need a vector of paths:
Sys.glob()
allows you to find files using a regular expression. "datasets/unemployment/*.csv"
matches all the .csv
files inside the “datasets/unemployment/” folder.
## List of 4
## $ unemp_2013:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## ..$ of which: Wage-earners : int [1:118] 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## ..$ of which: Non-wage-earners: int [1:118] 19872 1809 168 94 116 294 272 113 189 338 ...
## ..$ Unemployed : int [1:118] 19287 1071 114 25 74 261 98 45 66 207 ...
## ..$ Active population : int [1:118] 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.95 5.67 6.27 2.88 4.92 ...
## ..$ Year : int [1:118] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ unemp_2014:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 228423 18166 1767 845 1505 4129 2172 1007 1268 3124 ...
## ..$ of which: Wage-earners : int [1:118] 208238 16366 1606 757 1390 3840 1897 887 1082 2782 ...
## ..$ of which: Non-wage-earners: int [1:118] 20185 1800 161 88 115 289 275 120 186 342 ...
## ..$ Unemployed : int [1:118] 19362 1066 122 19 66 287 91 38 61 202 ...
## ..$ Active population : int [1:118] 247785 19232 1889 864 1571 4416 2263 1045 1329 3326 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.81 5.54 6.46 2.2 4.2 ...
## ..$ Year : int [1:118] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ unemp_2015:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 233130 18310 1780 870 1470 4130 2170 1050 1300 3140 ...
## ..$ of which: Wage-earners : int [1:118] 212530 16430 1620 780 1350 3820 1910 920 1100 2770 ...
## ..$ of which: Non-wage-earners: int [1:118] 20600 1880 160 90 120 310 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18806 988 106 29 73 260 80 41 72 169 ...
## ..$ Active population : int [1:118] 251936 19298 1886 899 1543 4390 2250 1091 1372 3309 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.46 5.12 5.62 3.23 4.73 ...
## ..$ Year : int [1:118] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ unemp_2016:'data.frame': 118 obs. of 8 variables:
## ..$ Commune : chr [1:118] "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## ..$ Total employed population : int [1:118] 236100 18380 1790 870 1470 4160 2160 1030 1330 3150 ...
## ..$ of which: Wage-earners : int [1:118] 215430 16500 1640 780 1350 3840 1900 900 1130 2780 ...
## ..$ of which: Non-wage-earners: int [1:118] 20670 1880 150 90 120 320 260 130 200 370 ...
## ..$ Unemployed : int [1:118] 18185 975 91 27 66 246 76 35 70 206 ...
## ..$ Active population : int [1:118] 254285 19355 1881 897 1536 4406 2236 1065 1400 3356 ...
## ..$ Unemployment rate (in %) : num [1:118] 7.15 5.04 4.84 3.01 4.3 ...
## ..$ Year : int [1:118] 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
in a subsequent chapter we will learn how to actually use these lists of datasets.
If you know that each dataset in each file has the same colmuns, you can also import them directly
into a single dataset by binding each dataset together using rbind = TRUE
:
## 'data.frame': 472 obs. of 9 variables:
## $ Commune : chr "Grand-Duche de Luxembourg" "Canton Capellen" "Dippach" "Garnich" ...
## $ Total employed population : int 223407 17802 1703 844 1431 4094 2146 971 1218 3002 ...
## $ of which: Wage-earners : int 203535 15993 1535 750 1315 3800 1874 858 1029 2664 ...
## $ of which: Non-wage-earners: int 19872 1809 168 94 116 294 272 113 189 338 ...
## $ Unemployed : int 19287 1071 114 25 74 261 98 45 66 207 ...
## $ Active population : int 242694 18873 1817 869 1505 4355 2244 1016 1284 3209 ...
## $ Unemployment rate (in %) : num 7.95 5.67 6.27 2.88 4.92 ...
## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ _file : chr "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" "datasets/unemployment/unemp_2013.csv" ...
## - attr(*, ".internal.selfref")=<externalptr>
This also add a further column called _file
indicating the name of the file that contained the
original data.
If something goes wrong, you might need to take a look at the underlying function {rio}
is
actually using to import the file. Let’s look at the following example:
## mpg&cyl&disp&hp&drat&wt&qsec&vs&am&gear&carb
## 1 21&6&160&110&3.9&2.62&16.46&0&1&4&4
## 2 21&6&160&110&3.9&2.875&17.02&0&1&4&4
## 3 22.8&4&108&93&3.85&2.32&18.61&1&1&4&1
## 4 21.4&6&258&110&3.08&3.215&19.44&1&0&3&1
## 5 18.7&8&360&175&3.15&3.44&17.02&0&0&3&2
## 6 18.1&6&225&105&2.76&3.46&20.22&1&0&3&1
as you can see, the import didn’t work quite well! This is because the separator is the &
for
some reason. Because we are trying to read a .csv
file, rio::import()
is using
data.table::fread()
under the hood (you can read this in import()
’s help). If you then read
data.table::fread()
’s help, you see that the fread()
function has an optional sep =
argument
that you can use to specify the separator. You can use this argument in import()
too, and it will
be passed down to data.table::fread()
:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
export()
allows you to write data to disk, by simply providing the path and name of the file you
wish to save.
If you end the name with .csv
the file is exported to the csv format, if instead you write .dta
the data will be exported to the STATA format, and so on.
If you wish to export to Excel, this is possible, but it may require that you change a file on your computer (you only have to do this once). Try running:
if this results in an error, try the following:
- Run the following lines in Rstudio:
if(!file.exists("~/.Rprofile")) # only create if not already there
file.create("~/.Rprofile") # (don't overwrite it)
file.edit("~/.Rprofile")
These lines, taken shamelessly from Efficient R
programming (go read it,
it’s a very great resource) look for and open the .Rprofile
file which is a file that is run
every time you open Rstudio. This means that you can put any line of code there that will always be
executed whenever you launch Rstudio.
- Add this line to the file:
This tells Rstudio to use zip.exe
as the default zip tool, which is needed to export files to the
Excel format. Try it out by restarting Rstudio, and then running the following lines:
You should find the mtcars.xlsx
inside your working directory. You can check what is your working
directory with getwd()
.
{rio}
should cover all your needs, but if not, there is very likely a package out there that will
import the data you need.
4.2 Writing any object to disk
{rio}
is an amazing package, but is only able to write tabular representations of data. What if you
would like to save, say, a list containing any arbitrary object? This is possible with the
saveRDS()
function. Literally anything can be saved with saveRDS()
:
my_list <- list("this is a list", list("which contains a list", 12), c(1, 2, 3, 4), matrix(c(2, 4,
3, 1, 5, 7), nrow = 2))
str(my_list)
## List of 4
## $ : chr "this is a list"
## $ :List of 2
## ..$ : chr "which contains a list"
## ..$ : num 12
## $ : num [1:4] 1 2 3 4
## $ : num [1:2, 1:3] 2 4 3 1 5 7
my_list
is a list containing a string, a list which contains a string and a number, a vector and
a matrix… Now suppose that computing this list takes a very long time. For example, imagine that
each element of the list is the result of estimating a very complex model on a simulated
dataset, which takes hours to simulate. Because this takes so long to compute, you’d want to save
it to disk. This is possible with saveRDS()
:
The next day, after having freshly started your computer and launched RStudio, it is possible to
retrieve the object exactly like it was using readRDS()
:
## List of 4
## $ : chr "this is a list"
## $ :List of 2
## ..$ : chr "which contains a list"
## ..$ : num 12
## $ : num [1:4] 1 2 3 4
## $ : num [1:2, 1:3] 2 4 3 1 5 7
Even if you want to save a regular dataset, using saveRDS()
might be a good idea because the data
gets compressed if you add the option compress = TRUE
to saveRDS()
. However keep in mind that
this will only be readable by R, so if you need to share this data with colleagues that use another
tool, save it in another format.
4.3 Using RStudio projects to manage paths
Managing paths can be painful, especially if you’re collaborating with a colleague and both of you saved the data in paths that are different. Whenever one of you wants to work on the script, the path will need to be adapted first. The best way to avoid that is to use projects with RStudio.
Imagine that you are working on a project entitled “housing”. You will create a folder called “housing” somewhere on your computer and inside this folder have another folder called “data”, then a bunch of other folders containing different files or the outputs of your analysis. What matters here is that you have a folder called “data” which contains the datasets you will ananlyze. When you are inside an RStudio project, granted that you chose your “housing” folder as the folder to host the project, you can read the data by simply specifying the path like so:
Constrast this to what you would need to write if you were not using a project:
Not only is that longer, but if Castor is working on this project with Pollux, Pollux would need to change the above line to this:
whenever Pollux needs to work on it. Another, similar issue, is that if you need to write something to disk, such as a dataset or a plot, you would also need to specify the whole path:
If you forget to write the whole path, then the dataset will be saved in the standard working
directory, which is your “My Documents” folder on Windows, and “Home” on GNU+Linux or macOS. You
can check what is the working directory with the getwd()
function:
On a fresh session on my computer this returns:
"/home/bruno"
or, on Windows:
"C:/Users/Bruno/Documents"
but if you call this function inside a project, it will return the path to your project. It is also
possible to set the working directory with setwd()
, so you don’t need to always write the full
path, meaning that you can this:
instead of:
import("the/path/I/want/data/my_data.csv")
export(processed_data, "the/path/I/want/processed_data.xlsx")
However, I really, really, really urge you never to use setwd()
. Use projects instead!
Using projects saves a lot of pain in the long run.