Chapter 5 Loading data for exploratory analysis

5.1 Introduction

There are three elements to analysing any data set:

Exploratory data analysis
Statistical model fitting
Presenting the final results

These elements overlap to some extent in any workflow. There is no clear boundary between exploratory analysis and statistical modelling, as statistical modelling is itself often used for exploration. However, you will tend to find yourself working through these phases on the path to writing up any analysis.

Exploratory, descriptive data analysis is a vital first step. We will look at the data on sleep in mammals (Allison and Cicchetti 1976). The data frame has been slightly modified to include a column on diet.

5.1.1 Video 1: Picking up where we left off

5.1.2 Video 2: Starting a new markdown document

5.2 Reproducible research

The great advantage of using R code for data analysis, as opposed to a graphical user interface, is reproducibility and transferability. The documents you are provided with on the course have all been written in markdown and compiled on the server, so the code has been tested and will produce the results shown. To reproduce the same results you do not need to remember any R at all: carefully copying and pasting the code from the instructions document into the document you are working on will produce the same results when the markdown is compiled.

5.2.1 Video 3: Reproducible research: Reusing code

5.2.2 Video 4: Loading packages

When we are using R we almost always load packages before starting work. A package bundles together code, data, documentation, and tests, and is easy to share with others. There are over 20,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.
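As a rough sketch of what loading a package looks like in practice, the code below checks whether a package is installed and attaches it with library(). The choice of readr here is an assumption made for illustration; the packages actually used on the course are introduced in the video.

```r
# Packages are installed once, then loaded at the start of each session.
# install.packages("readr")   # uncomment and run once if readr is missing

# requireNamespace() reports whether a package is installed without attaching it.
if (requireNamespace("readr", quietly = TRUE)) {
  library(readr)              # attach the package so its functions can be used
} else {
  message("readr is not installed; run install.packages('readr') first")
}

# Base packages such as stats ship with R and are always available.
library(stats)
```

Installing a package copies it to your computer and only needs doing once; library() has to be run in every new session, or at the top of every markdown document, before the package's functions can be used.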

5.2.3 Video 5: Editing opts_chunk$set

A markdown document usually starts with a hidden code chunk that sets some options. This can be carefully edited to prevent chunks displaying unwanted messages and even to switch the visibility of the code on and off.

Please be very careful if you edit this chunk. Any error will prevent the document from knitting.
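For orientation, a setup chunk of this kind often looks something like the sketch below. The option values shown are assumptions chosen for illustration, not necessarily the settings used in the course documents. The options echo, message and warning are standard knitr chunk options.

```r
# Illustrative contents of the setup chunk at the top of an R Markdown document.
# In the .Rmd file this code sits inside a chunk headed {r setup, include=FALSE}.
knitr::opts_chunk$set(
  echo    = TRUE,   # show the code in the knitted document (FALSE hides it)
  message = FALSE,  # suppress messages, e.g. those printed when packages load
  warning = FALSE   # suppress warnings in the knitted output
)
```

Each option must be separated from the next by a comma, and the closing bracket must be kept. A missing comma or bracket is the most common reason for this chunk to stop the document knitting.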

5.3 Loading in the data

Let’s load our data back into R.

I’ve called the data frame “d” here. This makes typing faster and is accepted practice if the analysis is only based on a single data frame. If an analysis uses several tables of data at the same time then it is better to use informative names for each one.

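A data frame is a table containing multiple columns. Each column is a variable. Columns containing names hold categorical variables, which R usually refers to as factors. Depending on how the data are read in, such columns may first appear as character (chr) columns, which can be converted to factors when needed.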
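The loading step can be sketched as follows. This is a self-contained illustration under stated assumptions, not the course's own chunk: it writes the first four rows of the mammal sleep data (taken from the str output shown later in this chapter) to a temporary file, then reads that file with base R's read.csv. The file location is hypothetical and stands in for wherever the course data are stored.

```r
# Write a four-row extract of the mammal sleep data to a temporary CSV file.
csv_lines <- c(
  "Species,BodyWt,BrainWt,Sleep,MaxAge,Gestation,Diet",
  "African elephant,6654,5712,3.3,38.6,645,HERB",
  "African giant pouched rat,1,6.6,8.3,4.5,42,GRAN",
  "Arctic Fox,3.38,44.5,12.5,14,60,CARN",
  "Arctic ground squirrel,0.92,5.7,16.5,5,25,GRAN"
)
tmp <- tempfile(fileext = ".csv")  # hypothetical file standing in for the course data
writeLines(csv_lines, tmp)

# Read the file into a data frame called d.
d <- read.csv(tmp, stringsAsFactors = FALSE)

# Text columns arrive as character; convert Diet to a factor explicitly.
d$Diet <- factor(d$Diet)

dim(d)          # number of rows and columns
levels(d$Diet)  # the categories found in the Diet column
```

Converting Diet to a factor is optional at this stage, but it tells R to treat the column as a set of categories rather than as free text.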

5.3.1 Video 6: Loading data and running code chunks in order

5.3.2 Video 7: Looking at the data

When data are loaded into R they are held in memory. You can see data objects in the environment pane in the top right of the interface. Clicking on a data object brings up a spreadsheet-like table of the data that you can look through, search and sort. Unlike in a spreadsheet, you can’t change the data in this view. That is not the R way of working: you make changes to the data in R by typing in code. The viewer is simply a handy way of checking your data. As the columns can be sorted by clicking on the top row, it is easy to find the maximum and minimum values.
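The same checks can also be made in code. The sketch below is an illustration only: it rebuilds two columns for the first four species in the data set so that it runs on its own, sorts them, finds the extremes, and then makes one change to the data. The capitalisation change is a hypothetical edit chosen to show the idea, not a correction made in the course.

```r
# A small stand-in for the full data frame: the first four species and their Sleep values.
d <- data.frame(
  Species = c("African elephant", "African giant pouched rat",
              "Arctic Fox", "Arctic ground squirrel"),
  Sleep   = c(3.3, 8.3, 12.5, 16.5),
  stringsAsFactors = FALSE
)

# Sort the rows by Sleep, smallest first (what a click on the column header does).
d_sorted <- d[order(d$Sleep), ]

# Smallest and largest values in one call.
sleep_range <- range(d$Sleep)

# The species with the largest Sleep value.
longest_sleeper <- d$Species[which.max(d$Sleep)]

# A change to the data is made in code, not by typing into the viewer.
d$Species[d$Species == "Arctic Fox"] <- "Arctic fox"
```

Because the change is written down as a line of code, it is repeated exactly every time the document is compiled, which is what makes the analysis reproducible.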

5.3.3 The structure of the data frame

There are two ways of quickly understanding the data. The simplest is just to click on the data frame in the environment pane. This opens the data as a spreadsheet-like table in a new tab in the top left pane. When you have finished looking at the data, close the tab.

The other way of finding out information about the data is to ask R directly. The str command produces information about the structure.

## spec_tbl_df [53 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Species  : chr [1:53] "African elephant" "African giant pouched rat" "Arctic Fox" "Arctic ground squirrel" ...
##  $ BodyWt   : num [1:53] 6654 1 3.38 0.92 2547 ...
##  $ BrainWt  : num [1:53] 5712 6.6 44.5 5.7 4603 ...
##  $ Sleep    : num [1:53] 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ MaxAge   : num [1:53] 38.6 4.5 14 5 69 27 19 30.4 28 50 ...
##  $ Gestation: num [1:53] 645 42 60 25 624 180 35 392 63 230 ...
##  $ Diet     : chr [1:53] "HERB" "GRAN" "CARN" "GRAN" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Species = col_character(),
##   ..   BodyWt = col_double(),
##   ..   BrainWt = col_double(),
##   ..   Sleep = col_double(),
##   ..   MaxAge = col_double(),
##   ..   Gestation = col_double(),
##   ..   Diet = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

A data frame is the most commonly used data object in R. Whenever you load data from a spreadsheet into R you get a data frame. A data frame is a table with rows and columns. Each column corresponds to a variable. The str command provides information on the type of each variable, such as chr for text and num for numbers.
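To see how str is called, here is a minimal sketch on a cut-down data frame. It is built by hand from the first four rows and four of the columns shown in the output above, so that the example runs without the data file.

```r
# A cut-down version of the data frame, typed in by hand for illustration.
d <- data.frame(
  Species = c("African elephant", "African giant pouched rat",
              "Arctic Fox", "Arctic ground squirrel"),
  BodyWt  = c(6654, 1, 3.38, 0.92),
  Sleep   = c(3.3, 8.3, 12.5, 16.5),
  Diet    = c("HERB", "GRAN", "CARN", "GRAN"),
  stringsAsFactors = FALSE
)

# One line per column: its name, its type and the first few values.
str(d)

# The type of each column on its own, as a named character vector.
col_types <- sapply(d, class)
col_types
```

A plain data frame built this way prints a shorter header than the output above, because the extra attribute lines (spec and problems) are added when the data are read from a file by the import function.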

The summary command is very handy if you want to find the summary statistics for all the variables at once. For each numeric variable these statistics are the minimum, maximum, mean, median, and the first and third quartiles.

##    Species              BodyWt            BrainWt            Sleep      
##  Length:53          Min.   :   0.005   Min.   :   0.14   Min.   : 2.60  
##  Class :character   1st Qu.:   0.480   1st Qu.:   3.50   1st Qu.: 8.30  
##  Mode  :character   Median :   3.300   Median :  17.50   Median :10.60  
##                     Mean   : 216.958   Mean   : 306.59   Mean   :10.74  
##                     3rd Qu.:  52.160   3rd Qu.: 169.00   3rd Qu.:13.20  
##                     Max.   :6654.000   Max.   :5712.00   Max.   :19.90  
##      MaxAge         Gestation         Diet          
##  Min.   :  2.00   Min.   : 12.0   Length:53         
##  1st Qu.:  5.00   1st Qu.: 35.0   Class :character  
##  Median : 14.00   Median : 63.0   Mode  :character  
##  Mean   : 19.74   Mean   :126.8                     
##  3rd Qu.: 28.00   3rd Qu.:164.0                     
##  Max.   :100.00   Max.   :645.0
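The call that produces a table like this, together with functions for the individual statistics, can be sketched as follows. The example uses only the first four rows of two columns, typed in by hand, so its numbers differ from the full table above.

```r
# Two numeric columns from the first four rows of the data, for illustration.
d <- data.frame(
  BodyWt = c(6654, 1, 3.38, 0.92),
  Sleep  = c(3.3, 8.3, 12.5, 16.5)
)

# Minimum, quartiles, median, mean and maximum for every column at once.
summary(d)

# The same statistics can be requested one at a time.
sleep_median    <- median(d$Sleep)
sleep_quartiles <- quantile(d$Sleep, c(0.25, 0.75))  # first and third quartiles
sleep_iqr       <- IQR(d$Sleep)                      # third quartile minus first quartile
```

Note that summary reports the two quartiles themselves; the interquartile range is the difference between them and is returned by IQR.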

5.3.4 Video 8: Talking about data exploration

References

Allison, T, and DV Cicchetti. 1976. “Sleep in Mammals: Ecological and Constitutional Correlates.” Science 194 (4266): 732–34. https://doi.org/10.1126/science.982039.