Descriptive data analysis

library(tidyverse)
library(aqm)
library(readr)

Introduction

There are three elements to analysing any data set.

Exploratory data analysis
Statistical model fitting
Presenting results

These elements overlap to some extent in any workflow. There is no clear boundary between exploratory analysis and statistical modelling, as statistical modelling itself is often used for exploration. However, you will tend to find yourself working through these phases on the path to writing up any analysis.

Exploratory, descriptive data analysis is a vital first step. We will look at the data on sleep in mammals (Allison and Cicchetti 1976). The data frame has been slightly modified to include a column on diet.

Video 1: Picking up where we left off

Video 2: Starting a new markdown document

Reproducible research

The great advantage of using R code for data analysis, as opposed to a graphical user interface, is reproducibility and transferability. The documents you are provided with on the course have all been written in markdown and compiled on the server, so the code has been tested and will produce the results shown. To produce the same results you do not need to remember any R at all. Carefully copying and pasting the code from the instructions document into your own document will produce the same results when the markdown is compiled.

Video 3: Reproducible research: Reusing code

Video 4: Loading packages

When we are using R we almost always load packages before starting work. A package bundles together code, data, documentation, and tests, and is easy to share with others. There are over 20,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.
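As a minimal sketch, a package is installed once and then loaded at the start of each session. The install.packages line is commented out here because it needs an internet connection and only has to be run once. The vector name pkgs is just an illustrative choice.

```r
# Install once (needs an internet connection), so it is commented out here
# install.packages("tidyverse")

# Check whether each package can be loaded (this does not attach it)
pkgs <- c("tidyverse", "readr")
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Attach a package for the current session; stats ships with R itself
library(stats)
```

Any package reported as FALSE would need to be installed before library can load it.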

Video 5: Editing opts_chunk$set

A markdown document usually starts with a hidden code chunk that sets some options. This can be carefully edited to prevent chunks displaying unwanted messages and even to switch the visibility of the code on and off.
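A minimal sketch of such a setup chunk is given below. The particular option values are an assumption chosen for illustration, not necessarily those used in the course documents.

```r
# Typically placed in the first chunk of a markdown document
library(knitr)

opts_chunk$set(
  echo = TRUE,      # show the code in the compiled document (FALSE hides it)
  message = FALSE,  # suppress messages such as the read_csv column specification
  warning = FALSE   # suppress warnings
)
```

Options set this way apply to every later chunk unless a chunk overrides them in its own header.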

Loading in the data

Let’s load our data back into R.

d <- read_csv("sleep.csv")

## Parsed with column specification:
## cols(
##   Species = col_character(),
##   BodyWt = col_double(),
##   BrainWt = col_double(),
##   Sleep = col_double(),
##   MaxAge = col_double(),
##   Gestation = col_double(),
##   Diet = col_character()
## )

I’ve called the data frame “d” here. This makes typing faster and is accepted practice if the analysis is only based on a single data frame. If an analysis uses several tables of data at the same time then it is better to use informative names for each one.

Video 6: Loading data and running code chunks in order

Video 7: Looking at the data.

When data are loaded into R they are held in memory. You can see data objects in the environment pane in the top right of the interface. Clicking on a data object brings up a spreadsheet-like table of the data that you can look through, search and sort. Unlike in a spreadsheet, you can’t change the data in this view. That is not the R way of working: you make changes to the data in R by typing in code. The table is simply a handy way of checking your data. As the columns can be sorted by clicking on the top row, it is easy to find the maximum and minimum values.

The structure of the data frame

There are two ways of quickly understanding the data. The simplest is just to click on the data frame in the environment pane. This pops up the data as a spreadsheet-like table in a tab at the top left. When you have finished looking at the data, close the tab.

The other way of finding out information about the data is to ask R directly. The str command produces information about the structure.

str(d)

## tibble [53 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Species  : chr [1:53] "African elephant" "African giant pouched rat" "Arctic Fox" "Arctic ground squirrel" ...
##  $ BodyWt   : num [1:53] 6654 1 3.38 0.92 2547 ...
##  $ BrainWt  : num [1:53] 5712 6.6 44.5 5.7 4603 ...
##  $ Sleep    : num [1:53] 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ MaxAge   : num [1:53] 38.6 4.5 14 5 69 27 19 30.4 28 50 ...
##  $ Gestation: num [1:53] 645 42 60 25 624 180 35 392 63 230 ...
##  $ Diet     : chr [1:53] "HERB" "GRAN" "CARN" "GRAN" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Species = col_character(),
##   ..   BodyWt = col_double(),
##   ..   BrainWt = col_double(),
##   ..   Sleep = col_double(),
##   ..   MaxAge = col_double(),
##   ..   Gestation = col_double(),
##   ..   Diet = col_character()
##   .. )

A data frame is the most commonly used data object in R. Whenever you load data from a spreadsheet into R you get a data frame. A data frame is a table with rows and columns. Each column corresponds to a variable. The str command provides information on the type of each variable.
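The following sketch builds a small data frame by hand to show the row and column structure. The values are the first four rows of four of the columns, as printed in the str output above. The name small is a hypothetical choice, so that the full data frame d is left untouched.

```r
small <- data.frame(
  Species = c("African elephant", "African giant pouched rat",
              "Arctic Fox", "Arctic ground squirrel"),
  BodyWt  = c(6654, 1, 3.38, 0.92),
  Sleep   = c(3.3, 8.3, 12.5, 16.5),
  Diet    = c("HERB", "GRAN", "CARN", "GRAN")
)

dim(small)     # number of rows, then number of columns
names(small)   # column (variable) names
str(small)     # type and first few values of each column
```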

The summary command is very handy if you want to find the summary statistics for all the variables at once. For numeric variables these statistics are the minimum, first quartile, median, mean, third quartile and maximum. The interquartile range is the difference between the third and first quartiles.

summary(d)

##    Species              BodyWt            BrainWt            Sleep      
##  Length:53          Min.   :   0.005   Min.   :   0.14   Min.   : 2.60  
##  Class :character   1st Qu.:   0.480   1st Qu.:   3.50   1st Qu.: 8.30  
##  Mode  :character   Median :   3.300   Median :  17.50   Median :10.60  
##                     Mean   : 216.958   Mean   : 306.59   Mean   :10.74  
##                     3rd Qu.:  52.160   3rd Qu.: 169.00   3rd Qu.:13.20  
##                     Max.   :6654.000   Max.   :5712.00   Max.   :19.90  
##      MaxAge         Gestation         Diet          
##  Min.   :  2.00   Min.   : 12.0   Length:53         
##  1st Qu.:  5.00   1st Qu.: 35.0   Class :character  
##  Median : 14.00   Median : 63.0   Mode  :character  
##  Mean   : 19.74   Mean   :126.8                     
##  3rd Qu.: 28.00   3rd Qu.:164.0                     
##  Max.   :100.00   Max.   :645.0
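Each of these statistics can also be calculated individually. The sketch below uses only the first ten Sleep values printed in the str output above rather than the whole column, so the numbers will differ from the summary table. The name sleep_sample is hypothetical.

```r
sleep_sample <- c(3.3, 8.3, 12.5, 16.5, 3.9, 9.8, 19.7, 6.2, 14.5, 9.7)

mean(sleep_sample)
median(sleep_sample)
range(sleep_sample)   # minimum and maximum

# First and third quartiles, as reported by summary()
q <- quantile(sleep_sample, probs = c(0.25, 0.75))
q

# The interquartile range is the difference between the two quartiles
IQR(sleep_sample)
```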

Video 8: Talking about data exploration

Simple plots and figures

There are many different ways of making figures in R. The ggplot2 package has become the standard for presentations and publications, as it produces modern, professional-looking figures. For simple data exploration the original base graphics that form part of the underlying R language are useful.

These sorts of figures use syntax that is easy to remember and quick to type. They tend to work on only one or two variables at once. The results are not usually very elegant, but they do help to explore the data quickly.

Histograms and boxplots

Video 9: Forming a simple histogram

The code for a simple exploratory histogram is very short.

Notice the way the column that you want to plot is specified by the name of the data frame (d) followed by a dollar sign, then the name of the column. This is standard R syntax.

To build your own analysis in your own markdown document you can either carefully copy and paste code from other sources (these handouts, for example) or you can write your own code. Be very careful to always run code chunks in order. If, for example, you had not loaded the data frame earlier in the analysis and tried to run the code, you would get an error of the form Error in hist(x) : object ‘x’ not found

hist(d$BodyWt)
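To illustrate both points, the sketch below extracts a column with the dollar sign and then deliberately refers to a data frame that has not been created. The names small and not_loaded are hypothetical, and the four values are the first body weights printed by str above.

```r
small <- data.frame(BodyWt = c(6654, 1, 3.38, 0.92))

# The dollar sign pulls one column out of the data frame as a vector
small$BodyWt

# Referring to an object that has not been created yet raises an error.
# tryCatch captures the error message so the rest of the code keeps running.
msg <- tryCatch(
  hist(not_loaded$BodyWt),
  error = function(e) conditionMessage(e)
)
msg
```

The captured message names the missing object, which is usually the quickest clue that an earlier chunk has not been run.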

Video 9: Copying and pasting code into a markdown document.

Video 10: Running a plot command in the console

Video 11: Producing a boxplot and spotting skew

boxplot(d$BodyWt)

These plots show up a very clear statistical feature of this variable, which you could also have seen in the summary statistics. The variable is very strongly right skewed.
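A quick numerical check for right skew is to compare the mean with the median: a few very large values pull the mean upwards but leave the median almost unchanged. This sketch uses the first five body weights from the str output. The name wt is hypothetical.

```r
wt <- c(6654, 1, 3.38, 0.92, 2547)

mean(wt)     # pulled upwards by the two very large values
median(wt)   # the middle value, unaffected by the extremes

# TRUE when the mean exceeds the median, as expected under right skew
mean(wt) > median(wt)
```

The same pattern is visible in the summary table above, where the mean of BodyWt is far larger than its median.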

Video 11: Log transforming a variable

When we are working with data we often want to produce new columns containing new variables. R has many thousands of functions for working with data. If you want to carry out any mathematical operation at all on your data you can do it in R. The difficulty may be knowing which operation you require and finding the right syntax. In the case of a logarithmic transform it is fairly simple. However, R uses log to refer to natural logarithms (other software tends to use ln). If we want logarithms to the base 10 we need log10.
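The difference between the two functions is easy to check on powers of ten. This is a sketch using made-up round numbers, not values from the sleep data.

```r
x <- c(1, 10, 100, 1000)

log(x)             # natural logarithms (base e)
log10(x)           # base 10 logarithms
log(x, base = 10)  # the same result using the base argument

# Back-transforming with powers of ten recovers the original values
10^log10(x)
```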

The assignment operator in R is an arrow. You can think of this as taking the values on the right and sending them into the object on the left. If the variable does not yet exist it will be created.

d$logBodyWt <- log10(d$BodyWt)

The result of taking logarithms is shown in the histogram below. The new variable has much better statistical properties, as its distribution is far less skewed.

hist(d$logBodyWt)

Plotting two variables

Base R is able to adapt a single command to plot two variables according to the nature of the variables involved. So if we plot a (perhaps suitably transformed) variable against a factor we will get boxplots.

When we read the data in using read_csv and asked for the structure, R told us that Diet was a character vector. In order to carry out analysis on such data we have to coerce it to a factor. This takes a single line of code in R.

d$Diet <- as.factor(d$Diet)
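The effect of the conversion can be seen on a short character vector. The four values are the first Diet entries printed by str above. The names diet and diet_f are hypothetical.

```r
diet <- c("HERB", "GRAN", "CARN", "GRAN")
class(diet)            # "character"

diet_f <- as.factor(diet)
class(diet_f)          # "factor"
levels(diet_f)         # the distinct categories, sorted alphabetically
table(diet_f)          # number of observations in each category
```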

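The older, base R command (not run here) uses a dot instead of an underscore. In versions of R before 4.0.0 it automatically carried out this step for you. From R 4.0.0 onwards it also reads text columns as character unless you add stringsAsFactors = TRUE.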

# Base R uses a dot, not an underscore: read.csv, not read_csv
# d <- read.csv("sleep.csv")
# Note that code with a # before it is not run. This can be used to deactivate code that you still want to include in a document for reference purposes

However, the change implemented in the readr package was made because, when more complex data are read in, this default behaviour can be undesirable. As RStudio prompts you to use read_csv, this had to be pointed out. It can cause confusion! Sorry about this. So do always check your data using str.
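The difference in behaviour can be checked without a file by passing a few lines of text to read.csv. This is a sketch with two hypothetical rows and hypothetical object names. Note that from R 4.0.0 onwards read.csv also leaves text columns as character unless stringsAsFactors = TRUE is given, so the argument is set explicitly in both calls.

```r
# Two made-up rows, supplied as text instead of a file
csv_text <- "Species,Diet\nSpecies A,HERB\nSpecies B,CARN"

kept_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
made_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(kept_text)    # both columns are character vectors
str(made_factor)  # both columns have been converted to factors
```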

plot(d$logBodyWt ~ d$Diet)

Plotting a numerical variable against another numerical variable produces a scatterplot.

plot(d$BodyWt ~ d$BrainWt)

This shows up the problem with using the untransformed variables in this case. In order to analyse these data you will need to log transform more than one variable.
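As a sketch of what that looks like, the code below plots both variables on log scales for the five species whose values appear in the str output. The data frame name five and the new column names are hypothetical, and the choice of axes follows the plot above.

```r
five <- data.frame(
  BodyWt  = c(6654, 1, 3.38, 0.92, 2547),
  BrainWt = c(5712, 6.6, 44.5, 5.7, 4603)
)

# Create log10 versions of both variables as new columns
five$logBodyWt  <- log10(five$BodyWt)
five$logBrainWt <- log10(five$BrainWt)

# Using the data argument avoids repeating the data frame name
plot(logBodyWt ~ logBrainWt, data = five)
```

With only five points this is no substitute for plotting the full data frame, but it shows the pattern of code needed.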

Exercise

Make histograms of all the other variables in the data set. What do you find? Log transform the variables which may require transformation. Replot the data. Write up the exercise using annotated R code as a markdown document.

References

Allison, T, and DV Cicchetti. 1976. “Sleep in Mammals: Ecological and Constitutional Correlates.” Science 194 (4266). American Association for the Advancement of Science: 732–34. doi:10.1126/science.982039.