Chapter 7 More exploratory data analysis

Here is another example data set for exploration.

7.1 Step one. Load any packages needed

The simplest exploratory techniques just use the functions available in the base R language, so there is no need for a long list of packages to be loaded at this stage. The example data is included in my aqm package on the server, so that package does need to be loaded.

Note that I always include the tidyverse library. This is essential for manipulating data and forming plots using more modern methods that have been added to R since its inception.

library(tidyverse)
library(aqm)

7.2 Load the data

The next step is to load in the data. This will usually involve reading a table of data from a csv file or a correctly formatted spreadsheet. In this case there is data in the aqm package that we can use.
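If your own data is in a csv file, a minimal sketch of the loading step looks like the following. The file is written to a temporary location purely so that the example runs anywhere, and the name d_csv is only for illustration. The three rows are the first three values of each variable shown by str(d) later in this chapter.

```r
# Write a tiny csv file to a temporary location (a stand-in for your own file)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Lshell,BTVolume,Site",
             "122.1,39,Site_6",
             "100.1,21,Site_6",
             "100.7,23,Site_6"), csv_file)

# Read it into a data frame; stringsAsFactors = TRUE turns text columns into factors
d_csv <- read.csv(csv_file, stringsAsFactors = TRUE)
str(d_csv)
```

With your own file you would replace csv_file with the path to that file.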

data(mussels)

This loads the mussels data. If you look in the environment pane you will see a data frame. Click on it to visualise it directly.

To save typing when working with one single table I often assign it to a data frame just called d. This is a personal style. Not everyone does this. You can type out mussels to refer to the data if you wish.

d<-mussels

7.3 Structure of the data

The str command produces a description of the structure of the data.

str(d)
## 'data.frame':    113 obs. of  3 variables:
##  $ Lshell  : num  122.1 100.1 100.7 102.3 94.9 ...
##  $ BTVolume: int  39 21 23 22 20 22 21 18 21 15 ...
##  $ Site    : Factor w/ 6 levels "Site_1","Site_2",..: 6 6 6 6 6 6 6 6 6 6 ...

We have a data frame consisting of a numerical variable called Lshell, a numerical (integer) variable called BTVolume and a factor with six levels called Site.
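If you want to check the type of each column directly, something like the following sketch can help. It uses a small stand-in data frame, d_demo, with the same three column types as the mussels data. The values and the name d_demo are made up for illustration.

```r
# Stand-in with the same column types as the mussels data (values are illustrative)
d_demo <- data.frame(
  Lshell   = c(122.1, 100.1, 100.7, 95.3, 88.0, 110.4),
  BTVolume = c(39L, 21L, 23L, 18L, 15L, 30L),
  Site     = factor(c("Site_1", "Site_1", "Site_2", "Site_2", "Site_3", "Site_3"))
)

sapply(d_demo, class)   # the class of each column
levels(d_demo$Site)     # the names of the factor levels
nlevels(d_demo$Site)    # how many levels there are
```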

At this point you would refer to the “metadata” (data about data) to check the definition of these variables and find out the units of measurement, site characteristics etc.

Note how the data consists of three variables. Two of them are numeric and one is a factor.

Data such as this should be provided with metadata that describe the variables. In this case the shell lengths (Lshell) are in mm and BTVolume is the volume of the body of the mussel inside the shell in ml.

You should realise that although the data comes from six sites, all the data is held together as a single table. When students collect their own data they often do not do this. Instead they use a spreadsheet with a tab for each of the two variables and, within each tab, a separate column for each site. This is known as wide format and is bad practice. You should avoid collecting data in wide format as it can be quite difficult to turn into a data frame. Wide format looks like this.

7.4 Wide format

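# Number the rows within each site so that each row of the wide table has an id
d %>% group_by(Site) %>% mutate(id = row_number()) %>% ungroup() -> dd
# Keep one measured variable, then spread the sites across columns
dd %>% dplyr::select(id, Site, Lshell) %>%
  pivot_wider(names_from = Site, values_from = Lshell) -> dd
dt(dd)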

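Notice the problem. Not only is this format awkward to use and store in Excel, as it leads to multiple sheets, but the sites also have different numbers of observations, so the columns are of unequal length and the shorter ones have to be padded with missing values. This made it quite tricky to form the wide table in R, which is why three lines of code were needed.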

7.5 Pivoting to long

If you do read data into R in wide format it can be turned back into a long format data frame by pivoting.

dd %>% pivot_longer(-1, values_drop_na = TRUE,names_to = "Site",values_to = "Lshell") ->dd
dt(dd)
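To see exactly what the two pivots do, it can help to try them on a table small enough to read in full. The sketch below uses made-up shell lengths for two sites with unequal numbers of measurements. The names long, wide and long_again are only for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long format data: two sites with unequal numbers of measurements
long <- data.frame(
  Site   = c("Site_1", "Site_1", "Site_1", "Site_2", "Site_2"),
  Lshell = c(100.5, 98.2, 110.0, 87.3, 92.1)
)

# Long to wide: one column per site, padded with NA where a site has fewer rows
wide <- long %>%
  group_by(Site) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Site, values_from = Lshell)
wide

# Wide back to long: drop the NA padding and restore the Site column
long_again <- wide %>%
  pivot_longer(-id, names_to = "Site", values_to = "Lshell",
               values_drop_na = TRUE)
long_again
```

The wide table has an NA in the Site_2 column because that site has one fewer measurement. Setting values_drop_na = TRUE removes that padding on the way back to long format.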

7.6 Avoid producing data in wide format

Think carefully before collecting your data. Use variables such as site as variables, not as unique tabs in a spreadsheet. Identify your variables. Your data is probably simpler than you think it is. The data produced by most studies can form a single data frame.
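If you already have one small table per site, one way to bring them together is to record the site as a variable in each table and then stack the rows. This is a minimal sketch in base R with made-up values. The names site_1, site_2 and all_sites are hypothetical.

```r
# Made-up measurements held as one small table per site
site_1 <- data.frame(Lshell = c(100.5, 98.2, 110.0), BTVolume = c(25L, 22L, 31L))
site_2 <- data.frame(Lshell = c(87.3, 92.1), BTVolume = c(15L, 18L))

# Record the site as a variable in each table, then stack the rows
site_1$Site <- "Site_1"
site_2$Site <- "Site_2"
all_sites <- rbind(site_1, site_2)

# Treat Site as a factor, as in the mussels data
all_sites$Site <- factor(all_sites$Site)
str(all_sites)
```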

7.7 Summarising

summary(d)
##      Lshell         BTVolume         Site   
##  Min.   : 61.9   Min.   : 5.00   Site_1:26  
##  1st Qu.: 97.0   1st Qu.:21.00   Site_2:25  
##  Median :106.9   Median :28.00   Site_3: 8  
##  Mean   :106.8   Mean   :27.81   Site_4: 8  
##  3rd Qu.:118.7   3rd Qu.:35.00   Site_5:21  
##  Max.   :132.6   Max.   :59.00   Site_6:25

The summary command is useful for small data sets such as this. For each numeric variable it provides the minimum, 1st quartile, median, mean, 3rd quartile and maximum. For the factor it shows the sample size for each level.
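The numbers reported by summary can also be obtained one at a time, which is handy when you only need one of them. The sketch below uses a short made-up vector x and a made-up factor site rather than the mussels data.

```r
# Made-up shell lengths and sites, for illustration only
x    <- c(61.9, 88.0, 97.0, 106.9, 110.4, 118.7, 132.6)
site <- factor(c("Site_1", "Site_1", "Site_2", "Site_2", "Site_2", "Site_3", "Site_3"))

summary(x)      # minimum, quartiles, median, mean and maximum in one call
quantile(x)     # minimum, quartiles, median and maximum
mean(x)         # the mean on its own
summary(site)   # for a factor: the number of observations at each level
table(site)     # the same counts as a table
```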

7.8 Histogram of a variable

These data consist of measurements taken at six different sites. So to fully understand the distribution of the variability you should “condition” on the site, i.e. plot a histogram for each site. You will see how to do this easily at a later stage. In the very early stages of data analysis it can still be useful to look at the distribution of the variable in aggregate form.

hist(d$Lshell)

This suggests that the shell lengths are approximately normally distributed. More on this later.
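The impression given by a histogram depends partly on how the values are grouped into bins, so it is worth trying more than one setting. The sketch below uses simulated values, not the mussels data. The sample size and mean are chosen to resemble the summary above, and the standard deviation of 15 is simply an assumption.

```r
set.seed(1)                            # make the simulated data reproducible
x <- rnorm(113, mean = 107, sd = 15)   # simulated lengths, not the mussels data

hist(x)                                # default choice of bins
hist(x, breaks = 20)                   # suggest roughly 20 bins instead

h <- hist(x, plot = FALSE)             # compute the bins without drawing them
h$breaks                               # the bin boundaries
h$counts                               # the number of observations in each bin
```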

7.9 Boxplot

A boxplot is another useful way of understanding the variability. If the box looks approximately symmetrical about the median this is another indication that the variability as a whole is approximately normally distributed.

boxplot(d$Lshell)

Try these commands for BTVolume.
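If you want the numbers behind the picture, the boxplot command returns them as well as drawing the plot. This is a small sketch with made-up values. The names x and b are only for illustration.

```r
x <- c(61.9, 88.0, 97.0, 106.9, 110.4, 118.7, 132.6)   # made-up values

b <- boxplot(x)   # draws the plot and returns the numbers behind it
b$stats           # lower whisker, lower hinge, median, upper hinge, upper whisker

# Distances from the median to each edge of the box; similar values suggest symmetry
b$stats[3] - b$stats[2]
b$stats[4] - b$stats[3]
```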

7.10 Quick plot of two variables

Base R is able to adapt a single command to plot two variables according to the nature of the variables involved.

So if we ask R to plot BTVolume against Lshell we get a simple default scatterplot.

plot(d$BTVolume~d$Lshell)

If we ask for a plot of Lshell against Site, which is a factor, we get boxplots.

plot(d$Lshell~d$Site)
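The formula form of plot also accepts a data argument, which saves typing the name of the data frame in front of every variable. The sketch below uses a small stand-in data frame, d_demo, with made-up values, so that it runs without the aqm package.

```r
# Illustrative stand-in for the mussels data (values are made up)
d_demo <- data.frame(
  Lshell   = c(122.1, 100.1, 100.7, 95.3, 88.0, 110.4),
  BTVolume = c(39, 21, 23, 18, 15, 30),
  Site     = factor(c("Site_1", "Site_1", "Site_2", "Site_2", "Site_3", "Site_3"))
)

plot(BTVolume ~ Lshell, data = d_demo)   # numeric against numeric: scatterplot
plot(Lshell ~ Site, data = d_demo)       # numeric against factor: one boxplot per level
```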

These are both useful for rapid visualisation of the basic structure of the data. Exploratory analysis of a new data set helps you decide on the next steps required in order to answer questions from the data and present the data to the target audience. Usually no one will see these simple figures except you. However, this course is different, as you are learning about the methods and how to apply them. So when you carry out exploratory data analysis ensure that you embed all the steps in R as code chunks and annotate them. Write some text explaining what you have found out from the exploratory analysis.

7.11 Commands to remember

The R commands required for basic exploratory analysis are few in number and very easy to remember.

  1. str(d)
  2. summary(d)
  3. hist(d$x)
  4. boxplot(d$x)
  5. plot(y~x)

You can go a long way towards exploring a typical, simple data set using just these commands.
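As a quick way to practise, the same five commands can be run on any data frame. The sketch below uses iris, a data set that is built into R, instead of the mussels data, so it runs without any extra packages.

```r
data(iris)                                       # built-in data: four numeric columns and one factor

str(iris)                                        # structure of the data
summary(iris)                                    # summary of every column
hist(iris$Sepal.Length)                          # histogram of one variable
boxplot(iris$Sepal.Length)                       # boxplot of one variable
plot(Petal.Length ~ Sepal.Length, data = iris)   # numeric against numeric: scatterplot
plot(Sepal.Length ~ Species, data = iris)        # numeric against factor: boxplots
```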

7.12 Appendix: The data set

Small data tables such as this can be embedded within a data report itself and explored quickly when the report is knitted up. The dt command in the aqm package is all that is needed.

dt(d)

7.13 Video explanation

Not the most polished Zoom recording (I forgot where the ~ sign is to be found on a Mac keyboard … I don’t usually use a Mac). Also I discovered after I had recorded it that the knitted documents are not visible, as the screen being recorded is in another window. Live and learn. I do need to go back and re-record this one.

However I hope it helps a bit as it is.

Video 2