The simplest exploratory techniques use only the functions available in the base R language, so there is no need to load a long list of packages at this stage. Some data is included in my aqm package on the server, so that package does need to be loaded.
library(aqm)
##
## Attaching package: 'aqm'
## The following object is masked from 'package:stats':
##
## dt
The next step is to load in the data. This will usually involve reading a table of data from a csv file or a correctly formatted spreadsheet. In this case there is data in the aqm package that we can use.
data(mussels)
This loads the mussels data. If you look in the environment pane you will see a data frame. Click on it to visualise it directly.
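For the more usual case of reading a table from a csv file, here is a minimal sketch. So that it runs anywhere, it first writes a small temporary file holding only the first five rows printed by the str output in this section. The names excerpt, csv_path and from_csv are placeholders, and "Site_6" is assumed to be the label of level 6.

```r
# First five rows as printed by str(d); "Site_6" is assumed to be level 6.
excerpt <- data.frame(
  Lshell   = c(122.1, 100.1, 100.7, 102.3, 94.9),
  BTVolume = c(39L, 21L, 23L, 22L, 20L),
  Site     = "Site_6"
)

# Write to a temporary csv file so the example does not depend on any path.
csv_path <- tempfile(fileext = ".csv")
write.csv(excerpt, csv_path, row.names = FALSE)

# Read it back. stringsAsFactors = TRUE turns the text column into a factor
# (since R 4.0 the default is to leave it as character).
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(from_csv)
```

With your own data you would replace csv_path with the path to your file.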
To save typing when working with one single table I often assign it to a data frame just called d. This is a personal style. Not everyone does this. You can type out mussels to refer to the data if you wish.
d <- mussels
The str command prints a description of the structure of the data.
str(d)
## 'data.frame': 113 obs. of 3 variables:
## $ Lshell : num 122.1 100.1 100.7 102.3 94.9 ...
## $ BTVolume: int 39 21 23 22 20 22 21 18 21 15 ...
## $ Site : Factor w/ 6 levels "Site_1","Site_2",..: 6 6 6 6 6 6 6 6 6 6 ...
We have a data frame consisting of a numerical variable called Lshell, a numerical (integer) variable called BTVolume and a factor with six levels called Site.
At this point you would refer to the “metadata” (data about data) to check the definition of these variables and find out the units of measurement, site characteristics etc.
summary(d)
##      Lshell         BTVolume         Site   
##  Min.   : 61.9   Min.   : 5.00   Site_1:26  
##  1st Qu.: 97.0   1st Qu.:21.00   Site_2:25  
##  Median :106.9   Median :28.00   Site_3: 8  
##  Mean   :106.8   Mean   :27.81   Site_4: 8  
##  3rd Qu.:118.7   3rd Qu.:35.00   Site_5:21  
##  Max.   :132.6   Max.   :59.00   Site_6:25  
The summary command is useful for small data sets such as this. For each numerical variable it provides the minimum, 1st quartile, median, mean, 3rd quartile and maximum. For the factor it shows the sample size for each level.
These data consist of measurements taken at six different sites. So to fully understand the distribution of the variability you should “condition” on the site, i.e. plot a histogram for each site. You will see how to do this easily at a later stage. In the very early stages of data analysis it can still be useful to look at the distribution of the variable in aggregate form.
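Each figure that summary reports can also be obtained with its own function, which helps when only one of them is needed. This is a small sketch run on just the first five shell lengths printed by str(d), so the numbers will not match the full table above. "Site_6" is again assumed to be the label of level 6.

```r
# First five Lshell values as printed by str(d); not the full data set.
x <- c(122.1, 100.1, 100.7, 102.3, 94.9)

mean(x)                      # arithmetic mean
median(x)                    # middle value
quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles
range(x)                     # minimum and maximum

# For a factor, table() gives the count in each level.
site <- factor(rep("Site_6", 5))
table(site)
```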
hist(d$Lshell)
This suggests that the shell lengths are approximately normally distributed. More on this later.
A boxplot is another useful way of understanding the variability. If the box looks approximately symmetrical, this is another indication that the variability as a whole is approximately normally distributed.
boxplot(d$Lshell)
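The symmetry that the boxplot shows can also be checked with simple arithmetic. The sketch below uses the Lshell figures printed by summary(d) earlier in this section. The variable names are only for illustration, and the hinges that boxplot draws may differ very slightly from these quartiles.

```r
# Values for Lshell copied from the summary(d) output.
q1  <- 97.0
med <- 106.9
q3  <- 118.7
avg <- 106.8

lower_half <- med - q1   # roughly the part of the box below the median line
upper_half <- q3 - med   # roughly the part of the box above the median line
c(lower = lower_half, upper = upper_half)

# In a symmetrical distribution the mean and median nearly coincide.
avg - med
```

The two halves of the box come out at 9.9 and 11.8, and the mean is within 0.1 of the median, which agrees with the visual impression of rough symmetry.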
Try these commands for BTVolume.
Base R is able to adapt a single command to plot two variables according to the nature of the variables involved.
So if we ask R to plot BTVolume against Lshell we get a simple default scatterplot.
plot(d$BTVolume~d$Lshell)
If we ask for a plot of Lshell against Site we get boxplots.
plot(d$Lshell~d$Site)
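The same behaviour can be seen with any data frame. The sketch below uses R's built-in iris data purely as a stand-in, so that it runs without the aqm package. It also passes the data frame through the data argument rather than repeating the data frame name inside the formula, which is a matter of style.

```r
# iris ships with R: four numeric measurements and a factor (Species).
# It is a stand-in here, not the mussels data.

# numeric ~ numeric gives a scatterplot
plot(Petal.Width ~ Petal.Length, data = iris)

# numeric ~ factor gives one boxplot per level of the factor
plot(Petal.Length ~ Species, data = iris)

# The class of the right-hand variable is what decides the type of plot.
rhs_classes <- sapply(iris[c("Petal.Length", "Species")], class)
rhs_classes
```

For the mussels data the equivalent calls would be plot(BTVolume ~ Lshell, data = d) and plot(Lshell ~ Site, data = d).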
These are both useful for rapid visualisation of the basic structure of the data. Exploratory analysis of a new data set helps you decide on the next steps needed to answer questions from the data and to present the data to the target audience. Usually no one will see these simple figures except you. However, this course is different, as you are learning about the methods and how to apply them. So when you carry out exploratory data analysis, ensure that you embed all the steps in R as code chunks and annotate them. Write some text explaining what you have found out from the exploratory analysis.
The R commands required for basic exploratory analysis are few in number and very easy to remember.
You can go a long way towards exploring a typical, simple data set using just these commands.
Small data tables such as this can be embedded within a data report itself and explored quickly when the report is knitted. The dt command in the aqm package is all that is needed.
dt(d)
Not the most polished Zoom recording (I forgot where the ~ sign is to be found on a Mac keyboard … I don’t usually use a Mac). Also, I discovered after I had recorded it that the knitted documents are not visible, because the screen being recorded is in another window. Live and learn. I do need to go back and re-record this one.
However, I hope it helps a bit as it is.