Introduction

So far on the course you’ve seen how some more advanced R code can be used to carry out quite complex GIS work through reproducible and reusable code. You have also looked at the fundamental basis of inferential statistics and applied this to simple study design through power analysis. One-way ANOVA, which takes as input one categorical variable (a factor) and one numerical variable, is a common technique for looking at differences between group means, and you’ve seen it in action. In the class that follows this one we will look at regression as a technique for understanding the relationship between two numerical variables. Before going on to this, it is worth looking again at some steps you can take when you load a data frame from a CSV file into R in order to understand your data.

Reading the data in

Let’s reload the simple mussels data set. We can get this from an online source. You should change this line to read in the Arne data set that you have for the assignment.

d <- read.csv("https://tinyurl.com/aqm-data/mussels.csv")

Finding out about the data using str

The usual first step after loading any data frame is to ask R about its structure. None of these exploratory steps really needs to be included in a finished markdown document, so you can place them in temporary code chunks that are later removed if you want. However, you should document the meaning of all the variables at this point.

str(d)
## 'data.frame':    113 obs. of  3 variables:
##  $ Lshell  : num  122.1 100.1 100.7 102.3 94.9 ...
##  $ BTVolume: int  39 21 23 22 20 22 21 18 21 15 ...
##  $ Site    : Factor w/ 6 levels "Site_1","Site_2",..: 6 6 6 6 6 6 6 6 6 6 ...

So we have Site, which is a factor with six levels. Lshell represents the length of mussel shells in mm, measured to a precision of 1/10 of a mm (this may be rather too precise, but the data I obtained were in this form). BTVolume is the body volume of the mussel in ml.

Exercise for the assignment

Start a new document with the code chunks as shown above, but replace the line that reads in the mussels data with a line that reads in the Arne data. Check that all your loaded variables are in the expected form and write some text explaining the meaning of all your variables, how they were measured and their units.

Making your data available for others

Providing your data frame is not too large, it is possible, and convenient, to include all your data in a compiled HTML document (i.e. a web page) produced with R Markdown. This is useful for those who still like to use spreadsheets for data analysis, or who may wish to export the data into other software. If you are passing your work to another R user, then the original markdown file together with the data in CSV format is the best option.

The dt function used below is a little wrapper included in the first code chunk of this document to simplify the input. You will need to include that chunk in order to use the function.
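The wrapper itself is defined in the setup chunk rather than shown here. Purely as an illustration, a minimal sketch of what such a wrapper might look like, assuming it is built on the DT package (whose export buttons match the behaviour described below), is:

library(DT)

# A hypothetical reconstruction of the dt() wrapper: an interactive,
# sortable, filterable table with copy, CSV and Excel export buttons
dt <- function(d) {
  DT::datatable(d,
                extensions = "Buttons",
                options = list(dom = "Bfrtip",
                               buttons = c("copy", "csv", "excel")))
}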

This produces a searchable, spreadsheet-like data table that is embedded in your document. Clicking on the column headers sorts the data. You can also apply various filters to subset the data, which can then be exported to the clipboard, as a CSV file or as a native Excel file. Take some time to play with this feature.

dt(d)

Exercise for the assignment

Add the code chunk above to your document and use it to explore your data visually. Try copying subsets of the data to an Excel spreadsheet.

Looking at the distribution of single variables

Before conducting any full analysis it is a good idea to understand the overall distribution of all your numerical variables. This takes the form of an “informal” exploration of the data and would not be included in the final report. However, you may want to include a table summarising these properties for each group or factor level in a report. We will see how to do this quickly later in this document.

To get a quick summary of the key properties of a variable, type summary in a code chunk and place the variable name within the brackets, with the name of the data frame first, e.g. summary(d$Lshell).

summary(d$Lshell)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.9    97.0   106.9   106.8   118.7   132.6

This shows the minimum value, the first quartile, the median, the mean, the third quartile and the maximum. These quantities more or less coincide with those shown in a boxplot. For purely exploratory analysis, base graphics in R can be used to quickly visualise this as a boxplot.

boxplot(d$Lshell)

Likewise a very quick histogram with default settings can be made using base graphics with a single command.

hist(d$Lshell)

Some people still like the rather old-fashioned stem-and-leaf plot, which was once widely used to summarise data tables but has largely fallen out of use.

stem(d$Lshell)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    6 | 2
##    6 | 9
##    7 | 4
##    7 | 789
##    8 | 244
##    8 | 689
##    9 | 0011224444
##    9 | 555667777
##   10 | 000000011122222233344
##   10 | 566677899
##   11 | 001223344444
##   11 | 56667777889999
##   12 | 00111112233444
##   12 | 5666888
##   13 | 01223

A rather nicer histogram can be made using ggplot2. You can alter the bin width to vary the results.

library(ggplot2) # if not already loaded in the setup chunk
h0 <- ggplot(d, aes(x = Lshell))
h0 + geom_histogram(color = "grey", binwidth = 5)

Skewed data

As the mussel length data are morphological measurements they tend to be approximately normally distributed. In other words they form a fairly symmetrical histogram and there are few, if any, points beyond the whiskers of the boxplot. A very different pattern often arises with ecological data. Such data are very often right skewed, sometimes dramatically so. Right skew arises for many reasons and is not unexpected. Population sizes are usually right skewed. Areas of fragments of forest are right skewed. Concentrations of minerals in soil tend to be right skewed. The underlying pattern in all these examples can be represented by a log-normal, rather than a normal, distribution.

# Simulate 100 draws from a log-normal distribution (meanlog = 1, sdlog = 1)
a_right_skewed_variable <- rlnorm(100, 1, 1)
hist(a_right_skewed_variable)

boxplot(a_right_skewed_variable)

Notice how the histogram has a tendency towards forming a J shape, and the boxplot has many points above the top whisker. This is a clear sign of right skew. In this case the median will fall below the mean.

mean(a_right_skewed_variable)
## [1] 4.482358
median(a_right_skewed_variable)
## [1] 3.048601
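Because the underlying distribution here is log-normal, taking logs should make the histogram look roughly symmetrical. This is a quick check worth trying on any right-skewed variable.

# Log-transforming log-normal data gives an approximately
# symmetrical, normal-looking histogram
hist(log(a_right_skewed_variable))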

Other distributions

When we are looking at data taken from a spatial pattern there is no real reason for the distribution to follow any mathematically defined pattern. Instead it will be idiosyncratic, representing the relative frequency of observations on the ground within the particular study site. If such data are used as independent covariates in an analysis this may, or may not, be a problem when applying standard statistical techniques such as regression. We will look at this in more detail later, along with considering why the stochastic component of dependent variables may follow non-normal distributions.

Exercise for the assignment

Make histograms and boxplots of all the numerical variables in your data set. Interpret the results. Which variables show approximately normal distributions? Which are right skewed?
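One possible way to work through all the numerical columns at once, assuming your data frame is called d as in the examples above, is a simple loop:

# Find the numeric columns and draw a histogram and boxplot for each
num_vars <- names(d)[sapply(d, is.numeric)]
for (v in num_vars) {
  hist(d[[v]], main = v, xlab = v)
  boxplot(d[[v]], main = v)
}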

Summarising by group

Those who have used Excel may find the long data frame format awkward to work with. This is because you will have been used to summarising data columns rather than summarising groups. It is possible to do both in Excel, but many people tend to place all the calculations “on the bottom line” of a spreadsheet. There are, however, many problems with this when analysing large and complex data sets. Apart from the problem of having to reformat “wide” data for R (which can be solved using R code, but makes unnecessary work), the difficulty with placing data summaries below columns is that there is no simple way to move them up a level and use them as input to further analysis. Think of the example of weather data with daily records of rainfall, temperature and evaporation. We may want to turn them into weekly, monthly or yearly means. This is easy under a grouped data format, but complex otherwise.

There are many different ways to summarise data by group in R, so if you look on the internet you will see different authors taking different approaches. The conventional modern way is to use the dplyr package. Code using dplyr often uses the %>% “pipe” operator to link commands.
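As a quick illustration of the weather example mentioned above, here is a minimal sketch using some made-up daily records (the weather data frame and its column names are invented for illustration; they are not part of the course data):

library(dplyr)

# Made-up daily rainfall records spanning two months
weather <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 60),
  rainfall = runif(60, 0, 10))

# Grouping by month moves the summary "up a level", so the monthly
# means can feed directly into further analysis
weather %>%
  group_by(month = format(date, "%Y-%m")) %>%
  summarise(mean_rain = mean(rainfall))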

Two steps are needed to produce a grouped data summary. First define the groups by selecting a categorical variable (or several categorical variables). Then tell R which statistics you want to use in the summary.

Grouping shell lengths by site

d %>% group_by(Site) %>%
  summarise(
    n = n(),
    mean = mean(Lshell),
    median = median(Lshell),
    sd = sd(Lshell),
    se = sd / sqrt(n), # standard error of the mean from sd and n
    max = max(Lshell),
    min = min(Lshell)) -> gd

This code leads to a new, neatly formatted, data table with the data summarised using the statistics we asked for. Notice that the standard error of the mean was derived from the standard deviation and n (the sample size for each group).

dt(gd)

If you have NA (not available) values in the data this won’t work as expected, as R returns NA when most summary functions are applied to a vector containing an NA value. So, for example, if we set one of the shell lengths to NA (in site 6) we will get NA values unless we specifically tell R to remove the NAs.
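The underlying behaviour is easy to demonstrate on a small vector before looking at the grouped example below:

# NA propagates through summary functions unless removed explicitly
mean(c(1, 2, NA)) # returns NA
mean(c(1, 2, NA), na.rm = TRUE) # returns 1.5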

d$Lshell[1] <- NA # introduce a missing value in the first row, which belongs to Site_6


d %>% group_by(Site) %>%
  summarise(
    n = n(),
    mean = mean(Lshell, na.rm = TRUE), # NA removed explicitly
    median = median(Lshell, na.rm = TRUE),
    sd = sd(Lshell), # no na.rm here, so Site_6 produces NA
    se = sd / sqrt(n),
    max = max(Lshell),
    min = min(Lshell)) -> gd_na
gd_na
## # A tibble: 6 x 8
##     Site     n     mean median        sd       se    max   min
##   <fctr> <int>    <dbl>  <dbl>     <dbl>    <dbl>  <dbl> <dbl>
## 1 Site_1    26 100.7692 100.70 12.396153 2.431086 125.90  76.6
## 2 Site_2    25 109.2360 113.80 14.003240 2.800648 130.60  77.5
## 3 Site_3     8 106.8063 114.50 24.061089 8.506880 132.55  61.9
## 4 Site_4     8  97.1500  98.15 19.327035 6.833139 127.70  69.3
## 5 Site_5    21 119.4667 120.50  8.816708 1.923963 132.20 101.1
## 6 Site_6    25 102.4542 100.90        NA       NA     NA    NA

So if your data have NAs, do be aware of this. You will need to add the appropriate code to remove them. The best way to deal with missing values is to avoid having them in the first place!
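One possible approach, sketched here with dplyr, is to drop the incomplete rows before summarising. Note that this also reduces n for the affected group, which is usually what you want when calculating standard errors.

# Remove rows where Lshell is missing before grouping
d %>%
  filter(!is.na(Lshell)) %>%
  group_by(Site) %>%
  summarise(n = n(), mean = mean(Lshell))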

Rounding

The summary table shows the calculated means, medians and standard deviations with more precision than is necessary. The data have been measured to one decimal place. We can round the quantities to two decimal places, since a general rule is that calculated quantities can be shown to one decimal place more than the measurements they are based on.

gd$mean <- round(gd$mean, 2)
gd$sd <- round(gd$sd, 2)
gd$se <- round(gd$se, 2)
dt(gd)

There are other, slightly more complex, rules that can be applied to numbers in scientific notation in order to obtain the right number of significant figures.
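If you want to round to significant figures rather than decimal places, R’s built-in signif function does this directly:

# signif() keeps a fixed number of significant figures
signif(123.456, 2) # 120
signif(0.0012345, 2) # 0.0012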

Grouping by continuous variables

Although most of the time you will want to analyse a continuous numerical variable using the original values, there is sometimes a need to turn a continuous variable into a grouping factor. For example, in the Arne data it might occur to you to group some variable such as slope into a categorical variable with levels representing flat areas and sloping areas. Just as an example of how this can be done, let’s group the mussels into large and small.

# Find the median body volume to use as the cut point
median(d$BTVolume)
## [1] 28
# cut() forms two intervals: (0,28] and (28,100]
d$size <- cut(d$BTVolume, c(0, 28, 100))
levels(d$size)
## [1] "(0,28]"   "(28,100]"
# Replace the interval labels with more readable names
levels(d$size) <- c("Small", "Large")
levels(d$size)
## [1] "Small" "Large"

You might want to use a technique such as this with the assignment data if you have a clear reason to split a continuum into categories.

Exploring the relationship between two numerical variables

In the next class we will look in detail at the assumptions underlying regression models. The idea that two numerical variables are related in some respect to one another is the basis of many forms of statistical modelling. In essence the idea is simple enough: we are interested in finding out whether an increase in x leads to an increase or decrease in y. However, there are a range of subtleties involved in looking at this, which leads to a wide range of techniques that can be applied.

The first step that should be taken before any model building is undertaken is to look at the data and explore any patterns. In the mussel data there are two variables that a priori must be related to each other: shell length and body volume. So testing the hypothesis that the two are related makes little sense. In this particular case it makes more sense to try to find some calibrated relationship between the two variables.

We can plot the variables out using ggplot. First set the aesthetics for x and y. Then add the point geometry.

s0 <- ggplot(d, aes(x = Lshell, y = BTVolume))
s1 <- s0 + geom_point()
s1

To look at the basic shape of the relationship we can add a smoothed line.

s1+geom_smooth()
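By default geom_smooth fits a flexible smoother (loess for small data sets such as this one). If you want to preview a straight-line fit instead, the method can be specified explicitly:

# Fit and display a simple linear trend rather than the default smoother
s1 + geom_smooth(method = "lm")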

Exercise for the assignment

Look at the visual relationships between a range of variables collected at Arne. We will formalise the analysis in the next few classes. Don’t be concerned if the relationships are not as clear as in this example. Sometimes not finding a pattern is just as interesting from a scientific point of view as finding one, although it can be more challenging to write up such results formally.