Introduction

This is a small data set that is provided for practice. The code chunks are hidden in this document. They can be revealed by clicking on the code tab.

In order to practice using R try to replicate the analyses and extend the analysis to include other elements in the data frame. You can copy the necessary code from this docuement.

Take great care to ensure that you run all the code in the correct order!

All the mistakes that arise when running correctly typed R code are due to errors in the order. Error messages such as “Error: object ‘x’ not found” indicate that data has not been loaded or a model has not yet been fitted to the data. This can arise if you are simply copying and pasting code without thinking about what parts of the data the code is working on and what the code should produce. Be very careful.

The code has also been folded in this document into tabbed sections. This has been done to make it more difficult to simply cut and paste code without thought.

Loading libraries

library(tidyverse)
library(ggplot2)
library(plotly)
library(aqm)
theme_set(theme_bw())

Loading the data

IN order to practtice running analyses within a project you can add the file to the folder using the aqm package.

aqm::add_file("whitby.csv")
## When the file is available in the same folder this will work
d<-read.csv("whitby.csv")

Examples

Quick diagnostics

Structure

str(d)
## 'data.frame':    52 obs. of  9 variables:
##  $ year               : int  1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
##  $ max.temp           : num  11.7 11.2 10.8 11.9 11.2 ...
##  $ min.temp           : num  5.62 5.28 5 5.78 5.38 ...
##  $ days.of.air.frost  : num  4.5 3.67 4.5 3.08 3.33 ...
##  $ rainfall           : num  38.3 37.4 43.7 29.5 47.8 ...
##  $ hours.sun          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Setting            : Factor w/ 3 levels "Rural","Semi-rural",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bee.diversity      : num  7 6.25 5.83 5.92 5.42 ...
##  $ Butterfly.diversity: num  8.75 9.42 8.42 10.33 9.42 ...

Summary

summary(d)
##       year         max.temp        min.temp     days.of.air.frost
##  Min.   :1961   Min.   :10.76   Min.   :5.000   Min.   :1.167    
##  1st Qu.:1974   1st Qu.:11.49   1st Qu.:5.708   1st Qu.:1.854    
##  Median :1986   Median :12.01   Median :6.025   Median :2.708    
##  Mean   :1986   Mean   :12.28   Mean   :6.129   Mean   :2.702    
##  3rd Qu.:1999   3rd Qu.:13.10   3rd Qu.:6.722   3rd Qu.:3.542    
##  Max.   :2012   Max.   :14.16   Max.   :7.150   Max.   :5.167    
##                 NA's   :9       NA's   :7       NA's   :6        
##     rainfall       hours.sun           Setting   Bee.diversity  
##  Min.   :29.52   Min.   :108.6   Rural     :15   Min.   :1.917  
##  1st Qu.:44.13   1st Qu.:119.8   Semi-rural:12   1st Qu.:2.417  
##  Median :47.99   Median :130.4   Urban     :25   Median :3.583  
##  Mean   :48.94   Mean   :129.9                   Mean   :3.769  
##  3rd Qu.:54.72   3rd Qu.:136.9                   3rd Qu.:4.979  
##  Max.   :69.90   Max.   :158.7                   Max.   :7.083  
##  NA's   :10      NA's   :18                                     
##  Butterfly.diversity
##  Min.   : 2.500     
##  1st Qu.: 3.667     
##  Median : 4.583     
##  Mean   : 5.812     
##  3rd Qu.: 7.979     
##  Max.   :11.167     
## 

Histogram

hist(d$Bee.diversity)

Boxplot

boxplot(d$Bee.diversity)

Conditional boxplot

plot(d$Bee.diversity~d$Setting)

Quick scatterplot

plot(d$Bee.diversity~d$Butterfly.diversity)

And so on. You can try all the simple examples on the data informally as a way of quickly getting a feel for the patterns.

Regression example

Scatterplot using ggplot to add line with confidence intervals

g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity)) 

g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3

g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity, col=Setting)) 

g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3

Fiting and summarising the model

mod<-lm(data=d, Butterfly.diversity~Bee.diversity)
summary(mod)
## 
## Call:
## lm(formula = Butterfly.diversity ~ Bee.diversity, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2001 -0.6716 -0.1235  0.5945  4.4315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.1236     0.4221  -0.293    0.771    
## Bee.diversity   1.5751     0.1038  15.175   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.143 on 50 degrees of freedom
## Multiple R-squared:  0.8216, Adjusted R-squared:  0.818 
## F-statistic: 230.3 on 1 and 50 DF,  p-value: < 2.2e-16

Anova example

Boxplot

g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) + geom_boxplot()  
g1

Confidence interval plot

g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) +stat_summary(fun.y=base::mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g1

One way anova

Heterogenity of variance is appraent, but there is no point in making any correction as the p-value is so vanishingly small. There is no possibility of type one error here, as the null hypothesis itself is not credible.

mod<-aov(data=d,Butterfly.diversity~Setting)
summary(mod)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## Setting      2  331.1  165.57   231.7 <2e-16 ***
## Residuals   49   35.0    0.71                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise comparisons

The important part to note are the confidence intervals for the differences. The p-values are vanishingly small and so of no “signficance”.

TukeyHSD(mod)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Butterfly.diversity ~ Setting, data = d)
## 
## $Setting
##                       diff       lwr       upr p adj
## Semi-rural-Rural -3.486111 -4.277470 -2.694752     0
## Urban-Rural      -5.938889 -6.606221 -5.271557     0
## Urban-Semi-rural -2.452778 -3.170354 -1.735202     0

Some alternative coding

mod<-lm(data=d, Bee.diversity~ Setting)
anova(mod)
## Analysis of Variance Table
## 
## Response: Bee.diversity
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Setting    2 106.921  53.460  182.74 < 2.2e-16 ***
## Residuals 49  14.335   0.293                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If heterogenity of variance were a problem (it isn’t here) you could use this code for the overall Anova. Making corrections for heterogenity of variance helps to convince referees and tutors that you have been rigorous in your use of statistics. It rarely has much influence on your findings. If unbalanced samples and extreme heterogenity of variance is a real problem then more advanced methods based on Bayesian inference are necessary. Those fall well outside the scope of this unit.

library(sandwich)
library(car)
Anova(mod,white.adjust='hc3')
## Analysis of Deviance Table (Type II tests)
## 
## Response: Bee.diversity
##           Df      F    Pr(>F)    
## Setting    2 167.32 < 2.2e-16 ***
## Residuals 49                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp)  ## Additional library ... usually move to top of script
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
par(mar=c(3,8,3,3)) ## This line widens the margins so the pairs can be printed properly
plot(glht(mod, linfct = mcp(Setting = "Tukey")))

TukeyHSD  (aov(d$Bee.diversity~d$Setting))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = d$Bee.diversity ~ d$Setting)
## 
## $`d$Setting`
##                       diff       lwr       upr p adj
## Semi-rural-Rural -1.618056 -2.124358 -1.111753     0
## Urban-Rural      -3.344444 -3.771396 -2.917493     0
## Urban-Semi-rural -1.726389 -2.185486 -1.267292     0

Time series example

Fitting a linear regression model

You can fit a linear regression to the time series, but it is apparent that the trend is not uniform over time.

ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth(method="lm")-> g1
g1

Fitting a spline

ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth()-> g1
g1

Pivot longer example

Bee diversity and butterfly diversity can be shown on the same plot. To do this the data frame must be lengthened. The code chooses the coolumns that will be used for the values (in this case columns 8 and 9) and then plots the values with coloured groups for the names (column headings).

pivot_longer(d,cols=c(8,9)) %>%
ggplot(aes(x=year, y=value, col=name)) + geom_line() + geom_point() + geom_smooth() -> g1
g1

Facet wrapping

A similar process to that used to plot both bee and butterfly diversity on a single plot can be extended to produce facetted plots. Again the columns are chosen and used as the values, with the names of the columns now being used for the facet titles.

pivot_longer(d,cols=c(2,3,4,5,8,9)) %>%
ggplot(aes(x=year, y=value)) + geom_line() + geom_point() +  geom_smooth()+ facet_wrap(~name,scales="free")-> g1
g1

Extracting residuals

Temperature shows a curvilinear response over time

library(mgcv)

ggplot(d,aes(x=year,y=max.temp)) + geom_point() + geom_smooth(method="gam",formula = y~s(x))

mod<-gam(data=d,max.temp~s(year))
## Unfortunately there are missing data in the time series. This means that a bit of tricky code is needed to add the residuals to the data frame

d$max.temp.dif[!is.na(d$max.temp)] <- residuals(mod)
ggplot(d,aes(x=year,y=max.temp.dif)) + geom_point() + geom_line() + geom_hline(yintercept=0,colour="red")

The data

aqm::dt(d)

Conditional analysis

ggplot(d,aes(x=year,y=Bee.diversity)) + geom_line() + facet_wrap(~Setting)