Examples

Quick diagnostics

Structure

str(d)

## 'data.frame':    52 obs. of  9 variables:
##  $ year               : int  1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
##  $ max.temp           : num  11.7 11.2 10.8 11.9 11.2 ...
##  $ min.temp           : num  5.62 5.28 5 5.78 5.38 ...
##  $ days.of.air.frost  : num  4.5 3.67 4.5 3.08 3.33 ...
##  $ rainfall           : num  38.3 37.4 43.7 29.5 47.8 ...
##  $ hours.sun          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Setting            : Factor w/ 3 levels "Rural","Semi-rural",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bee.diversity      : num  7 6.25 5.83 5.92 5.42 ...
##  $ Butterfly.diversity: num  8.75 9.42 8.42 10.33 9.42 ...

Summary

summary(d)

##       year         max.temp        min.temp     days.of.air.frost
##  Min.   :1961   Min.   :10.76   Min.   :5.000   Min.   :1.167    
##  1st Qu.:1974   1st Qu.:11.49   1st Qu.:5.708   1st Qu.:1.854    
##  Median :1986   Median :12.01   Median :6.025   Median :2.708    
##  Mean   :1986   Mean   :12.28   Mean   :6.129   Mean   :2.702    
##  3rd Qu.:1999   3rd Qu.:13.10   3rd Qu.:6.722   3rd Qu.:3.542    
##  Max.   :2012   Max.   :14.16   Max.   :7.150   Max.   :5.167    
##                 NA's   :9       NA's   :7       NA's   :6        
##     rainfall       hours.sun           Setting   Bee.diversity  
##  Min.   :29.52   Min.   :108.6   Rural     :15   Min.   :1.917  
##  1st Qu.:44.13   1st Qu.:119.8   Semi-rural:12   1st Qu.:2.417  
##  Median :47.99   Median :130.4   Urban     :25   Median :3.583  
##  Mean   :48.94   Mean   :129.9                   Mean   :3.769  
##  3rd Qu.:54.72   3rd Qu.:136.9                   3rd Qu.:4.979  
##  Max.   :69.90   Max.   :158.7                   Max.   :7.083  
##  NA's   :10      NA's   :18                                     
##  Butterfly.diversity
##  Min.   : 2.500     
##  1st Qu.: 3.667     
##  Median : 4.583     
##  Mean   : 5.812     
##  3rd Qu.: 7.979     
##  Max.   :11.167     
##

Histogram

hist(d$Bee.diversity)

Boxplot

boxplot(d$Bee.diversity)

Conditional boxplot

plot(d$Bee.diversity~d$Setting)

Quick scatterplot

plot(d$Bee.diversity~d$Butterfly.diversity)

And so on. You can try all the simple examples on the data informally as a way of quickly getting a feel for the patterns.

Regression example

Scatterplot using ggplot to add line with confidence intervals

g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity)) 

g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3

g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity, col=Setting)) 

g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3

Fiting and summarising the model

mod<-lm(data=d, Butterfly.diversity~Bee.diversity)
summary(mod)

## 
## Call:
## lm(formula = Butterfly.diversity ~ Bee.diversity, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2001 -0.6716 -0.1235  0.5945  4.4315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.1236     0.4221  -0.293    0.771    
## Bee.diversity   1.5751     0.1038  15.175   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.143 on 50 degrees of freedom
## Multiple R-squared:  0.8216, Adjusted R-squared:  0.818 
## F-statistic: 230.3 on 1 and 50 DF,  p-value: < 2.2e-16

Anova example

Boxplot

g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) + geom_boxplot()  
g1

Confidence interval plot

g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) +stat_summary(fun.y=base::mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g1

One way anova

Heterogenity of variance is appraent, but there is no point in making any correction as the p-value is so vanishingly small. There is no possibility of type one error here, as the null hypothesis itself is not credible.

mod<-aov(data=d,Butterfly.diversity~Setting)
summary(mod)

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Setting      2  331.1  165.57   231.7 <2e-16 ***
## Residuals   49   35.0    0.71                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise comparisons

The important part to note are the confidence intervals for the differences. The p-values are vanishingly small and so of no “signficance”.

TukeyHSD(mod)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Butterfly.diversity ~ Setting, data = d)
## 
## $Setting
##                       diff       lwr       upr p adj
## Semi-rural-Rural -3.486111 -4.277470 -2.694752     0
## Urban-Rural      -5.938889 -6.606221 -5.271557     0
## Urban-Semi-rural -2.452778 -3.170354 -1.735202     0

Some alternative coding

mod<-lm(data=d, Bee.diversity~ Setting)
anova(mod)

## Analysis of Variance Table
## 
## Response: Bee.diversity
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Setting    2 106.921  53.460  182.74 < 2.2e-16 ***
## Residuals 49  14.335   0.293                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If heterogenity of variance were a problem (it isn’t here) you could use this code for the overall Anova. Making corrections for heterogenity of variance helps to convince referees and tutors that you have been rigorous in your use of statistics. It rarely has much influence on your findings. If unbalanced samples and extreme heterogenity of variance is a real problem then more advanced methods based on Bayesian inference are necessary. Those fall well outside the scope of this unit.

library(sandwich)
library(car)
Anova(mod,white.adjust='hc3')

## Analysis of Deviance Table (Type II tests)
## 
## Response: Bee.diversity
##           Df      F    Pr(>F)    
## Setting    2 167.32 < 2.2e-16 ***
## Residuals 49                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(multcomp)  ## Additional library ... usually move to top of script

## Loading required package: mvtnorm

## Loading required package: survival

## Loading required package: TH.data

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:plotly':
## 
##     select

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'TH.data'

## The following object is masked from 'package:MASS':
## 
##     geyser

par(mar=c(3,8,3,3)) ## This line widens the margins so the pairs can be printed properly
plot(glht(mod, linfct = mcp(Setting = "Tukey")))

TukeyHSD  (aov(d$Bee.diversity~d$Setting))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = d$Bee.diversity ~ d$Setting)
## 
## $`d$Setting`
##                       diff       lwr       upr p adj
## Semi-rural-Rural -1.618056 -2.124358 -1.111753     0
## Urban-Rural      -3.344444 -3.771396 -2.917493     0
## Urban-Semi-rural -1.726389 -2.185486 -1.267292     0

Time series example

Fitting a linear regression model

You can fit a linear regression to the time series, but it is apparent that the trend is not uniform over time.

ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth(method="lm")-> g1
g1

Fitting a spline

ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth()-> g1
g1

Pivot longer example

Bee diversity and butterfly diversity can be shown on the same plot. To do this the data frame must be lengthened. The code chooses the coolumns that will be used for the values (in this case columns 8 and 9) and then plots the values with coloured groups for the names (column headings).

pivot_longer(d,cols=c(8,9)) %>%
ggplot(aes(x=year, y=value, col=name)) + geom_line() + geom_point() + geom_smooth() -> g1
g1

Facet wrapping

A similar process to that used to plot both bee and butterfly diversity on a single plot can be extended to produce facetted plots. Again the columns are chosen and used as the values, with the names of the columns now being used for the facet titles.

pivot_longer(d,cols=c(2,3,4,5,8,9)) %>%
ggplot(aes(x=year, y=value)) + geom_line() + geom_point() +  geom_smooth()+ facet_wrap(~name,scales="free")-> g1
g1

Extracting residuals

Temperature shows a curvilinear response over time

library(mgcv)

ggplot(d,aes(x=year,y=max.temp)) + geom_point() + geom_smooth(method="gam",formula = y~s(x))

mod<-gam(data=d,max.temp~s(year))
## Unfortunately there are missing data in the time series. This means that a bit of tricky code is needed to add the residuals to the data frame

d$max.temp.dif[!is.na(d$max.temp)] <- residuals(mod)
ggplot(d,aes(x=year,y=max.temp.dif)) + geom_point() + geom_line() + geom_hline(yintercept=0,colour="red")

The data

aqm::dt(d)

Conditional analysis

ggplot(d,aes(x=year,y=Bee.diversity)) + geom_line() + facet_wrap(~Setting)

Whitby data

Duncan Golicher

10/29/2020

Introduction

Loading libraries

Loading the data

Examples

Quick diagnostics

Structure

Summary

Histogram

Boxplot

Conditional boxplot

Quick scatterplot

Regression example

Scatterplot using ggplot to add line with confidence intervals

Fiting and summarising the model

Anova example

Boxplot

Confidence interval plot

One way anova

Pairwise comparisons

Some alternative coding

Time series example

Fitting a linear regression model

Fitting a spline

Pivot longer example

Facet wrapping

Extracting residuals

The data

Conditional analysis