This is a small data set that is provided for practice. The code chunks are hidden in this document. They can be revealed by clicking on the code tab.
In order to practice using R try to replicate the analyses and extend the analysis to include other elements in the data frame. You can copy the necessary code from this docuement.
Take great care to ensure that you run all the code in the correct order!
All the mistakes that arise when running correctly typed R code are due to errors in the order. Error messages such as “Error: object ‘x’ not found” indicate that data has not been loaded or a model has not yet been fitted to the data. This can arise if you are simply copying and pasting code without thinking about what parts of the data the code is working on and what the code should produce. Be very careful.
The code has also been folded in this document into tabbed sections. This has been done to make it more difficult to simply cut and paste code without thought.
library(tidyverse)
library(ggplot2)
library(plotly)
library(aqm)
theme_set(theme_bw())
IN order to practtice running analyses within a project you can add the file to the folder using the aqm package.
aqm::add_file("whitby.csv")
## When the file is available in the same folder this will work
d<-read.csv("whitby.csv")
str(d)
## 'data.frame': 52 obs. of 9 variables:
## $ year : int 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
## $ max.temp : num 11.7 11.2 10.8 11.9 11.2 ...
## $ min.temp : num 5.62 5.28 5 5.78 5.38 ...
## $ days.of.air.frost : num 4.5 3.67 4.5 3.08 3.33 ...
## $ rainfall : num 38.3 37.4 43.7 29.5 47.8 ...
## $ hours.sun : num NA NA NA NA NA NA NA NA NA NA ...
## $ Setting : Factor w/ 3 levels "Rural","Semi-rural",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bee.diversity : num 7 6.25 5.83 5.92 5.42 ...
## $ Butterfly.diversity: num 8.75 9.42 8.42 10.33 9.42 ...
summary(d)
## year max.temp min.temp days.of.air.frost
## Min. :1961 Min. :10.76 Min. :5.000 Min. :1.167
## 1st Qu.:1974 1st Qu.:11.49 1st Qu.:5.708 1st Qu.:1.854
## Median :1986 Median :12.01 Median :6.025 Median :2.708
## Mean :1986 Mean :12.28 Mean :6.129 Mean :2.702
## 3rd Qu.:1999 3rd Qu.:13.10 3rd Qu.:6.722 3rd Qu.:3.542
## Max. :2012 Max. :14.16 Max. :7.150 Max. :5.167
## NA's :9 NA's :7 NA's :6
## rainfall hours.sun Setting Bee.diversity
## Min. :29.52 Min. :108.6 Rural :15 Min. :1.917
## 1st Qu.:44.13 1st Qu.:119.8 Semi-rural:12 1st Qu.:2.417
## Median :47.99 Median :130.4 Urban :25 Median :3.583
## Mean :48.94 Mean :129.9 Mean :3.769
## 3rd Qu.:54.72 3rd Qu.:136.9 3rd Qu.:4.979
## Max. :69.90 Max. :158.7 Max. :7.083
## NA's :10 NA's :18
## Butterfly.diversity
## Min. : 2.500
## 1st Qu.: 3.667
## Median : 4.583
## Mean : 5.812
## 3rd Qu.: 7.979
## Max. :11.167
##
hist(d$Bee.diversity)
boxplot(d$Bee.diversity)
plot(d$Bee.diversity~d$Setting)
plot(d$Bee.diversity~d$Butterfly.diversity)
And so on. You can try all the simple examples on the data informally as a way of quickly getting a feel for the patterns.
g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity))
g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3
g1<-ggplot(d, aes(x=Bee.diversity, y=Butterfly.diversity, col=Setting))
g2 <- g1 + geom_point()
g3 <- g2 + geom_smooth(method="lm")
g3
mod<-lm(data=d, Butterfly.diversity~Bee.diversity)
summary(mod)
##
## Call:
## lm(formula = Butterfly.diversity ~ Bee.diversity, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2001 -0.6716 -0.1235 0.5945 4.4315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1236 0.4221 -0.293 0.771
## Bee.diversity 1.5751 0.1038 15.175 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.143 on 50 degrees of freedom
## Multiple R-squared: 0.8216, Adjusted R-squared: 0.818
## F-statistic: 230.3 on 1 and 50 DF, p-value: < 2.2e-16
g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) + geom_boxplot()
g1
g1<-ggplot(d, aes(x=Setting, y=Butterfly.diversity)) +stat_summary(fun.y=base::mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g1
Heterogenity of variance is appraent, but there is no point in making any correction as the p-value is so vanishingly small. There is no possibility of type one error here, as the null hypothesis itself is not credible.
mod<-aov(data=d,Butterfly.diversity~Setting)
summary(mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## Setting 2 331.1 165.57 231.7 <2e-16 ***
## Residuals 49 35.0 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The important part to note are the confidence intervals for the differences. The p-values are vanishingly small and so of no “signficance”.
TukeyHSD(mod)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Butterfly.diversity ~ Setting, data = d)
##
## $Setting
## diff lwr upr p adj
## Semi-rural-Rural -3.486111 -4.277470 -2.694752 0
## Urban-Rural -5.938889 -6.606221 -5.271557 0
## Urban-Semi-rural -2.452778 -3.170354 -1.735202 0
mod<-lm(data=d, Bee.diversity~ Setting)
anova(mod)
## Analysis of Variance Table
##
## Response: Bee.diversity
## Df Sum Sq Mean Sq F value Pr(>F)
## Setting 2 106.921 53.460 182.74 < 2.2e-16 ***
## Residuals 49 14.335 0.293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If heterogenity of variance were a problem (it isn’t here) you could use this code for the overall Anova. Making corrections for heterogenity of variance helps to convince referees and tutors that you have been rigorous in your use of statistics. It rarely has much influence on your findings. If unbalanced samples and extreme heterogenity of variance is a real problem then more advanced methods based on Bayesian inference are necessary. Those fall well outside the scope of this unit.
library(sandwich)
library(car)
Anova(mod,white.adjust='hc3')
## Analysis of Deviance Table (Type II tests)
##
## Response: Bee.diversity
## Df F Pr(>F)
## Setting 2 167.32 < 2.2e-16 ***
## Residuals 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multcomp) ## Additional library ... usually move to top of script
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
par(mar=c(3,8,3,3)) ## This line widens the margins so the pairs can be printed properly
plot(glht(mod, linfct = mcp(Setting = "Tukey")))
TukeyHSD (aov(d$Bee.diversity~d$Setting))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = d$Bee.diversity ~ d$Setting)
##
## $`d$Setting`
## diff lwr upr p adj
## Semi-rural-Rural -1.618056 -2.124358 -1.111753 0
## Urban-Rural -3.344444 -3.771396 -2.917493 0
## Urban-Semi-rural -1.726389 -2.185486 -1.267292 0
You can fit a linear regression to the time series, but it is apparent that the trend is not uniform over time.
ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth(method="lm")-> g1
g1
ggplot(d, aes(x=year, Bee.diversity)) + geom_line() + geom_point() + geom_smooth()-> g1
g1
Bee diversity and butterfly diversity can be shown on the same plot. To do this the data frame must be lengthened. The code chooses the coolumns that will be used for the values (in this case columns 8 and 9) and then plots the values with coloured groups for the names (column headings).
pivot_longer(d,cols=c(8,9)) %>%
ggplot(aes(x=year, y=value, col=name)) + geom_line() + geom_point() + geom_smooth() -> g1
g1
A similar process to that used to plot both bee and butterfly diversity on a single plot can be extended to produce facetted plots. Again the columns are chosen and used as the values, with the names of the columns now being used for the facet titles.
pivot_longer(d,cols=c(2,3,4,5,8,9)) %>%
ggplot(aes(x=year, y=value)) + geom_line() + geom_point() + geom_smooth()+ facet_wrap(~name,scales="free")-> g1
g1
Temperature shows a curvilinear response over time
library(mgcv)
ggplot(d,aes(x=year,y=max.temp)) + geom_point() + geom_smooth(method="gam",formula = y~s(x))
mod<-gam(data=d,max.temp~s(year))
## Unfortunately there are missing data in the time series. This means that a bit of tricky code is needed to add the residuals to the data frame
d$max.temp.dif[!is.na(d$max.temp)] <- residuals(mod)
ggplot(d,aes(x=year,y=max.temp.dif)) + geom_point() + geom_line() + geom_hline(yintercept=0,colour="red")
aqm::dt(d)
ggplot(d,aes(x=year,y=Bee.diversity)) + geom_line() + facet_wrap(~Setting)