This document provides all the code chunks that may be useful in the context of the data analysis component of the assignment. The data set used to illustrate is the mussels data, that can be analysed using one way ANOVA and regression in the context of calibrating a relationship.
You should look through ALL the handouts provided on these techniques to understand the underlying theory. This “crib sheet” simply provides access to useful code.
Include this chunk at the top of you analysis to ensure that you have all the packages. It also includes the wrapper to add buttons to a data table if you want to use this. Remember that data tables can only be included in HTML documents.
library(ggplot2)
library(dplyr)
library(mgcv)
library(DT)
theme_set(theme_bw())
dt<-function(x) DT::datatable(x,
filter = "top",
extensions = c('Buttons'), options = list(
dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel'), colReorder = TRUE
))
d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)
Change the name of the variable to match a numerical variable in your own data set. The command removes NAs just in case you have them
summary(d$Lshell,na.rm=TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.9 97.0 106.9 106.8 118.7 132.6
Mean, median, standard deviation and variance.
mean(d$Lshell, na.rm=TRUE)
## [1] 106.835
median(d$Lshell, na.rm=TRUE)
## [1] 106.9
sd(d$Lshell, na.rm=TRUE)
## [1] 14.84384
var(d$Lshell, na.rm=TRUE)
## [1] 220.3397
Useful for your own quick visualisation.
boxplot(d$Lshell)
Useful for your own quick visualisation.
hist(d$Lshell)
This uses ggplot. Change the bin width if you want to use this.
g0<-ggplot(d,aes(x=d$Lshell))
g0+geom_histogram(color="grey",binwidth = 5)
In this data set there are two numerical variables. So we can run a linear regresion.
d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g0+geom_point()
Type the text you want for the x and y axes to replace the variable names
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0+geom_point() + geom_smooth(method="lm")
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")
Change the names of the variables in the first line.
mod<-lm(data= d, BTVolume~Lshell)
summary(mod)
##
## Call:
## lm(formula = BTVolume ~ Lshell, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.828 -2.672 0.147 2.235 17.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.02385 3.33917 -10.79 <2e-16 ***
## Lshell 0.59754 0.03096 19.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.864 on 111 degrees of freedom
## Multiple R-squared: 0.7704, Adjusted R-squared: 0.7684
## F-statistic: 372.5 on 1 and 111 DF, p-value: < 2.2e-16
anova(mod)
## Analysis of Variance Table
##
## Response: BTVolume
## Df Sum Sq Mean Sq F value Pr(>F)
## Lshell 1 8811.4 8811.4 372.49 < 2.2e-16 ***
## Residuals 111 2625.7 23.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(mod)
## 2.5 % 97.5 %
## (Intercept) -42.6406346 -29.4070662
## Lshell 0.5361881 0.6588891
d$residuals<-residuals(mod)
Look at the regression handout to understand these plots.
plot(mod,which=1)
plot(mod,which=2)
plot(mod,which=3)
plot(mod,which=4)
plot(mod,which=5)
Used if all else fails. Not needed with these data, but included for reference.
g0<-ggplot(d,aes(x=rank(Lshell),y=rank(BTVolume)))
g0+geom_point() + geom_smooth(method="lm")
cor.test(d$Lshell,d$BTVolume,method="spearman")
##
## Spearman's rank correlation rho
##
## data: d$Lshell and d$BTVolume
## S = 26143, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8912809
Only use if you suspect that the relationship is not well described by a straight line.
library(mgcv)
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")
In this case the line is the same as the linear model. Get a summary using this code.
mod<-gam(data=d, BTVolume~s(Lshell))
summary(mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## BTVolume ~ s(Lshell)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.8142 0.4557 61.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Lshell) 1.493 1.847 198.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.77 Deviance explained = 77.3%
## GCV = 23.993 Scale est. = 23.463 n = 113
If you do use this model remember that its only needed if you can’t use linear regression. Report the ajusted R squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). You must include the figure in your report, as that is the only way to show the shape of the response.
The purpose of one way anova is
Exploratory plots
g0<-ggplot(d,aes(x=Site,y=Lshell))
g0+geom_boxplot()
g0<-ggplot(d,aes(x=d$Lshell))
g1<-g0+geom_histogram(color="grey",binwidth = 5)
g1+facet_wrap(~Site) +xlab("Text for x label")
g0<-ggplot(d,aes(x=Site,y=Lshell))
g1<-g0+stat_summary(fun.y=mean,geom="point")
g1<-g1 +stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g1 +xlab("Text for x label") + ylab("Text for y label")
mod<-lm(data=d,Lshell~Site)
anova(mod)
## Analysis of Variance Table
##
## Response: Lshell
## Df Sum Sq Mean Sq F value Pr(>F)
## Site 5 5525 1105 6.1732 4.579e-05 ***
## Residuals 107 19153 179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)
##
## Call:
## lm(formula = Lshell ~ Site, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.906 -8.340 1.031 9.231 30.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.769 2.624 38.405 < 2e-16 ***
## SiteSite_2 8.467 3.748 2.259 0.0259 *
## SiteSite_3 6.037 5.409 1.116 0.2669
## SiteSite_4 -3.619 5.409 -0.669 0.5049
## SiteSite_5 18.697 3.925 4.763 6.02e-06 ***
## SiteSite_6 2.471 3.748 0.659 0.5111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.1876
## F-statistic: 6.173 on 5 and 107 DF, p-value: 4.579e-05
slevels<-levels(d$Site)
d$Site<-relevel(d$Site,"Site_5")
mod<-lm(data=d,Lshell~Site)
summary(mod)
##
## Call:
## lm(formula = Lshell ~ Site, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.906 -8.340 1.031 9.231 30.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.467 2.920 40.919 < 2e-16 ***
## SiteSite_1 -18.697 3.925 -4.763 6.02e-06 ***
## SiteSite_2 -10.231 3.960 -2.583 0.011135 *
## SiteSite_3 -12.660 5.559 -2.278 0.024739 *
## SiteSite_4 -22.317 5.559 -4.015 0.000111 ***
## SiteSite_6 -16.227 3.960 -4.097 8.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.1876
## F-statistic: 6.173 on 5 and 107 DF, p-value: 4.579e-05
d$Site <- factor(d$Site, levels=slevels)
slevels<-levels(d$Site)
d$Site <- factor(d$Site, levels=rev(slevels))
mod<-lm(data=d,Lshell~Site)
summary(mod)
##
## Call:
## lm(formula = Lshell ~ Site, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.906 -8.340 1.031 9.231 30.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.240 2.676 38.583 < 2e-16 ***
## SiteSite_5 16.227 3.960 4.097 8.15e-05 ***
## SiteSite_4 -6.090 5.435 -1.121 0.265
## SiteSite_3 3.566 5.435 0.656 0.513
## SiteSite_2 5.996 3.784 1.584 0.116
## SiteSite_1 -2.471 3.748 -0.659 0.511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.1876
## F-statistic: 6.173 on 5 and 107 DF, p-value: 4.579e-05
d$Site <- factor(d$Site, levels=slevels)
Sum contrasts compare the effects to the mean. Notice that the last level is missing due to the way the design matrix is formed. So it can be worth running sum contrasts twice, with the order reversed, to get all the contrasts.
options(contrasts = c("contr.sum", "contr.poly"))
mod<-lm(data=d,Lshell~Site)
summary(mod)
##
## Call:
## lm(formula = Lshell ~ Site, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.906 -8.340 1.031 9.231 30.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.1114 1.4384 73.773 < 2e-16 ***
## Site1 -5.3421 2.5804 -2.070 0.0408 *
## Site2 3.1246 2.6158 1.195 0.2349
## Site3 0.6949 4.1214 0.169 0.8664
## Site4 -8.9614 4.1214 -2.174 0.0319 *
## Site5 13.3553 2.7841 4.797 5.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.1876
## F-statistic: 6.173 on 5 and 107 DF, p-value: 4.579e-05
options(contrasts = c("contr.treatment", "contr.poly"))
d$Site <- factor(d$Site, levels=rev(slevels))
options(contrasts = c("contr.sum", "contr.poly"))
mod<-lm(data=d,Lshell~Site)
summary(mod)
##
## Call:
## lm(formula = Lshell ~ Site, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.906 -8.340 1.031 9.231 30.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.1114 1.4384 73.773 < 2e-16 ***
## Site1 -2.8714 2.6158 -1.098 0.2748
## Site2 13.3553 2.7841 4.797 5.24e-06 ***
## Site3 -8.9614 4.1214 -2.174 0.0319 *
## Site4 0.6949 4.1214 0.169 0.8664
## Site5 3.1246 2.6158 1.195 0.2349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared: 0.2239, Adjusted R-squared: 0.1876
## F-statistic: 6.173 on 5 and 107 DF, p-value: 4.579e-05
options(contrasts = c("contr.treatment", "contr.poly"))
d$Site <- factor(d$Site, levels=slevels)
Use to find where signficant differences lie. This should confirm the pattern shown using the confidence interval plot.
mod<-aov(data=d,Lshell~Site)
TukeyHSD(mod)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Lshell ~ Site, data = d)
##
## $Site
## diff lwr upr p adj
## Site_2-Site_1 8.466769 -2.408905 19.342443 0.2201442
## Site_3-Site_1 6.037019 -9.660664 21.734702 0.8737518
## Site_4-Site_1 -3.619231 -19.316914 12.078452 0.9849444
## Site_5-Site_1 18.697436 7.305950 30.088922 0.0000867
## Site_6-Site_1 2.470769 -8.404905 13.346443 0.9859123
## Site_3-Site_2 -2.429750 -18.201132 13.341632 0.9976925
## Site_4-Site_2 -12.086000 -27.857382 3.685382 0.2355928
## Site_5-Site_2 10.230667 -1.262165 21.723498 0.1103764
## Site_6-Site_2 -5.996000 -16.977781 4.985781 0.6105029
## Site_4-Site_3 -9.656250 -29.069479 9.756979 0.7004668
## Site_5-Site_3 12.660417 -3.470986 28.791819 0.2123990
## Site_6-Site_3 -3.566250 -19.337632 12.205132 0.9862071
## Site_5-Site_4 22.316667 6.185264 38.448069 0.0015143
## Site_6-Site_4 6.090000 -9.681382 21.861382 0.8718474
## Site_6-Site_5 -16.226667 -27.719498 -4.733835 0.0011239
Plot of results of Tukey HSD
plot(TukeyHSD(mod))
This will give you the overall Anova table if there is heterogeneity of variance.
library(sandwich)
library(car)
mod<-lm(Lshell~Site, data=d)
Anova(mod,white.adjust='hc3')
## Analysis of Deviance Table (Type II tests)
##
## Response: Lshell
## Df F Pr(>F)
## Site 5 9.9682 7.541e-08 ***
## Residuals 107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Specialist model. Probably the best for these particular data, but seek guidance. Don’t do this at home kids!
library(rjags)
library(ggmcmc)
rand_mod="
model {
### Likeihood
for (i in 1:N) { ## Loop through observations
mu[i]<-mu_r+Beta[ind[i]] ## Beta is added to an overall mean
y[i] ~ dnorm(mu[i],tau[ind[i]]) ## Set an independent tau for each group agan. A pooled variance model would also work here
}
for (j in 1:p) {
Beta[j]~dnorm(0,tau_r) ## A single tau represents the variance between group # means
tau[j] ~ dgamma(scale, rate)
for (n in 1:(j-1)){
Difbeta[n,j]<-Beta[n]-Beta[j]
}
}
scale ~ dunif(0, 1)
rate ~ dunif(0, 1)
tau_r ~ dgamma(scale,rate)
sigma_r <- 1/sqrt(tau_r)
mu_r ~ dnorm(0,0.00001) ## Prior for the overall mean
}"
data=list(y=d$Lshell,
ind=as.numeric(d$Site),
N=length(d$Lshell),
p=length(levels(d$Site)),
overall_mean=mean(d$Lshell))
model=jags.model(textConnection(rand_mod),data=data)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 113
## Unobserved stochastic nodes: 16
## Total graph size: 288
##
## Initializing model
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("sigma_r","mu_r","Difbeta","Beta"),n.iter=100000,thin=10)
ms <-ggs(output)
mt<-filter(ms,grepl("Beta",Parameter))
ggs_caterpillar(mt) +geom_vline(xintercept = 0,col="red")
summary(output)
##
## Iterations = 2010:102000
## Thinning interval = 10
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Beta[1] -5.25675 4.328 0.04328 0.06730
## Beta[2] 2.11676 4.494 0.04494 0.06939
## Beta[3] 0.05458 6.198 0.06198 0.07012
## Beta[4] -4.98554 6.003 0.06003 0.06776
## Beta[5] 11.65420 4.325 0.04325 0.06983
## Beta[6] -3.16376 4.187 0.04187 0.06764
## Difbeta[1,2] -7.37351 3.638 0.03638 0.03638
## Difbeta[1,3] -5.31132 6.365 0.06365 0.06711
## Difbeta[2,3] 2.06218 6.422 0.06422 0.06833
## Difbeta[1,4] -0.27120 6.075 0.06075 0.06178
## Difbeta[2,4] 7.10230 6.316 0.06316 0.06513
## Difbeta[3,4] 5.04012 7.920 0.07920 0.07920
## Difbeta[1,5] -16.91095 3.305 0.03305 0.03305
## Difbeta[2,5] -9.53744 3.359 0.03359 0.03444
## Difbeta[3,5] -11.59963 6.303 0.06303 0.06829
## Difbeta[4,5] -16.63975 6.272 0.06272 0.06557
## Difbeta[1,6] -2.09298 3.109 0.03109 0.03109
## Difbeta[2,6] 5.28052 3.367 0.03367 0.03367
## Difbeta[3,6] 3.21834 6.244 0.06244 0.06570
## Difbeta[4,6] -1.82178 5.981 0.05981 0.06116
## Difbeta[5,6] 14.81797 2.886 0.02886 0.02886
## mu_r 106.74660 3.951 0.03951 0.06720
## sigma_r 8.14981 3.536 0.03536 0.04245
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Beta[1] -13.901 -7.9295 -5.21739 -2.49602 3.1943
## Beta[2] -6.650 -0.7023 1.98656 4.83593 11.3639
## Beta[3] -12.377 -3.7535 0.07282 3.83567 12.4348
## Beta[4] -17.528 -8.6636 -4.73534 -1.08408 6.5583
## Beta[5] 3.638 8.8879 11.50164 14.23756 20.7918
## Beta[6] -11.492 -5.7566 -3.18225 -0.59175 5.3155
## Difbeta[1,2] -14.629 -9.7662 -7.33497 -4.95793 -0.3770
## Difbeta[1,3] -18.027 -9.4058 -5.27340 -1.17165 7.1391
## Difbeta[2,3] -10.630 -1.9425 1.97274 6.17541 15.0843
## Difbeta[1,4] -12.281 -4.2026 -0.28547 3.60867 12.0450
## Difbeta[2,4] -5.018 2.9301 7.00197 11.15334 19.9484
## Difbeta[3,4] -10.275 -0.1971 4.82235 10.11938 21.0235
## Difbeta[1,5] -23.316 -19.1179 -16.96167 -14.69969 -10.3788
## Difbeta[2,5] -16.249 -11.7504 -9.52503 -7.33269 -2.9418
## Difbeta[3,5] -24.264 -15.6190 -11.45570 -7.50275 0.4537
## Difbeta[4,5] -29.274 -20.7454 -16.58049 -12.43263 -4.4061
## Difbeta[1,6] -8.256 -4.1544 -2.09161 -0.02644 4.0146
## Difbeta[2,6] -1.279 3.0735 5.27675 7.50612 11.9100
## Difbeta[3,6] -9.191 -0.8038 3.23705 7.23649 15.6241
## Difbeta[4,6] -13.865 -5.6851 -1.81686 2.15195 10.0689
## Difbeta[5,6] 8.941 12.8882 14.85010 16.74177 20.4184
## mu_r 98.626 104.4199 106.83103 109.20197 114.3364
## sigma_r 3.713 5.7886 7.39611 9.60855 17.2102