Introduction

This document provides all the code chunks that may be useful in the context of the data analysis component of the assignment. The data set used to illustrate is the mussels data, that can be analysed using one way ANOVA and regression in the context of calibrating a relationship.

You should look through ALL the handouts provided on these techniques to understand the underlying theory. This “crib sheet” simply provides access to useful code.

ALWAYS CHECK THAT THE STEPS HAVE BEEN TAKEN IN THE RIGHT ORDER.
LOOK AT THE DATA YOU HAVE LOADED FIRST
USE YOUR OWN VARIABLE NAMES
PASTE IN CODE CHUNKS CAREFULLY; LEAVING GAPS BETWEEN EACH CHUNK
COMMENT ON ALL THE STEPS

Packages needed

Include this chunk at the top of you analysis to ensure that you have all the packages. It also includes the wrapper to add buttons to a data table if you want to use this. Remember that data tables can only be included in HTML documents.

library(ggplot2)
library(dplyr)
library(mgcv)
library(DT)
theme_set(theme_bw())
dt<-function(x) DT::datatable(x, 
    filter = "top",                         
    extensions = c('Buttons'), options = list(
    dom = 'Blfrtip',
    buttons = c('copy', 'csv', 'excel'), colReorder = TRUE
  ))

Univariate

Data

d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)

Data summaries for individual variables

Change the name of the variable to match a numerical variable in your own data set. The command removes NAs just in case you have them

summary(d$Lshell,na.rm=TRUE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.9    97.0   106.9   106.8   118.7   132.6

Individual statistics for a single variable

Mean, median, standard deviation and variance.

mean(d$Lshell, na.rm=TRUE)

## [1] 106.835

median(d$Lshell, na.rm=TRUE)

## [1] 106.9

sd(d$Lshell, na.rm=TRUE)

## [1] 14.84384

var(d$Lshell, na.rm=TRUE)

## [1] 220.3397

Simple boxplot of one variable

Useful for your own quick visualisation.

boxplot(d$Lshell)

Simple histogram of one variable

Useful for your own quick visualisation.

hist(d$Lshell)

Neater histogram of one variable

This uses ggplot. Change the bin width if you want to use this.

g0<-ggplot(d,aes(x=d$Lshell))
g0+geom_histogram(color="grey",binwidth = 5)

Regression

Data

In this data set there are two numerical variables. So we can run a linear regresion.

d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)

Scatterplot without fitted line

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g0+geom_point()

Scatterplot with fitted line and labels

Type the text you want for the x and y axes to replace the variable names

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0+geom_point() + geom_smooth(method="lm") 
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")

Fitting a model

Change the names of the variables in the first line.

mod<-lm(data= d, BTVolume~Lshell)

Model summary

summary(mod)

## 
## Call:
## lm(formula = BTVolume ~ Lshell, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.828  -2.672   0.147   2.235  17.404 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.02385    3.33917  -10.79   <2e-16 ***
## Lshell        0.59754    0.03096   19.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.864 on 111 degrees of freedom
## Multiple R-squared:  0.7704, Adjusted R-squared:  0.7684 
## F-statistic: 372.5 on 1 and 111 DF,  p-value: < 2.2e-16

Model anova table

anova(mod)

## Analysis of Variance Table
## 
## Response: BTVolume
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Lshell      1 8811.4  8811.4  372.49 < 2.2e-16 ***
## Residuals 111 2625.7    23.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Confidence intervals for the model parameters

confint(mod)

##                   2.5 %      97.5 %
## (Intercept) -42.6406346 -29.4070662
## Lshell        0.5361881   0.6588891

Extracting residuals

d$residuals<-residuals(mod)

Model diagnostics

Look at the regression handout to understand these plots.

plot(mod,which=1)

plot(mod,which=2)

plot(mod,which=3)

plot(mod,which=4)

plot(mod,which=5)

Spearman’s rank correlation

Used if all else fails. Not needed with these data, but included for reference.

g0<-ggplot(d,aes(x=rank(Lshell),y=rank(BTVolume)))
g0+geom_point() + geom_smooth(method="lm")

cor.test(d$Lshell,d$BTVolume,method="spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  d$Lshell and d$BTVolume
## S = 26143, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8912809

Fitting a spline

Only use if you suspect that the relationship is not well described by a straight line.

library(mgcv)

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")

In this case the line is the same as the linear model. Get a summary using this code.

mod<-gam(data=d, BTVolume~s(Lshell))
summary(mod)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## BTVolume ~ s(Lshell)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.8142     0.4557   61.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(Lshell) 1.493  1.847 198.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.77   Deviance explained = 77.3%
## GCV = 23.993  Scale est. = 23.463    n = 113

If you do use this model remember that its only needed if you can’t use linear regression. Report the ajusted R squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). You must include the figure in your report, as that is the only way to show the shape of the response.

One way ANOVA

The purpose of one way anova is

Test whether there is greater variability between groups than within groups
Quantify any differences found between group means

Grouped boxplots

Exploratory plots

g0<-ggplot(d,aes(x=Site,y=Lshell))
g0+geom_boxplot()

Histograms for each factor level

g0<-ggplot(d,aes(x=d$Lshell))
g1<-g0+geom_histogram(color="grey",binwidth = 5)

g1+facet_wrap(~Site) +xlab("Text for x label")

Confidence interval plot

g0<-ggplot(d,aes(x=Site,y=Lshell))
g1<-g0+stat_summary(fun.y=mean,geom="point")
g1<-g1 +stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g1 +xlab("Text for x label") + ylab("Text for y label")

Fitting ANOVA

mod<-lm(data=d,Lshell~Site)

Model anova

anova(mod)

## Analysis of Variance Table
## 
## Response: Lshell
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Site        5   5525    1105  6.1732 4.579e-05 ***
## Residuals 107  19153     179                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model summary

Treatment effects

summary(mod)

## 
## Call:
## lm(formula = Lshell ~ Site, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.906  -8.340   1.031   9.231  30.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  100.769      2.624  38.405  < 2e-16 ***
## SiteSite_2     8.467      3.748   2.259   0.0259 *  
## SiteSite_3     6.037      5.409   1.116   0.2669    
## SiteSite_4    -3.619      5.409  -0.669   0.5049    
## SiteSite_5    18.697      3.925   4.763 6.02e-06 ***
## SiteSite_6     2.471      3.748   0.659   0.5111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.1876 
## F-statistic: 6.173 on 5 and 107 DF,  p-value: 4.579e-05

Change reference level

slevels<-levels(d$Site)
d$Site<-relevel(d$Site,"Site_5")
mod<-lm(data=d,Lshell~Site)
summary(mod)

## 
## Call:
## lm(formula = Lshell ~ Site, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.906  -8.340   1.031   9.231  30.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  119.467      2.920  40.919  < 2e-16 ***
## SiteSite_1   -18.697      3.925  -4.763 6.02e-06 ***
## SiteSite_2   -10.231      3.960  -2.583 0.011135 *  
## SiteSite_3   -12.660      5.559  -2.278 0.024739 *  
## SiteSite_4   -22.317      5.559  -4.015 0.000111 ***
## SiteSite_6   -16.227      3.960  -4.097 8.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.1876 
## F-statistic: 6.173 on 5 and 107 DF,  p-value: 4.579e-05

d$Site <- factor(d$Site, levels=slevels)

Reverse levels

slevels<-levels(d$Site)
d$Site <- factor(d$Site, levels=rev(slevels))
mod<-lm(data=d,Lshell~Site)
summary(mod)

## 
## Call:
## lm(formula = Lshell ~ Site, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.906  -8.340   1.031   9.231  30.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  103.240      2.676  38.583  < 2e-16 ***
## SiteSite_5    16.227      3.960   4.097 8.15e-05 ***
## SiteSite_4    -6.090      5.435  -1.121    0.265    
## SiteSite_3     3.566      5.435   0.656    0.513    
## SiteSite_2     5.996      3.784   1.584    0.116    
## SiteSite_1    -2.471      3.748  -0.659    0.511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.1876 
## F-statistic: 6.173 on 5 and 107 DF,  p-value: 4.579e-05

d$Site <- factor(d$Site, levels=slevels)

Sum contrasts

Sum contrasts compare the effects to the mean. Notice that the last level is missing due to the way the design matrix is formed. So it can be worth running sum contrasts twice, with the order reversed, to get all the contrasts.

options(contrasts = c("contr.sum", "contr.poly"))
mod<-lm(data=d,Lshell~Site)
summary(mod)

## 
## Call:
## lm(formula = Lshell ~ Site, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.906  -8.340   1.031   9.231  30.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.1114     1.4384  73.773  < 2e-16 ***
## Site1        -5.3421     2.5804  -2.070   0.0408 *  
## Site2         3.1246     2.6158   1.195   0.2349    
## Site3         0.6949     4.1214   0.169   0.8664    
## Site4        -8.9614     4.1214  -2.174   0.0319 *  
## Site5        13.3553     2.7841   4.797 5.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.1876 
## F-statistic: 6.173 on 5 and 107 DF,  p-value: 4.579e-05

options(contrasts = c("contr.treatment", "contr.poly"))

Reverse order

d$Site <- factor(d$Site, levels=rev(slevels))
options(contrasts = c("contr.sum", "contr.poly"))
mod<-lm(data=d,Lshell~Site)
summary(mod)

## 
## Call:
## lm(formula = Lshell ~ Site, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.906  -8.340   1.031   9.231  30.550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.1114     1.4384  73.773  < 2e-16 ***
## Site1        -2.8714     2.6158  -1.098   0.2748    
## Site2        13.3553     2.7841   4.797 5.24e-06 ***
## Site3        -8.9614     4.1214  -2.174   0.0319 *  
## Site4         0.6949     4.1214   0.169   0.8664    
## Site5         3.1246     2.6158   1.195   0.2349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.38 on 107 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.1876 
## F-statistic: 6.173 on 5 and 107 DF,  p-value: 4.579e-05

options(contrasts = c("contr.treatment", "contr.poly"))
d$Site <- factor(d$Site, levels=slevels)

Tukey corrected pairwise comparisons

Use to find where signficant differences lie. This should confirm the pattern shown using the confidence interval plot.

mod<-aov(data=d,Lshell~Site)
TukeyHSD(mod)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Lshell ~ Site, data = d)
## 
## $Site
##                     diff        lwr       upr     p adj
## Site_2-Site_1   8.466769  -2.408905 19.342443 0.2201442
## Site_3-Site_1   6.037019  -9.660664 21.734702 0.8737518
## Site_4-Site_1  -3.619231 -19.316914 12.078452 0.9849444
## Site_5-Site_1  18.697436   7.305950 30.088922 0.0000867
## Site_6-Site_1   2.470769  -8.404905 13.346443 0.9859123
## Site_3-Site_2  -2.429750 -18.201132 13.341632 0.9976925
## Site_4-Site_2 -12.086000 -27.857382  3.685382 0.2355928
## Site_5-Site_2  10.230667  -1.262165 21.723498 0.1103764
## Site_6-Site_2  -5.996000 -16.977781  4.985781 0.6105029
## Site_4-Site_3  -9.656250 -29.069479  9.756979 0.7004668
## Site_5-Site_3  12.660417  -3.470986 28.791819 0.2123990
## Site_6-Site_3  -3.566250 -19.337632 12.205132 0.9862071
## Site_5-Site_4  22.316667   6.185264 38.448069 0.0015143
## Site_6-Site_4   6.090000  -9.681382 21.861382 0.8718474
## Site_6-Site_5 -16.226667 -27.719498 -4.733835 0.0011239

Plot of results of Tukey HSD

plot(TukeyHSD(mod))

Anova with White’s correction

This will give you the overall Anova table if there is heterogeneity of variance.

library(sandwich)
library(car)
mod<-lm(Lshell~Site, data=d)
Anova(mod,white.adjust='hc3')

## Analysis of Deviance Table (Type II tests)
## 
## Response: Lshell
##            Df      F    Pr(>F)    
## Site        5 9.9682 7.541e-08 ***
## Residuals 107                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bayesian model with shrinkage

Specialist model. Probably the best for these particular data, but seek guidance. Don’t do this at home kids!

library(rjags)
library(ggmcmc)
rand_mod="
  model {
      ### Likeihood
      for (i in 1:N) {                  ## Loop through observations
        mu[i]<-mu_r+Beta[ind[i]]        ## Beta is added to an overall mean
        y[i] ~ dnorm(mu[i],tau[ind[i]])  ## Set an independent tau for each group agan. A                                              pooled variance model would also work  here
       
      }
    
    for (j in 1:p) {
     Beta[j]~dnorm(0,tau_r)            ## A single tau represents the variance between group                                        #   means
     tau[j] ~ dgamma(scale, rate)
      for (n in 1:(j-1)){
        Difbeta[n,j]<-Beta[n]-Beta[j]
      }
    }
   
    scale ~ dunif(0, 1)
    rate ~ dunif(0, 1)
    tau_r ~ dgamma(scale,rate)
    sigma_r <- 1/sqrt(tau_r)
    mu_r ~ dnorm(0,0.00001)     ## Prior for the overall mean
   
  }"

data=list(y=d$Lshell,
          ind=as.numeric(d$Site),
          N=length(d$Lshell),
          p=length(levels(d$Site)),
          overall_mean=mean(d$Lshell))
model=jags.model(textConnection(rand_mod),data=data)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 113
##    Unobserved stochastic nodes: 16
##    Total graph size: 288
## 
## Initializing model

update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("sigma_r","mu_r","Difbeta","Beta"),n.iter=100000,thin=10)
ms <-ggs(output) 
mt<-filter(ms,grepl("Beta",Parameter))
ggs_caterpillar(mt) +geom_vline(xintercept = 0,col="red")

summary(output)

## 
## Iterations = 2010:102000
## Thinning interval = 10 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                   Mean    SD Naive SE Time-series SE
## Beta[1]       -5.25675 4.328  0.04328        0.06730
## Beta[2]        2.11676 4.494  0.04494        0.06939
## Beta[3]        0.05458 6.198  0.06198        0.07012
## Beta[4]       -4.98554 6.003  0.06003        0.06776
## Beta[5]       11.65420 4.325  0.04325        0.06983
## Beta[6]       -3.16376 4.187  0.04187        0.06764
## Difbeta[1,2]  -7.37351 3.638  0.03638        0.03638
## Difbeta[1,3]  -5.31132 6.365  0.06365        0.06711
## Difbeta[2,3]   2.06218 6.422  0.06422        0.06833
## Difbeta[1,4]  -0.27120 6.075  0.06075        0.06178
## Difbeta[2,4]   7.10230 6.316  0.06316        0.06513
## Difbeta[3,4]   5.04012 7.920  0.07920        0.07920
## Difbeta[1,5] -16.91095 3.305  0.03305        0.03305
## Difbeta[2,5]  -9.53744 3.359  0.03359        0.03444
## Difbeta[3,5] -11.59963 6.303  0.06303        0.06829
## Difbeta[4,5] -16.63975 6.272  0.06272        0.06557
## Difbeta[1,6]  -2.09298 3.109  0.03109        0.03109
## Difbeta[2,6]   5.28052 3.367  0.03367        0.03367
## Difbeta[3,6]   3.21834 6.244  0.06244        0.06570
## Difbeta[4,6]  -1.82178 5.981  0.05981        0.06116
## Difbeta[5,6]  14.81797 2.886  0.02886        0.02886
## mu_r         106.74660 3.951  0.03951        0.06720
## sigma_r        8.14981 3.536  0.03536        0.04245
## 
## 2. Quantiles for each variable:
## 
##                 2.5%      25%       50%       75%    97.5%
## Beta[1]      -13.901  -7.9295  -5.21739  -2.49602   3.1943
## Beta[2]       -6.650  -0.7023   1.98656   4.83593  11.3639
## Beta[3]      -12.377  -3.7535   0.07282   3.83567  12.4348
## Beta[4]      -17.528  -8.6636  -4.73534  -1.08408   6.5583
## Beta[5]        3.638   8.8879  11.50164  14.23756  20.7918
## Beta[6]      -11.492  -5.7566  -3.18225  -0.59175   5.3155
## Difbeta[1,2] -14.629  -9.7662  -7.33497  -4.95793  -0.3770
## Difbeta[1,3] -18.027  -9.4058  -5.27340  -1.17165   7.1391
## Difbeta[2,3] -10.630  -1.9425   1.97274   6.17541  15.0843
## Difbeta[1,4] -12.281  -4.2026  -0.28547   3.60867  12.0450
## Difbeta[2,4]  -5.018   2.9301   7.00197  11.15334  19.9484
## Difbeta[3,4] -10.275  -0.1971   4.82235  10.11938  21.0235
## Difbeta[1,5] -23.316 -19.1179 -16.96167 -14.69969 -10.3788
## Difbeta[2,5] -16.249 -11.7504  -9.52503  -7.33269  -2.9418
## Difbeta[3,5] -24.264 -15.6190 -11.45570  -7.50275   0.4537
## Difbeta[4,5] -29.274 -20.7454 -16.58049 -12.43263  -4.4061
## Difbeta[1,6]  -8.256  -4.1544  -2.09161  -0.02644   4.0146
## Difbeta[2,6]  -1.279   3.0735   5.27675   7.50612  11.9100
## Difbeta[3,6]  -9.191  -0.8038   3.23705   7.23649  15.6241
## Difbeta[4,6] -13.865  -5.6851  -1.81686   2.15195  10.0689
## Difbeta[5,6]   8.941  12.8882  14.85010  16.74177  20.4184
## mu_r          98.626 104.4199 106.83103 109.20197 114.3364
## sigma_r        3.713   5.7886   7.39611   9.60855  17.2102

Crib sheet for regression and ANOVA

Duncan Golicher

12/5/2017

Introduction

Packages needed

Univariate

Data

Data summaries for individual variables

Individual statistics for a single variable

Simple boxplot of one variable

Simple histogram of one variable

Neater histogram of one variable

Regression

Data

Scatterplot without fitted line

Scatterplot with fitted line and labels

Fitting a model

Model summary

Model anova table

Confidence intervals for the model parameters

Extracting residuals

Model diagnostics

Spearman’s rank correlation

Fitting a spline

One way ANOVA

Grouped boxplots

Histograms for each factor level

Confidence interval plot

Fitting ANOVA

Model anova

Model summary

Treatment effects

Change reference level

Reverse levels

Sum contrasts

Reverse order

Tukey corrected pairwise comparisons

Anova with White’s correction

Bayesian model with shrinkage