Chapter 16 Generalised linear models

So far we have assumed throughout that the variability in our models takes an approximately normal form. This is the assumption used in the classical parametric statistical tests and in regression, ANOVA and ANCOVA. Violations of this assumption often lead to the adoption of simple non-parametric tests instead of more informative model-based procedures, due to worries about not meeting the assumptions needed for parametric modelling. However the set of models known as Generalised Linear Models (GLMs) can use a range of distributions for the errors, such as the Poisson, binomial and gamma, rather than only the normal. These are very powerful techniques. They are not much more difficult to apply using R than the methods that you have already seen. However careful thought is required in order to find the correct form for the model.

16.1 Poisson regression

Let’s look at the marine invertebrates data that we saw earlier.

d<-read.csv("https://tinyurl.com/aqm-data/marineinverts.csv")
str(d)
## 'data.frame':    45 obs. of  4 variables:
##  $ richness: int  0 2 8 13 17 10 10 9 19 8 ...
##  $ grain   : num  450 370 192 194 197 ...
##  $ height  : num  2.255 0.865 1.19 -1.336 -1.334 ...
##  $ salinity: num  27.1 27.1 29.6 29.4 29.6 29.4 29.4 29.6 29.6 29.6 ...

Plotting species richness against grain size again.

attach(d)
plot(d$richness~d$grain)

In the previous analysis we looked at how allowing the model to adopt a curved form led to a better fit. However the issue of the inappropriate use of the normal distribution to represent the error term was ignored.

One way of thinking about the situation is to remember that the description of a regression line includes some statement about the errors.

\(y=a+bx+\epsilon\) where \(\epsilon \sim N(0,\sigma^{2})\)

This equation should be able to describe the process that leads to each data point. The model has a deterministic component (the regression line) and a stochastic component (the error term). However when the points are counts a continuous error term is incorrect. Although the mean value (trend) does not have to be an integer value, the actual data values do. So the errors around the trend should be discrete.

The Poisson distribution can represent this. For any value of lambda (which is continuous) the probability distribution of values is discrete. The Poisson distribution also automatically builds in heterogeneity of variance, as the variance of a Poisson distribution is equal to lambda.

par(mfcol=c(2,2))
barplot(dpois(0:5,lambda=0.1),names=0:5,main="Lambda=0.1")
barplot(dpois(0:5,lambda=0.5),names=0:5,main="Lambda=0.5")
barplot(dpois(0:5,lambda=1),names=0:5,main="Lambda=1")
barplot(dpois(0:5,lambda=2),names=0:5,main="Lambda=2")
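
The claim that the variance of a Poisson distribution equals lambda can be checked with a quick simulation (not part of the original analysis): the sample mean and variance of a large number of draws should both be close to the lambda used.

x<-rpois(10000,lambda=2) ## a large sample of Poisson counts with lambda=2
mean(x)                  ## should be close to 2
var(x)                   ## should also be close to 2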

Let’s think of a regression line with Poisson errors, with a=0 and b=1.

\(y \sim Poisson(\lambda)\) where \(\lambda=a+bx\)

Something interesting happens in this case. Lambda is a measure of the central tendency, but for most of the regression line no observations can actually take the value of lambda. A point can only fall on the line when lambda happens to be an integer.

lambda<-seq(0,10,length=200)
plot(lambda,rpois(200,lambda),pch=21,bg=2)
lines(lambda,lambda,lwd=2)
abline(v=0:10,lty=2)

This is the motivation for fitting using maximum likelihood. A point that falls a long way away from the deterministic component of a model contributes more to the model’s deviance than one that is close. A model with a low total deviance has a higher likelihood than one with a high deviance. The probabilities (that contribute to the deviance) are determined from assumptions regarding the form of the stochastic component of the model. The normal distribution is only one way of determining these probabilities. There are many other possible distributions for the error term.
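
As a minimal sketch of this idea, using the simulated line above, dpois() can give each observed count a probability under the lambda predicted by a candidate line; the sum of the log probabilities is the log-likelihood, and a candidate line that sits further from the data scores lower.

lambda<-seq(0,10,length=200)
obs<-rpois(200,lambda)                     ## counts simulated around the true line
sum(dpois(obs,lambda=lambda,log=TRUE))     ## log-likelihood of the true line
sum(dpois(obs,lambda=0.5*lambda,log=TRUE)) ## a line at half the height typically scores much lower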

So let’s fit the model again, this time using Poisson regression. By default this uses a log link function, which is usually appropriate for count data that cannot fall below zero. In this case the logarithmic link function also deals nicely with the problem of curvilinearity of the response.

mod1<-glm(data=d,richness ~ grain,family=poisson)
summary(mod1)
## 
## Call:
## glm(formula = richness ~ grain, family = poisson, data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4828  -1.3897  -0.0732   0.8644   2.5838  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.238393   0.299033  14.174  < 2e-16 ***
## grain       -0.009496   0.001179  -8.052 8.16e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 179.75  on 44  degrees of freedom
## Residual deviance: 105.35  on 43  degrees of freedom
## AIC: 251.35
## 
## Number of Fisher Scoring iterations: 5
confint(mod1)
## Waiting for profiling to be done...
##                   2.5 %       97.5 %
## (Intercept)  3.65451714  4.827458978
## grain       -0.01185141 -0.007224939
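
Because a log link was used, the coefficients above are on the log scale. As an optional sketch of their interpretation, exponentiating converts them into multiplicative effects on the expected richness.

exp(coef(mod1))              ## expected richness at grain=0 (an extrapolation) and the multiplicative change per unit of grain
exp(coef(mod1)["grain"]*100) ## multiplicative change in expected richness over 100 units of grain size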

Plotting the model shows its form. Note that when fitting a GLM in R we can ask for the standard errors and produce approximate confidence intervals using them.

plot(d$richness ~ d$grain)
x<-seq(min(d$grain),max(d$grain),length=100)
a<-predict(mod1,newdata=list(grain=x),type="response",se=T) 
lines(x,a$fit-2*a$se.fit,lty=2)
lines(x,a$fit+2*a$se.fit,lty=2)
lines(x,a$fit)
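
The standard errors requested above are on the response scale. An alternative sketch, often preferred because the resulting interval cannot drop below zero, is to form the interval on the link (log) scale and then back-transform it with exp().

plot(d$richness ~ d$grain)
x<-seq(min(d$grain),max(d$grain),length=100)
a<-predict(mod1,newdata=list(grain=x),se=T) ## predictions are on the link (log) scale by default
lines(x,exp(a$fit-2*a$se.fit),lty=2)        ## approximate lower limit
lines(x,exp(a$fit+2*a$se.fit),lty=2)        ## approximate upper limit
lines(x,exp(a$fit))                         ## back-transformed fitted line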

16.2 GGplot

It’s easy to add a GLM to a ggplot scatterplot. However be careful to pass the family through the method.args argument.

library(ggplot2)
g0<-ggplot(d,aes(x=grain,y=richness))
glm1<-g0+geom_point()+stat_smooth(method="glm",method.args=list( family="poisson"), se=TRUE) +ggtitle("Poisson regression with log link function")
glm1

16.3 Showing the results with logged y

This is not a good approach, as the zeros are lost, but it demonstrates the idea.

plot(d$richness ~d$grain, log="y")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted
## from logarithmic plot
x<-seq(min(grain),max(grain),length=100)
a<-predict(mod1,newdata=list(grain=x),type="response",se=T) 
lines(x,a$fit-2*a$se.fit,lty=2)
lines(x,a$fit+2*a$se.fit,lty=2)
lines(x,a$fit)

16.5 Comparing the results

Here the Poisson regression is first compared graphically with a negative binomial fit, and then with hurdle and zero-inflated models that allow for an excess of zero counts.

library(ggplot2)
library(MASS) ## provides glm.nb, which is used by geom_smooth below
g0<-ggplot(d,aes(x=grain,y=richness))
glm1<-g0+geom_point()+stat_smooth(method="glm",method.args=list( family="poisson"), se=TRUE) 
glm3<-glm1+geom_point()+geom_smooth(method="glm.nb", se=TRUE,color="red") 
glm3
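
One motivation for the hurdle and zero-inflated models fitted below is that count data often contain more zeros than a Poisson model predicts. A rough, informal check (not in the original text) is to compare the observed number of zero counts with the number expected under the fitted Poisson model.

sum(d$richness==0)                ## observed number of zero counts
sum(dpois(0,lambda=fitted(mod1))) ## zeros expected under the Poisson model mod1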

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
modh<-hurdle(d$richness~grain,dist="negbin")
summary(modh)
## 
## Call:
## hurdle(formula = d$richness ~ grain, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5804 -0.8495 -0.1085  0.6236  2.0657 
## 
## Count model coefficients (truncated negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.915924   0.458802   8.535  < 2e-16 ***
## grain       -0.008220   0.001724  -4.767 1.87e-06 ***
## Log(theta)   1.510256   0.510987   2.956  0.00312 ** 
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  7.34139    3.39460   2.163   0.0306 *
## grain       -0.01531    0.01005  -1.524   0.1275  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 4.5279
## Number of iterations in BFGS optimization: 13 
## Log-likelihood: -114.5 on 5 Df
AIC(modh)
## [1] 238.9305
AIC(mod3)
## [1] 235.3695
confint(modh)
##                         2.5 %       97.5 %
## count_(Intercept)  3.01668827  4.815159689
## count_grain       -0.01159973 -0.004840422
## zero_(Intercept)   0.68811013 13.994679574
## zero_grain        -0.03500187  0.004381010
modzi <- zeroinfl(data=d,richness~grain,dist="negbin")
summary(modzi)
## 
## Call:
## zeroinfl(formula = richness ~ grain, data = d, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5740 -0.8408 -0.1085  0.6161  2.0427 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.896314   0.446411   8.728  < 2e-16 ***
## grain       -0.008132   0.001659  -4.902 9.49e-07 ***
## Log(theta)   1.514259   0.514694   2.942  0.00326 ** 
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.37854   11.19100  -0.481    0.631
## grain        0.00447    0.03878   0.115    0.908
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 4.5461 
## Number of iterations in BFGS optimization: 35 
## Log-likelihood: -114.6 on 5 Df
AIC(modh)
## [1] 238.9305
AIC(modzi)
## [1] 239.1852
AIC(mod3)
## [1] 235.3695
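
Assuming all of the models above are still in the workspace, a single call to AIC() returns a small table of degrees of freedom and AIC values that makes the comparison easier to read.

AIC(mod1,modh,modzi) ## Poisson, hurdle and zero-inflated models side by side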

16.6 Models with binomial errors

The most commonly used GLM is probably logistic regression. In this particular model the response can only take values of zero or one. Thus it is clear from the outset that errors cannot be normal. Let’s set up a simple simulated data set to show how this works. Imagine we are interested in mortality of pine trees following a ground fire. We might assume that the population of tree diameters is log-normally distributed with a median of 20 cm.

set.seed(1)
diam<-sort(rlnorm(500,mean=log(20),sd=0.5))
summary(diam)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.445  14.660  19.636  23.018  28.079 134.407
hist(diam,col="grey",breaks=10)

Let’s simulate some response data based on an extremely simple underlying pattern for tree mortality. We might assume that trees with diameters of over 50 cm have bark that has reached a thickness that prevents the tree being killed by the fire. We might also assume a simple linear relationship between diameter and survival probability up to this threshold, and build a simple rule-based vector of the probability that a tree survives the fire as a function of its diameter.

p<-diam/50 ## survival probability increases linearly with diameter
p[p>1]<-1  ## and is capped at 1 for trees with diameters over 50 cm
plot(diam,p,ylab="Survival probability",xlab="Diameter",type="l",lwd=3)

Although we have a very simple underlying deterministic model, we will not see this directly when we collect data. Any individual tree will be either alive or dead. Thus our response will be zeros and ones. This is the problem that logistic regression deals with very neatly without the need to calculate proportions explicitly.

f<-function(x)rbinom(1,1,x)      ## one Bernoulli trial with survival probability x
response<-as.vector(sapply(p,f)) ## simulate survival (1) or death (0) for each tree
head(response)
## [1] 0 0 0 1 0 0
d<-data.frame(diam,response)
plot(diam,response)
lines(diam,p,lwd=3)

The task for the statistical model is to take this input and turn it back into a response model. Generalised linear models do this using a link function. In R it is very easy to specify the model. We simply write a model using the same syntax as for a linear model (one with Gaussian errors) but we state that the family of models we wish to use is binomial.
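
For the binomial family the default link is the logit, so the model fitted here can be written as

\(\log\left(\frac{p}{1-p}\right)=a+b\cdot diam\)

where \(p\) is the probability that a tree survives. Inverting the link gives \(p=\frac{1}{1+e^{-(a+b\cdot diam)}}\), the S-shaped curve seen in the plots below.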

mod1<-glm(response~diam,family="binomial")
summary(mod1)
## 
## Call:
## glm(formula = response ~ diam, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8202  -0.8891  -0.6053   1.0175   2.0428  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.60771    0.28033  -9.302   <2e-16 ***
## diam         0.10869    0.01217   8.929   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 688.91  on 499  degrees of freedom
## Residual deviance: 565.51  on 498  degrees of freedom
## AIC: 569.51
## 
## Number of Fisher Scoring iterations: 5

Using the model for prediction shows that R does find a model that matches the underlying pattern very well. Again we visualise the model in order to understand it. This is always preferable to trying to understand a model from a table of numbers. Visualisation is particularly important for models with parameters expressed on a logit scale, as this scale is not intuitive.

g0 <- ggplot(d,aes(x=diam,y=response))
g1<-g0+geom_point()+stat_smooth(method="glm",method.args=list(family="binomial")) 
g1
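
The coefficients of mod1 are expressed on the logit scale, so a small sketch of back-transforming them by hand may help. plogis() is the inverse logit, and the diameter at which predicted survival reaches 0.5 is the point where the linear predictor equals zero, i.e. minus the intercept divided by the slope (roughly 24 cm given the coefficients above).

plogis(coef(mod1)[1]+coef(mod1)[2]*30) ## predicted survival probability for a 30 cm tree
-coef(mod1)[1]/coef(mod1)[2]           ## diameter at which predicted survival is 0.5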

If we wanted to check whether there was a response shape that differed from that assumed by the generalised linear model we could try a generalised additive model (GAM) with a smoother.

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
g1<-g0+geom_point()+stat_smooth(method="gam",method.args=list(family="binomial"),formula=y~s(x)) 
g1

The curve is very similar. Note that as the smoother uses a form of “local” regression the confidence intervals expand in areas where there is little data.

In some cases the response would take a different form. This could happen if there were some optimum point at which some response occurred, for example the occurrence of a species along an altitudinal gradient or shoreline. In this case the gam model would fit the data better than the linear model. We will look at how this can be tested formally later. A quick test is to calculate the AIC. If this is much lower for the gam it indicates that the gam may be a better fit.

glm_mod<-glm(data=d, response~diam, family=binomial)
gam_mod<-gam(data=d, response~s(diam), family=binomial)

AIC(glm_mod)
## [1] 569.5078
AIC(gam_mod)
## [1] 566.4594

In this case it is very slightly lower, but not enough to suggest the use of a gam.

16.8 Exercises

  1. GLMs can also be used when the explanatory variable is a factor. Here is a very simple data set that consists of counts of ragworm in two types of substrate, classified simply into mud and sand. Analyse the data using both a general linear model and a generalised linear model. Comment on the differences between the two approaches.
d<-read.csv("/home/aqm/course/data/HedisteCounts.csv")
  2. Binomial (presence/absence) model

In some cases the actual numbers of organisms counted can be a poor choice of response variable. If organisms are highly aggregated then presence vs absence is a better choice. Reanalyse the ragworm data, this time using presence as the response.

d$pres<-1*(d$Count>0) ## This sets up a variable consisting of ones and zeros
  3. Leafminers and leaf exposure to light

The number of leaf miners was counted on 200 leaves exposed to different levels of ambient light, measured as a percentage of full exposure.

Analyse these data using an appropriate GLM.

d<-read.csv("/home/aqm/course/data/leafminers.csv")
plot(d)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
g0<-ggplot(d,aes(x=light,y=nminers))
glm1<-g0+geom_point()+stat_smooth(method="glm",method.args=list( family="poisson"), se=TRUE) +ggtitle("Poisson regression with log link function")
ggplotly(glm1)
mod<-glm(data=d,nminers~light,family="poisson")
summary(mod)
## 
## Call:
## glm(formula = nminers ~ light, family = "poisson", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0888  -1.7275  -1.1676   0.0864   5.9691  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.36721    1.09865  -4.885 1.03e-06 ***
## light        0.06610    0.01203   5.496 3.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 751.31  on 199  degrees of freedom
## Residual deviance: 720.43  on 198  degrees of freedom
## AIC: 1025.3
## 
## Number of Fisher Scoring iterations: 6
log(2)/coef(mod)[2] ## the increase in light needed to double the expected number of miners
##    light 
## 10.48647