Chapter 25 Solution to GLM exercise

theme_set(theme_bw())
d<-read.csv("/home/aqm/course/data/HedisteCounts.csv")

25.1 Visualise the data

The best way to look at these data quickly is by plotting histograms of the counts. I’ll use ggplot and facet wrapping

ggplot(d, aes(x=Count)) +geom_histogram(binwidth=3,fill="grey", col="black") +facet_wrap(~Substrate)

25.2 Simple non parametric

So its very obvious that the data are not normally distributed and there are lots of zeros. So, an undergraduate would probably use a Wilcoxon rank sum test (Mann Whitney in SPSS) in order to get a p-value.

wilcox.test(d$Count~d$Substrate)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  d$Count by d$Substrate
## W = 2006, p-value = 0.0005577
## alternative hypothesis: true location shift is not equal to 0

Great! We’ve found a significant difference. However, where is the effect size?

Let’s get the means anyway.

d%>% group_by(Substrate) %>% summarise(mean(Count))

## # A tibble: 2 x 2
##   Substrate `mean(Count)`
## * <fct>             <dbl>
## 1 Mud                3.8 
## 2 Sand               1.02

Looks like there are around four times as many ragworm in the mud than the sand. But we can’t say that based on the non-parametric test.

25.3 T test

Let’s do a T test

t.test(d$Count~d$Substrate)

## 
##  Welch Two Sample t-test
## 
## data:  d$Count by d$Substrate
## t = 3.561, df = 62.74, p-value = 0.0007115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.221265 4.345402
## sample estimates:
##  mean in group Mud mean in group Sand 
##           3.800000           1.016667

Yippee! It’s significant again. But my tutor is gong to tell me to put it into SPSS and SPSS will tell me that the test isn’t valid.

25.4 One way Anova

mod<-lm(data=d,Count~Substrate)
anova(mod)

## Analysis of Variance Table
## 
## Response: Count
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Substrate   1  211.28  211.28  14.451 0.0002385 ***
## Residuals 108 1578.98   14.62                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test residuals for normality

shapiro.test(residuals(mod))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod)
## W = 0.7562, p-value = 3.152e-12

Sugar. Not good.

Test for homogeneity of variance?

bartlett.test(residuals(mod)~d$Substrate)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  residuals(mod) by d$Substrate
## Bartlett's K-squared = 39.177, df = 1, p-value = 3.871e-10

Not valid. So, I’ll have to use the wilcox test?

25.5 Poisson regression

Now I’m a master’s student I’m more sophisticated. I’ll use a GLM of the poisson family with a log link function. That will convince my tutor!

mod<-glm(data=d,Count~Substrate,family="poisson")
summary(mod)

## 
## Call:
## glm(formula = Count ~ Substrate, family = "poisson", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7568  -1.4259  -1.4259   0.4656   5.8335  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.33500    0.07255  18.402   <2e-16 ***
## SubstrateSand -1.31847    0.14716  -8.959   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 600.35  on 109  degrees of freedom
## Residual deviance: 505.17  on 108  degrees of freedom
## AIC: 654.94
## 
## Number of Fisher Scoring iterations: 6

Oops. I’ve spotted over dispersion. Should have seen that one coming from the histograms.

25.6 Negative binomial

But… I’m a very savvy MSC student because I took the AQM course. I have an answer to that. I’ll use a glm of the negative binomial family.

library(MASS)
mod<-glm.nb(data=d,Count~Substrate)
summary(mod)

## 
## Call:
## glm.nb(formula = Count ~ Substrate, data = d, init.theta = 0.2727104969, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2143  -0.9205  -0.9205   0.1156   1.6322  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.3350     0.2804   4.762 1.92e-06 ***
## SubstrateSand  -1.3185     0.3951  -3.337 0.000847 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2727) family taken to be 1)
## 
##     Null deviance: 100.795  on 109  degrees of freedom
## Residual deviance:  89.795  on 108  degrees of freedom
## AIC: 385.39
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2727 
##           Std. Err.:  0.0608 
## 
##  2 x log-likelihood:  -379.3940

That looks much better.

Let’s try to understand the coefficients.

exp(coef(mod))

##   (Intercept) SubstrateSand 
##     3.8000000     0.2675439

Aha. The first is the mean for mud. What is the second one?

exp(coef(mod)[1]) * exp(coef(mod)[2])

## (Intercept) 
##    1.016667

Right. It’s the value that I need to multiply the count in mud by in order to get the count in sand.

Effect size with confidence intervals?

exp(confint(mod))

##                   2.5 %    97.5 %
## (Intercept)   2.2821868 6.9085692
## SubstrateSand 0.1216029 0.5794958

So the effect lies between 12% and 58%. We have an answer!

25.7 Presence absence

Converting to presence absence.

d$pres<-1*(d$Count>0)

Tabulate the data.

table(d$pres,d$Substrate)

##    
##     Mud Sand
##   0  23   44
##   1  27   16

Plot as bar chart. (An undergrad would use Excel)

d %>% group_by(Substrate,pres) %>% summarise(n=n()) %>%
  ggplot(aes(x=Substrate,y=n, fill=as.factor(pres))) + geom_col(position="dodge")

Chi square test

It’s a 2x2 contingency table. I can do a chi squared!

chisq.test( table(d$pres,d$Substrate))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(d$pres, d$Substrate)
## X-squared = 7.4482, df = 1, p-value = 0.00635

Yippee! It’s significant. But … where is the effect size? Fit a binomial glm

mod<-glm(data=d,pres~Substrate,family="binomial")
summary(mod)

## 
## Call:
## glm(formula = pres ~ Substrate, family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2462  -0.7876  -0.7876   1.1101   1.6259  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     0.1603     0.2838   0.565  0.57202   
## SubstrateSand  -1.1719     0.4071  -2.879  0.00399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 147.21  on 109  degrees of freedom
## Residual deviance: 138.58  on 108  degrees of freedom
## AIC: 142.58
## 
## Number of Fisher Scoring iterations: 4

exp(coef(mod))

##   (Intercept) SubstrateSand 
##     1.1739130     0.3097643

Hang on… how do we interpret these exponentiated coeficients? They are odds.

table(d$pres,d$Substrate)

##    
##     Mud Sand
##   0  23   44
##   1  27   16

So 27 to 23 for presence in mud gives ..

27/23

## [1] 1.173913

Multiply this by the “relative risk” of finding ragworm in sand.

1.173913 * 0.3097643

## [1] 0.3636363

Should be the same as

16/44

## [1] 0.3636364

Which it is.

So when we get confidence intervals the key value refers to the odds ratio.

exp(confint(mod))

##                   2.5 %   97.5 %
## (Intercept)   0.6735089 2.063209
## SubstrateSand 0.1369500 0.679671

So we have got a confidence interval for a meaningful quantity, but we have unfortunately entered the interpretabilty minefield thrown up by relative odds ratios. Still better than using a chi squared!

We end up by being able to state that the relative odds of finding ragworm present in “sand” cores compared to “mud” cores (note that the variable itself is based on a binary split) is between 14% and 68%. Compare this to the difference between the result for the analysis of abundance. It is surprisingly similar. This illustrates that there is not really that much information in the actual counts. Presence vs absence is just about as good a measure (statistically). Of course, if you want to make an estimate of the overall number of ragworm then you do need to count them.

Note that if you have been following stories on vaccine trials in the news, this is equivalent to the stats being reported for a RCT (randomised control trial) between vaccinated and placebo groups. This is because analyses of drug trials also tend to use logistic regression (i.e glms of the binomial family)

25.8 Chi squared test and poisson glms

As an aside, I mentioned that the chi squared test is essentially just a poisson glm. This is how you can use a poisson glm to get the same p-value. The input data are the tabulated counts.

d %>% group_by(pres,Substrate) %>% summarise(count=n()) -> tb
#tb<-as.data.frame(table(d$pres,d$Substrate)) ## base R tables need to be turned to df
#names(tb)<-c("pres","substrate","count")
tb$pres<-as.factor(tb$pres)
tb

## # A tibble: 4 x 3
## # Groups:   pres [2]
##   pres  Substrate count
##   <fct> <fct>     <int>
## 1 0     Mud          23
## 2 0     Sand         44
## 3 1     Mud          27
## 4 1     Sand         16

If you remember your introductory stats the chisquared test using a contingency table is a test for association. So the element that is of interest in the model is an interaction between presence and substrate.

mod<-glm(data=tb,count~pres*Substrate,family="poisson")
summary(mod)

## 
## Call:
## glm(formula = count ~ pres * Substrate, family = "poisson", data = tb)
## 
## Deviance Residuals: 
## [1]  0  0  0  0
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           3.1355     0.2085  15.037  < 2e-16 ***
## pres1                 0.1603     0.2838   0.565  0.57202    
## SubstrateSand         0.6487     0.2573   2.521  0.01170 *  
## pres1:SubstrateSand  -1.1719     0.4071  -2.879  0.00399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance:  1.4819e+01  on 3  degrees of freedom
## Residual deviance: -4.8850e-15  on 0  degrees of freedom
## AIC: 28.367
## 
## Number of Fisher Scoring iterations: 3

The p value should be the same as the chi squared test, if the technical continuity correction used for contingency tables is not made.

chisq.test( table(d$pres,d$Substrate),correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  table(d$pres, d$Substrate)
## X-squared = 8.5577, df = 1, p-value = 0.003441

Which it is. There is a very minor numerical difference due to rounding and minor differences in the actual mechanism of calculation, but the two numbers are practically equivalent. This equivalence applies to all situations, not just this data set. Notice it is also the same as the p-value for the binomial model.

The point being illustrated is that the assumption made by the chi squared test is that the counts follow a poisson distribution with a variance equal to the mean. This is used when testing for a difference between the observed number of counts and the expected number of counts under the null hypothesis of no association (interaction) between presence and substrate. So, the chi squared test is not even really a non parametric test!

Of the three possible ways of getting a p-value the most useful is clearly the binomial glm, as this provides confidence intervals. The poisson glm for the counts actually also could provide the same intervals, by looking at the interaction term, but this is rarely used in this context.

exp(confint(mod))

##                          2.5 %    97.5 %
## (Intercept)         14.8342716 33.720346
## pres1                0.6735089  2.063209
## SubstrateSand        1.1676593  3.219274
## pres1:SubstrateSand  0.1369500  0.679671