theme_set(theme_bw())
d<-read.csv("/home/aqm/course/data/HedisteCounts.csv")
The best way to look at these data quickly is by plotting histograms of the counts. I’ll use ggplot and facet wrapping
ggplot(d, aes(x=Count)) +geom_histogram(binwidth=3,fill="grey", col="black") +facet_wrap(~Substrate)
So its very obvious that the data are not normally distributed and there are lots of zeros. So, an undergraduate would probably use a Wilcoxon rank sum test (Mann Whitney in SPSS) in order to get a p-value.
wilcox.test(d$Count~d$Substrate)
##
## Wilcoxon rank sum test with continuity correction
##
## data: d$Count by d$Substrate
## W = 2006, p-value = 0.0005577
## alternative hypothesis: true location shift is not equal to 0
Great! We’ve found a significant difference. However, where is the effect size?
Let’s get the means anyway.
d%>% group_by(Substrate) %>% summarise(mean(Count))
## # A tibble: 2 x 2
## Substrate `mean(Count)`
## * <chr> <dbl>
## 1 Mud 3.8
## 2 Sand 1.02
Looks like there are around four times as many ragworm in the mud than the sand. But we can’t say that based on the non-parametric test.
Let’s do a T test
t.test(d$Count~d$Substrate)
##
## Welch Two Sample t-test
##
## data: d$Count by d$Substrate
## t = 3.561, df = 62.74, p-value = 0.0007115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.221265 4.345402
## sample estimates:
## mean in group Mud mean in group Sand
## 3.800000 1.016667
Yippee! It’s significant again. But my tutor is gong to tell me to put it into SPSS and SPSS will tell me that the test isn’t valid.
mod<-lm(data=d,Count~Substrate)
anova(mod)
## Analysis of Variance Table
##
## Response: Count
## Df Sum Sq Mean Sq F value Pr(>F)
## Substrate 1 211.28 211.28 14.451 0.0002385 ***
## Residuals 108 1578.98 14.62
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test residuals for normality
shapiro.test(residuals(mod))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod)
## W = 0.7562, p-value = 3.152e-12
Sugar. Not good.
Test for homogeneity of variance?
bartlett.test(residuals(mod)~d$Substrate)
##
## Bartlett test of homogeneity of variances
##
## data: residuals(mod) by d$Substrate
## Bartlett's K-squared = 39.177, df = 1, p-value = 3.871e-10
Not valid. So, I’ll have to use the wilcox test?
Now I’m a master’s student I’m more sophisticated. I’ll use a GLM of the poisson family with a log link function. That will convince my tutor!
mod<-glm(data=d,Count~Substrate,family="poisson")
summary(mod)
##
## Call:
## glm(formula = Count ~ Substrate, family = "poisson", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7568 -1.4259 -1.4259 0.4656 5.8335
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.33500 0.07255 18.402 <2e-16 ***
## SubstrateSand -1.31847 0.14716 -8.959 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 600.35 on 109 degrees of freedom
## Residual deviance: 505.17 on 108 degrees of freedom
## AIC: 654.94
##
## Number of Fisher Scoring iterations: 6
Oops. I’ve spotted over dispersion. Should have seen that one coming from the histograms.
But… I’m a very savvy MSC student because I took the AQM course. I have an answer to that. I’ll use a glm of the negative binomial family.
library(MASS)
mod<-glm.nb(data=d,Count~Substrate)
summary(mod)
##
## Call:
## glm.nb(formula = Count ~ Substrate, data = d, init.theta = 0.2727104969,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2143 -0.9205 -0.9205 0.1156 1.6322
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3350 0.2804 4.762 1.92e-06 ***
## SubstrateSand -1.3185 0.3951 -3.337 0.000847 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.2727) family taken to be 1)
##
## Null deviance: 100.795 on 109 degrees of freedom
## Residual deviance: 89.795 on 108 degrees of freedom
## AIC: 385.39
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.2727
## Std. Err.: 0.0608
##
## 2 x log-likelihood: -379.3940
That looks much better.
Let’s try to understand the coefficients.
exp(coef(mod))
## (Intercept) SubstrateSand
## 3.8000000 0.2675439
Aha. The first is the mean for mud. What is the second one?
exp(coef(mod)[1]) * exp(coef(mod)[2])
## (Intercept)
## 1.016667
Right. It’s the value that I need to multiply the count in mud by in order to get the count in sand.
Effect size with confidence intervals?
exp(confint(mod))
## 2.5 % 97.5 %
## (Intercept) 2.2821868 6.9085692
## SubstrateSand 0.1216029 0.5794958
So the effect lies between 12% and 58%. We have an answer!
Converting to presence absence.
d$pres<-1*(d$Count>0)
Tabulate the data.
table(d$pres,d$Substrate)
##
## Mud Sand
## 0 23 44
## 1 27 16
Plot as bar chart. (An undergrad would use Excel)
d %>% group_by(Substrate,pres) %>% summarise(n=n()) %>%
ggplot(aes(x=Substrate,y=n, fill=as.factor(pres))) + geom_col(position="dodge")
Chi square test
It’s a 2x2 contingency table. I can do a chi squared!
chisq.test( table(d$pres,d$Substrate))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(d$pres, d$Substrate)
## X-squared = 7.4482, df = 1, p-value = 0.00635
Yippee! It’s significant. But … where is the effect size? Fit a binomial glm
mod<-glm(data=d,pres~Substrate,family="binomial")
summary(mod)
##
## Call:
## glm(formula = pres ~ Substrate, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2462 -0.7876 -0.7876 1.1101 1.6259
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1603 0.2838 0.565 0.57202
## SubstrateSand -1.1719 0.4071 -2.879 0.00399 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 147.21 on 109 degrees of freedom
## Residual deviance: 138.58 on 108 degrees of freedom
## AIC: 142.58
##
## Number of Fisher Scoring iterations: 4
exp(coef(mod))
## (Intercept) SubstrateSand
## 1.1739130 0.3097643
Hang on… how do we interpret these exponentiated coeficients? They are odds.
table(d$pres,d$Substrate)
##
## Mud Sand
## 0 23 44
## 1 27 16
So 27 to 23 for presence in mud gives ..
27/23
## [1] 1.173913
Multiply this by the “relative risk” of finding ragworm in sand.
1.173913 * 0.3097643
## [1] 0.3636363
Should be the same as
16/44
## [1] 0.3636364
Which it is.
So when we get confidence intervals the key value refers to the odds ratio.
exp(confint(mod))
## 2.5 % 97.5 %
## (Intercept) 0.6735089 2.063209
## SubstrateSand 0.1369500 0.679671
So we have got a confidence interval for a meaningful quantity, but we have unfortunately entered the interpretabilty minefield thrown up by relative odds ratios. Still better than using a chi squared!
We end up by being able to state that the relative odds of finding ragworm present in “sand” cores compared to “mud” cores (note that the variable itself is based on a binary split) is between 14% and 68%. Compare this to the difference between the result for the analysis of abundance. It is surprisingly similar. This illustrates that there is not really that much information in the actual counts. Presence vs absence is just about as good a measure (statistically). Of course, if you want to make an estimate of the overall number of ragworm then you do need to count them.
Note that if you have been following stories on vaccine trials in the news, this is equivalent to the stats being reported for a RCT (randomised control trial) between vaccinated and placebo groups. This is because analyses of drug trials also tend to use logistic regression (i.e glms of the binomial family)
As an aside, I mentioned that the chi squared test is essentially just a poisson glm. This is how you can use a poisson glm to get the same p-value. The input data are the tabulated counts.
d %>% group_by(pres,Substrate) %>% summarise(count=n()) -> tb
#tb<-as.data.frame(table(d$pres,d$Substrate)) ## base R tables need to be turned to df
#names(tb)<-c("pres","substrate","count")
tb$pres<-as.factor(tb$pres)
tb
## # A tibble: 4 x 3
## # Groups: pres [2]
## pres Substrate count
## <fct> <chr> <int>
## 1 0 Mud 23
## 2 0 Sand 44
## 3 1 Mud 27
## 4 1 Sand 16
If you remember your introductory stats the chisquared test using a contingency table is a test for association. So the element that is of interest in the model is an interaction between presence and substrate.
mod<-glm(data=tb,count~pres*Substrate,family="poisson")
summary(mod)
##
## Call:
## glm(formula = count ~ pres * Substrate, family = "poisson", data = tb)
##
## Deviance Residuals:
## [1] 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1355 0.2085 15.037 < 2e-16 ***
## pres1 0.1603 0.2838 0.565 0.57202
## SubstrateSand 0.6487 0.2573 2.521 0.01170 *
## pres1:SubstrateSand -1.1719 0.4071 -2.879 0.00399 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1.4819e+01 on 3 degrees of freedom
## Residual deviance: -4.8850e-15 on 0 degrees of freedom
## AIC: 28.367
##
## Number of Fisher Scoring iterations: 3
The p value should be the same as the chi squared test, if the technical continuity correction used for contingency tables is not made.
chisq.test( table(d$pres,d$Substrate),correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: table(d$pres, d$Substrate)
## X-squared = 8.5577, df = 1, p-value = 0.003441
Which it is. There is a very minor numerical difference due to rounding and minor differences in the actual mechanism of calculation, but the two numbers are practically equivalent. This equivalence applies to all situations, not just this data set. Notice it is also the same as the p-value for the binomial model.
The point being illustrated is that the assumption made by the chi squared test is that the counts follow a poisson distribution with a variance equal to the mean. This is used when testing for a difference between the observed number of counts and the expected number of counts under the null hypothesis of no association (interaction) between presence and substrate. So, the chi squared test is not even really a non parametric test!
Of the three possible ways of getting a p-value the most useful is clearly the binomial glm, as this provides confidence intervals. The poisson glm for the counts actually also could provide the same intervals, by looking at the interaction term, but this is rarely used in this context.
exp(confint(mod))
## 2.5 % 97.5 %
## (Intercept) 14.8342716 33.720346
## pres1 0.6735089 2.063209
## SubstrateSand 1.1676593 3.219274
## pres1:SubstrateSand 0.1369500 0.679671