Chapter 4 Classical null hypothesis tests in R

This crib sheet shows how to run simple, introductory, statistical tests. These are very rarely the best way to analyse your data. Model based procedures are available that produce a more informative analysis in every case, including situations when a non-parametric test is often chosen.

4.1 Un-paired T-test

4.1.1 Data formats

4.1.1.1 Long format

d<-read.csv("/home/aqm/data/leaves.csv")
dt(d)

4.1.1.2 Wide format

You may sometimes be given data in a wide format, with one column per group. This is not a standard data frame. The most consistent approach is to turn it into a data frame

d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
## To long format
d2<-gather(d2,key=leaf_type,value=leaf_area)
dt(d2)
d2$id<-rep(1:20,times=2)
d2 %>% spread(key=leaf_type,value=leaf_area)
##    id shade sun
## 1   1    24  25
## 2   2    24  33
## 3   3    25  25
## 4   4    36  23
## 5   5    36  24
## 6   6    35  13
## 7   7    32  25
## 8   8    44  20
## 9   9    22  26
## 10 10    32  24
## 11 11    26  17
## 12 12    28  20
## 13 13    30  19
## 14 14    26  24
## 15 15    23  35
## 16 16    29  25
## 17 17    39  27
## 18 18    35  25
## 19 19    29  32
## 20 20    35  27

4.1.2 Boxplot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_box<- g0 +geom_boxplot()
g_box

4.1.3 Confidence interval plot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_conf <- g0 + stat_summary(fun.y=mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_conf

4.1.4 Dynamite plot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_bar <- g0 + stat_summary(fun.y=mean,geom="bar") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_bar

4.1.5 T-test

t.test(d$leaf_area~d$leaf_type)
## 
##  Welch Two Sample t-test
## 
## data:  d$leaf_area by d$leaf_type
## t = 3.416, df = 37.343, p-value = 0.001546
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.462573 9.637427
## sample estimates:
## mean in group shade   mean in group sun 
##               30.50               24.45

4.2 Wilcoxon test (also known as ‘Mann-Whitney’ test)

4.2.1 Wide format

d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)

4.2.2 Wilcoxon test

wilcox.test(d2$shade,d2$sun)
## Warning in wilcox.test.default(d2$shade, d2$sun): cannot compute exact p-
## value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  d2$shade and d2$sun
## W = 307.5, p-value = 0.003672
## alternative hypothesis: true location shift is not equal to 0

4.3 Paired T-test

4.3.1 Data formats

4.3.1.1 Wide format

The wide format seems the natural one to use in this case.

d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)

4.3.1.2 Long format

For the classic paired t-test function the long format is best changed to wide.

d<-read.csv("/home/aqm/data/paired1.csv")
dt(d)
## Change to wide
d %>% spread(time,val) %>% dt()

4.3.2 T-test

t.test(d2$After,d2$Before,paired=TRUE)
## 
##  Paired t-test
## 
## data:  d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.3087225 10.8912775
## sample estimates:
## mean of the differences 
##                     5.6
d %>% spread(time,val) -> d2
t.test(d2$After,d2$Before,paired=TRUE)
## 
##  Paired t-test
## 
## data:  d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.3087225 10.8912775
## sample estimates:
## mean of the differences 
##                     5.6

4.4 Paired Wilcoxon test

4.4.1 Wide format

The wide format seems the natural one to use in this case.

d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)

4.4.2 Wilcoxon test

wilcox.test(d2$Before,d2$After, paired=)
## Warning in wilcox.test.default(d2$Before, d2$After, paired = ): cannot
## compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  d2$Before and d2$After
## W = 26, p-value = 0.07522
## alternative hypothesis: true location shift is not equal to 0

4.5 Correlation test

4.5.1 Data format

Only the standard data frame format is sensible in this case. However a data frame may consist of many variables that can be correlated with each other. In this case more advanced methods avoid the need to correlate each pair in turn. Fitting regresion lines is included in other crib sheets.

d<-read.csv("/home/aqm/data/arne_pines_simple.csv")
dt(d)

4.5.2 Scatterplot

g0<-ggplot(d,aes(x=twi,y=pine_density)) +geom_point()
g0

4.5.3 Pearson’s correlation test

cor.test(d$pine_density,d$twi)
## 
##  Pearson's product-moment correlation
## 
## data:  d$pine_density and d$twi
## t = 0.53448, df = 38, p-value = 0.5961
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2313542  0.3874638
## sample estimates:
##        cor 
## 0.08638047

4.5.4 Spearman’s rank correlation test

There are often ties, but this does not invalidate the test. R produces a warning in this case.

cor.test(d$pine_density,d$twi,method="spearman")
## Warning in cor.test.default(d$pine_density, d$twi, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  d$pine_density and d$twi
## S = 8988.5, p-value = 0.3339
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1567992

4.6 Chi squared contingency tables

4.6.1 Data formats

4.6.1.1 Long format

The data will originally have been collected though classifying each observation. So, if the data consists of mud cores that have been classified into two categories of substrate, mud or sand, and two categories depending whether ragworm are present or absent you will produces a csv file with the format as shown.

d<-read.csv("/home/aqm/data/HedisteCat.csv")
dt(d)

4.6.1.2 Tablular format

You might already have tabulated the data in Excel. Providing that the table is in the top cells of the first sheet of an Excel spreadsheet, this code will load the data.

library(readxl)
system ("cp contingency_table.xlsx /home/aqm/data/contingency_table.xlsx")
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
dt(ct)

4.6.2 Table of counts

4.6.2.1 Table of counts using the data frame format

ct<-table(d)
ct
##          Cat
## Substrate Absent Present
##      Mud      23      27
##      Sand     44      16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent Present
Mud 23 27
Sand 44 16

4.6.2.2 Table of counts using the ct format

## These data are already in tabular format, but some slight rearrangement is needed to turn them into an R table.
ct <-read_excel("contingency_table.xlsx")
ct<-as.data.frame(ct)
row.names(ct) <- ct[,1]
ct<-ct[,-1]
ct<-as.table(as.matrix(ct))
ct
##      Absent Present
## Mud      23      27
## Sand     44      16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent Present
Mud 23 27
Sand 44 16

4.6.3 Table of Proportions

4.6.3.1 Table of proportions

pt<-round(prop.table(ct) *100,1)
pt
##      Absent Present
## Mud    20.9    24.5
## Sand   40.0    14.5
### Formatted version for HTMl printing
pt %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent Present
Mud 20.9 24.5
Sand 40.0 14.5

4.6.3.2 Table of proportions for rows

ptr<-round(prop.table(ct,margin=1) *100,1)
ptr
##      Absent Present
## Mud    46.0    54.0
## Sand   73.3    26.7
ptr %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent Present
Mud 46.0 54.0
Sand 73.3 26.7

4.6.3.3 Table of proportions for columns

ptc<-round(prop.table(table(d),margin=2) *100,1)
ptc
##          Cat
## Substrate Absent Present
##      Mud    34.3    62.8
##      Sand   65.7    37.2
ptc %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent Present
Mud 34.3 62.8
Sand 65.7 37.2

4.6.4 Plots

4.6.4.1 Mosaic plot

mosaicplot(ct)

4.6.4.2 Barplot

ct_d<-data.frame(ct)

bc<-ggplot(ct_d,aes(x=Var2,y=Freq))+geom_bar(stat="identity")
bc + facet_wrap(~Var1)

4.6.5 Chisq-test

chisq.test(ct)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 7.4482, df = 1, p-value = 0.00635

4.6.6 Fisher’s exact test

fisher.test(ct)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ct
## p-value = 0.005719
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1287875 0.7387656
## sample estimates:
## odds ratio 
##  0.3132911