Chapter 2 Classical null hypothesis tests in R
This crib sheet shows how to run simple, introductory, statistical tests. These are very rarely the best way to analyse your data. Model based procedures are available that produce a more informative analysis in every case, including situations when a non-parametric test is often chosen.
2.1 Un-paired T-test
2.1.1 Data formats
2.1.1.1 Long format
d<-read.csv("/home/aqm/data/leaves.csv")
dt(d)
2.1.1.2 Wide format
You may sometimes be given data in a wide format, with one column per group. This is not a standard data frame. The most consistent approach is to turn it into a data frame
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
## To long format
d2<-gather(d2,key=leaf_type,value=leaf_area)
dt(d2)
d2$id<-rep(1:20,times=2)
d2 %>% spread(key=leaf_type,value=leaf_area)
## id shade sun
## 1 1 24 25
## 2 2 24 33
## 3 3 25 25
## 4 4 36 23
## 5 5 36 24
## 6 6 35 13
## 7 7 32 25
## 8 8 44 20
## 9 9 22 26
## 10 10 32 24
## 11 11 26 17
## 12 12 28 20
## 13 13 30 19
## 14 14 26 24
## 15 15 23 35
## 16 16 29 25
## 17 17 39 27
## 18 18 35 25
## 19 19 29 32
## 20 20 35 27
2.1.2 Boxplot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_box<- g0 +geom_boxplot()
g_box
2.1.3 Confidence interval plot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_conf <- g0 + stat_summary(fun.y=mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_conf
2.1.4 Dynamite plot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_bar <- g0 + stat_summary(fun.y=mean,geom="bar") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_bar
2.1.5 T-test
t.test(d$leaf_area~d$leaf_type)
##
## Welch Two Sample t-test
##
## data: d$leaf_area by d$leaf_type
## t = 3.416, df = 37.343, p-value = 0.001546
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.462573 9.637427
## sample estimates:
## mean in group shade mean in group sun
## 30.50 24.45
2.2 Wilcoxon test (also known as ‘Mann-Whitney’ test)
2.2.1 Wide format
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
2.2.2 Wilcoxon test
wilcox.test(d2$shade,d2$sun)
## Warning in wilcox.test.default(d2$shade, d2$sun): cannot compute exact p-
## value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$shade and d2$sun
## W = 307.5, p-value = 0.003672
## alternative hypothesis: true location shift is not equal to 0
2.3 Paired T-test
2.3.1 Data formats
2.3.1.1 Wide format
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
2.3.1.2 Long format
For the classic paired t-test function the long format is best changed to wide.
d<-read.csv("/home/aqm/data/paired1.csv")
dt(d)
## Change to wide
d %>% spread(time,val) %>% dt()
2.3.2 T-test
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
d %>% spread(time,val) -> d2
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
2.4 Paired Wilcoxon test
2.4.1 Wide format
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
2.4.2 Wilcoxon test
wilcox.test(d2$Before,d2$After, paired=)
## Warning in wilcox.test.default(d2$Before, d2$After, paired = ): cannot
## compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$Before and d2$After
## W = 26, p-value = 0.07522
## alternative hypothesis: true location shift is not equal to 0
2.5 Correlation test
2.5.1 Data format
Only the standard data frame format is sensible in this case. However a data frame may consist of many variables that can be correlated with each other. In this case more advanced methods avoid the need to correlate each pair in turn. Fitting regresion lines is included in other crib sheets.
d<-read.csv("/home/aqm/data/arne_pines_simple.csv")
dt(d)
2.5.2 Scatterplot
g0<-ggplot(d,aes(x=twi,y=pine_density)) +geom_point()
g0
2.5.3 Pearson’s correlation test
cor.test(d$pine_density,d$twi)
##
## Pearson's product-moment correlation
##
## data: d$pine_density and d$twi
## t = 0.53448, df = 38, p-value = 0.5961
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2313542 0.3874638
## sample estimates:
## cor
## 0.08638047
2.5.4 Spearman’s rank correlation test
There are often ties, but this does not invalidate the test. R produces a warning in this case.
cor.test(d$pine_density,d$twi,method="spearman")
## Warning in cor.test.default(d$pine_density, d$twi, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: d$pine_density and d$twi
## S = 8988.5, p-value = 0.3339
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1567992
2.6 Chi squared contingency tables
2.6.1 Data formats
2.6.1.1 Long format
The data will originally have been collected though classifying each observation. So, if the data consists of mud cores that have been classified into two categories of substrate, mud or sand, and two categories depending whether ragworm are present or absent you will produces a csv file with the format as shown.
d<-read.csv("/home/aqm/data/HedisteCat.csv")
dt(d)
2.6.1.2 Tablular format
You might already have tabulated the data in Excel. Providing that the table is in the top cells of the first sheet of an Excel spreadsheet, this code will load the data.
library(readxl)
system ("cp contingency_table.xlsx /home/aqm/data/contingency_table.xlsx")
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
dt(ct)
2.6.2 Table of counts
2.6.2.1 Table of counts using the data frame format
ct<-table(d)
ct
## Cat
## Substrate Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
2.6.2.2 Table of counts using the ct format
## These data are already in tabular format, but some slight rearrangement is needed to turn them into an R table.
ct <-read_excel("contingency_table.xlsx")
ct<-as.data.frame(ct)
row.names(ct) <- ct[,1]
ct<-ct[,-1]
ct<-as.table(as.matrix(ct))
ct
## Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
2.6.3 Table of Proportions
2.6.3.1 Table of proportions
pt<-round(prop.table(ct) *100,1)
pt
## Absent Present
## Mud 20.9 24.5
## Sand 40.0 14.5
### Formatted version for HTMl printing
pt %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 20.9 | 24.5 |
Sand | 40.0 | 14.5 |
2.6.3.2 Table of proportions for rows
ptr<-round(prop.table(ct,margin=1) *100,1)
ptr
## Absent Present
## Mud 46.0 54.0
## Sand 73.3 26.7
ptr %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 46.0 | 54.0 |
Sand | 73.3 | 26.7 |
2.6.3.3 Table of proportions for columns
ptc<-round(prop.table(table(d),margin=2) *100,1)
ptc
## Cat
## Substrate Absent Present
## Mud 34.3 62.8
## Sand 65.7 37.2
ptc %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 34.3 | 62.8 |
Sand | 65.7 | 37.2 |
2.6.4 Plots
2.6.4.1 Mosaic plot
mosaicplot(ct)
2.6.4.2 Barplot
ct_d<-data.frame(ct)
bc<-ggplot(ct_d,aes(x=Var2,y=Freq))+geom_bar(stat="identity")
bc + facet_wrap(~Var1)
2.6.5 Chisq-test
chisq.test(ct)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ct
## X-squared = 7.4482, df = 1, p-value = 0.00635
2.6.6 Fisher’s exact test
fisher.test(ct)
##
## Fisher's Exact Test for Count Data
##
## data: ct
## p-value = 0.005719
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1287875 0.7387656
## sample estimates:
## odds ratio
## 0.3132911