Chapter 15 Crib sheets
The crib sheets contain R code for running analyses. There is no accompanying text to explain the output nor advice on why to use the method. You must consult course material in order to decide whether it is sensible to apply the method.
In order to use the cribsheets you must first become completely familiar with the process of loading data into R’s memory by using either read.csv, for comma separated variable files or read_excel which can import data directly from an excel spreadsheet file. You need to know how to put together your own annotated markdown files, with embedded code chunks and annotated comments.
For each analysis an example data set is provided that is loaded from the /home/aqm/data folder on the server. The file is converted into a data table in the cribsheet. This data table can be exported and then used as the template for your own analysis.
To use the cribsheet, first look carefully at the format of the example data. Download this file and modify it in Excel, changing the values and headers to match your own data. Then build a markdown file using your own data as the input. Change any names of variables to match those used in your own data set. Providing you paste in chunks from the crib sheet in the right order you can then build your own bespoke analysis for your data that will reproduce the anaysis shown in the crobsheet. Order of the operations is very important, as some code chnuks are precursors to others. If you understand the logic of the analysis this will not be a problem.
15.1 Classical null hypothesis tests in R
This crib sheet shows how to run simple, introductory, statistical tests. These are very rarely the best way to analyse your data. Model based procedures are available that produce a more informative analysis in every case, including situations when a non-parametric test is often chosen.
15.1.1 Un-paired T-test
15.1.2 Data formats
15.1.2.1 Long format
d<-read.csv("/home/aqm/data/leaves.csv")
dt(d)
15.1.2.2 Wide format
You may sometimes be given data in a wide format, with one column per group. This is not a standard data frame. The most consistent approach is to turn it into a data frame
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
## To long format
d2<-gather(d2,key=leaf_type,value=leaf_area)
dt(d2)
d2$id<-rep(1:20,times=2)
d2 %>% spread(key=leaf_type,value=leaf_area)
## id shade sun
## 1 1 24 25
## 2 2 24 33
## 3 3 25 25
## 4 4 36 23
## 5 5 36 24
## 6 6 35 13
## 7 7 32 25
## 8 8 44 20
## 9 9 22 26
## 10 10 32 24
## 11 11 26 17
## 12 12 28 20
## 13 13 30 19
## 14 14 26 24
## 15 15 23 35
## 16 16 29 25
## 17 17 39 27
## 18 18 35 25
## 19 19 29 32
## 20 20 35 27
15.1.3 Boxplot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_box<- g0 +geom_boxplot()
g_box
15.1.4 Confidence interval plot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_conf <- g0 + stat_summary(fun.y=mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_conf
15.1.5 Dynamite plot
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_bar <- g0 + stat_summary(fun.y=mean,geom="bar") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_bar
15.1.6 T-test
t.test(d$leaf_area~d$leaf_type)
##
## Welch Two Sample t-test
##
## data: d$leaf_area by d$leaf_type
## t = 3.416, df = 37.343, p-value = 0.001546
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.462573 9.637427
## sample estimates:
## mean in group shade mean in group sun
## 30.50 24.45
15.2 Wilcoxon test (also known as ‘Mann-Whitney’ test)
15.2.1 Wide format
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
15.2.2 Wilcoxon test
wilcox.test(d2$shade,d2$sun)
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$shade and d2$sun
## W = 307.5, p-value = 0.003672
## alternative hypothesis: true location shift is not equal to 0
15.3 Paired T-test
15.3.1 Data formats
15.3.1.1 Wide format
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
15.3.1.2 Long format
For the classic paired t-test function the long format is best changed to wide.
d<-read.csv("/home/aqm/data/paired1.csv")
dt(d)
## Change to wide
d %>% spread(time,val) %>% dt()
15.3.2 T-test
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
d %>% spread(time,val) -> d2
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
15.4 Paired Wilcoxon test
15.4.1 Wide format
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
15.4.2 Wilcoxon test
wilcox.test(d2$Before,d2$After, paired=)
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$Before and d2$After
## W = 26, p-value = 0.07522
## alternative hypothesis: true location shift is not equal to 0
15.5 Correlation test
15.5.1 Data format
Only the standard data frame format is sensible in this case. However a data frame may consist of many variables that can be correlated with each other. In this case more advanced methods avoid the need to correlate each pair in turn. Fitting regresion lines is included in other crib sheets.
d<-read.csv("/home/aqm/data/arne_pines_simple.csv")
dt(d)
15.5.2 Scatterplot
g0<-ggplot(d,aes(x=twi,y=pine_density)) +geom_point()
g0
15.5.3 Pearson’s correlation test
cor.test(d$pine_density,d$twi)
##
## Pearson's product-moment correlation
##
## data: d$pine_density and d$twi
## t = 0.53448, df = 38, p-value = 0.5961
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2313542 0.3874638
## sample estimates:
## cor
## 0.08638047
15.5.4 Spearman’s rank correlation test
There are often ties, but this does not invalidate the test. R produces a warning in this case.
cor.test(d$pine_density,d$twi,method="spearman")
##
## Spearman's rank correlation rho
##
## data: d$pine_density and d$twi
## S = 8988.5, p-value = 0.3339
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1567992
15.6 Chi squared contingency tables
15.6.1 Data formats
15.6.1.1 Long format
The data will originally have been collected though classifying each observation. So, if the data consists of mud cores that have been classified into two categories of substrate, mud or sand, and two categories depending whether ragworm are present or absent you will produces a csv file with the format as shown.
d<-read.csv("/home/aqm/data/HedisteCat.csv")
dt(d)
15.6.1.2 Tablular format
You might already have tabulated the data in Excel. Providing that the table is in the top cells of the first sheet of an Excel spreadsheet, this code will load the data.
library(readxl)
system ("cp contingency_table.xlsx /home/aqm/data/contingency_table.xlsx")
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
dt(ct)
15.6.2 Table of counts
15.6.2.1 Table of counts using the data frame format
ct<-table(d)
ct
## Cat
## Substrate Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
15.6.2.2 Table of counts using the ct format
## These data are already in tabular format, but some slight rearrangement is needed to turn them into an R table.
ct <-read_excel("contingency_table.xlsx")
ct<-as.data.frame(ct)
row.names(ct) <- ct[,1]
ct<-ct[,-1]
ct<-as.table(as.matrix(ct))
ct
## Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
15.6.3 Table of Proportions
15.6.3.1 Table of proportions
pt<-round(prop.table(ct) *100,1)
pt
## Absent Present
## Mud 20.9 24.5
## Sand 40.0 14.5
### Formatted version for HTMl printing
pt %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 20.9 | 24.5 |
Sand | 40.0 | 14.5 |
15.6.3.2 Table of proportions for rows
ptr<-round(prop.table(ct,margin=1) *100,1)
ptr
## Absent Present
## Mud 46.0 54.0
## Sand 73.3 26.7
ptr %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 46.0 | 54.0 |
Sand | 73.3 | 26.7 |
15.6.3.3 Table of proportions for columns
ptc<-round(prop.table(table(d),margin=2) *100,1)
ptc
## Cat
## Substrate Absent Present
## Mud 34.3 62.8
## Sand 65.7 37.2
ptc %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 34.3 | 62.8 |
Sand | 65.7 | 37.2 |
15.6.4 Plots
15.6.4.1 Mosaic plot
mosaicplot(ct)
15.6.4.2 Barplot
ct_d<-data.frame(ct)
bc<-ggplot(ct_d,aes(x=Var2,y=Freq))+geom_bar(stat="identity")
bc + facet_wrap(~Var1)
15.6.5 Chisq-test
chisq.test(ct)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ct
## X-squared = 7.4482, df = 1, p-value = 0.00635
15.6.6 Fisher’s exact test
fisher.test(ct)
##
## Fisher's Exact Test for Count Data
##
## data: ct
## p-value = 0.005719
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1287875 0.7387656
## sample estimates:
## odds ratio
## 0.3132911