How to use the crib sheet

The crib sheets contain R code for running analyses. There is no accompanying text explaining the output, nor advice on when to use each method. You must consult the course material in order to decide whether it is sensible to apply a method.

In order to use the crib sheets you must first be completely familiar with the process of loading data into R’s memory, using either read.csv for comma separated values (CSV) files or read_excel, which imports data directly from an Excel spreadsheet. You also need to know how to put together your own annotated markdown files, with embedded code chunks and explanatory comments.
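As a minimal sketch of the data loading step (the file names below are placeholders for your own files, not files on the server):

library(readxl)                      ## provides read_excel
d_csv <- read.csv("my_data.csv")     ## comma separated values file
d_xl <- read_excel("my_data.xlsx")   ## first sheet of an Excel workbook by default
str(d_csv)                           ## check that variable names and types are as expected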

For each analysis an example data set is provided, loaded from the /home/aqm/data folder on the server. The crib sheet converts the file into a data table. This data table can be exported and then used as the template for your own analysis.

To use the crib sheet, first look carefully at the format of the example data. Download this file and modify it in Excel, changing the values and headers to match your own data. Then build a markdown file using your own data as the input, changing any variable names to match those used in your own data set. Providing you paste in chunks from the crib sheet in the right order, you can build a bespoke analysis for your data that reproduces the analysis shown in the crib sheet. The order of operations is very important, as some code chunks are precursors to others. If you understand the logic of the analysis this will not be a problem.
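For example, a bespoke version of the unpaired t-test analysis below would keep its chunks in this order (a sketch only; my_leaves.csv is a placeholder for your own file, assumed to use the same leaf_type and leaf_area headers as the example data):

library(ggplot2)
d <- read.csv("my_leaves.csv")                              ## 1. load the data first
ggplot(d, aes(x=leaf_type, y=leaf_area)) + geom_boxplot()   ## 2. inspect it with a plot
t.test(d$leaf_area ~ d$leaf_type)                           ## 3. only then run the test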

Classical null hypothesis tests in R

This crib sheet shows how to run simple, introductory statistical tests. These are rarely the best way to analyse your data: model based procedures are available that produce a more informative analysis in every case, including situations in which a non-parametric test is often chosen.
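For example, the unpaired t-test below can also be fitted as a linear model, which tests the same difference between groups (assuming equal variances rather than using the Welch correction) and additionally provides coefficients, confidence intervals and diagnostics. A sketch using the example leaves data:

d <- read.csv("/home/aqm/data/leaves.csv")
m <- lm(leaf_area ~ leaf_type, data = d)   ## model based equivalent of the t-test
summary(m)                                 ## group difference with its t value and p-value
confint(m)                                 ## confidence intervals for the coefficients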

Unpaired T-test

Data formats

Long format

d<-read.csv("/home/aqm/data/leaves.csv")
dt(d)

Wide format

You may sometimes be given data in a wide format, with one column per group. This is not a standard tidy data frame with one observation per row. The most consistent approach is to reshape it into the long format.

d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
## To long format (gather and spread come from tidyr, assumed loaded in the setup chunk)
d2<-gather(d2,key=leaf_type,value=leaf_area)
dt(d2)
## Back to wide format: add an id to pair each shade row with a sun row, then spread
d2$id<-rep(1:20,times=2)
d2 %>% spread(key=leaf_type,value=leaf_area)
##    id shade sun
## 1   1    24  25
## 2   2    24  33
## 3   3    25  25
## 4   4    36  23
## 5   5    36  24
## 6   6    35  13
## 7   7    32  25
## 8   8    44  20
## 9   9    22  26
## 10 10    32  24
## 11 11    26  17
## 12 12    28  20
## 13 13    30  19
## 14 14    26  24
## 15 15    23  35
## 16 16    29  25
## 17 17    39  27
## 18 18    35  25
## 19 19    29  32
## 20 20    35  27
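Note that gather and spread have been superseded in recent versions of tidyr. If the version installed on the server supports them, pivot_longer and pivot_wider do the same job; a sketch of the wide to long step:

library(tidyr)
d2 <- read.csv("/home/aqm/data/leaves2.csv")
d2_long <- pivot_longer(d2, cols = everything(),
                        names_to = "leaf_type", values_to = "leaf_area")   ## one row per leaf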

Boxplot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_box<- g0 +geom_boxplot()
g_box

Confidence interval plot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_conf <- g0 + stat_summary(fun.y=mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_conf

Dynamite plot

g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_bar <- g0 + stat_summary(fun.y=mean,geom="bar") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_bar

T-test

t.test(d$leaf_area~d$leaf_type)
## 
##  Welch Two Sample t-test
## 
## data:  d$leaf_area by d$leaf_type
## t = 3.416, df = 37.343, p-value = 0.001546
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.462573 9.637427
## sample estimates:
## mean in group shade   mean in group sun 
##               30.50               24.45

Wilcoxon rank sum test (also known as the Mann-Whitney test)

Wide format

d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)

Wilcoxon test

wilcox.test(d2$shade,d2$sun)
## Warning in wilcox.test.default(d2$shade, d2$sun): cannot compute exact p-
## value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  d2$shade and d2$sun
## W = 307.5, p-value = 0.003672
## alternative hypothesis: true location shift is not equal to 0

Paired T-test

Data formats

Wide format

The wide format seems the natural one to use in this case.

d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)

Long format

For the classic paired t-test function the long format is best changed to wide.

d<-read.csv("/home/aqm/data/paired1.csv")
dt(d)
## Change to wide
d %>% spread(time,val) %>% dt()

T-test

t.test(d2$After,d2$Before,paired=TRUE)
## 
##  Paired t-test
## 
## data:  d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.3087225 10.8912775
## sample estimates:
## mean of the differences 
##                     5.6
## The long format data give the same result once reshaped to wide
d %>% spread(time,val) -> d2
t.test(d2$After,d2$Before,paired=TRUE)
## 
##  Paired t-test
## 
## data:  d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.3087225 10.8912775
## sample estimates:
## mean of the differences 
##                     5.6

Paired Wilcoxon test

Wide format

The wide format seems the natural one to use in this case.

d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)

Wilcoxon test

wilcox.test(d2$Before,d2$After,paired=TRUE)

Correlation test

Data format

Only the standard data frame format is sensible in this case. However, a data frame may contain many variables that could be correlated with each other, and more advanced methods avoid the need to correlate each pair in turn. Fitting regression lines is covered in other crib sheets.

d<-read.csv("/home/aqm/data/arne_pines_simple.csv")
dt(d)
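If your data frame holds several numeric variables, cor() returns all the pairwise correlations in one matrix, which avoids calling cor.test() on each pair in turn (a sketch, assuming the columns kept are all numeric):

num <- d[sapply(d, is.numeric)]   ## keep only the numeric columns
round(cor(num), 2)                ## matrix of pairwise Pearson correlations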

Scatterplot

g0<-ggplot(d,aes(x=twi,y=pine_density)) +geom_point()
g0

Pearson’s correlation test

cor.test(d$pine_density,d$twi)
## 
##  Pearson's product-moment correlation
## 
## data:  d$pine_density and d$twi
## t = 0.53448, df = 38, p-value = 0.5961
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2313542  0.3874638
## sample estimates:
##        cor 
## 0.08638047

Spearman’s rank correlation test

There are often ties, but this does not invalidate the test. R produces a warning in this case.

cor.test(d$pine_density,d$twi,method="spearman")
## Warning in cor.test.default(d$pine_density, d$twi, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  d$pine_density and d$twi
## S = 8988.5, p-value = 0.3339
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1567992

Chi squared contingency tables

Data formats

Long format

The data will originally have been collected by classifying each observation. So, if the data consist of mud cores that have been classified into two categories of substrate (mud or sand) and two categories depending on whether ragworm are present or absent, you will produce a CSV file with the format shown below.

d<-read.csv("/home/aqm/data/HedisteCat.csv")
dt(d)
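If you are entering data like this yourself, the long format is simply one row per core with a column for each classification. A sketch of the structure, assuming the same column names as the example file (Substrate and Cat):

d_example <- data.frame(Substrate = c("Mud", "Mud", "Sand"),
                        Cat = c("Present", "Absent", "Absent"))   ## one row per core
d_example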

Tabular format

You might already have tabulated the data in Excel. Providing that the table is in the top cells of the first sheet of an Excel spreadsheet, this code will load the data.

library(readxl)
system ("cp contingency_table.xlsx /home/aqm/data/contingency_table.xlsx")
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
dt(ct)

Table of counts

Table of counts using the data frame format

ct<-table(d)
ct
##          Cat
## Substrate Absent Present
##      Mud      23      27
##      Sand     44      16
### Formatted version for HTML printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
       Absent   Present
Mud        23        27
Sand       44        16

Table of counts using the tabular (Excel) format

## These data are already in tabular format, but some slight rearrangement is needed to turn them into an R table.
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
ct<-as.data.frame(ct)
row.names(ct) <- ct[,1]      ## the first column holds the row labels
ct<-ct[,-1]                  ## drop the label column, leaving only the counts
ct<-as.table(as.matrix(ct))
ct
##      Absent Present
## Mud      23      27
## Sand     44      16
### Formatted version for HTML printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
       Absent   Present
Mud        23        27
Sand       44        16

Table of Proportions

Table of proportions

pt<-round(prop.table(ct) *100,1)
pt
##      Absent Present
## Mud    20.9    24.5
## Sand   40.0    14.5
### Formatted version for HTML printing
pt %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
       Absent   Present
Mud      20.9      24.5
Sand     40.0      14.5

Table of proportions for rows

ptr<-round(prop.table(ct,margin=1) *100,1)
ptr
##      Absent Present
## Mud    46.0    54.0
## Sand   73.3    26.7
ptr %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
       Absent   Present
Mud      46.0      54.0
Sand     73.3      26.7

Table of proportions for columns

ptc<-round(prop.table(table(d),margin=2) *100,1)
ptc
##          Cat
## Substrate Absent Present
##      Mud    34.3    62.8
##      Sand   65.7    37.2
ptc %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
       Absent   Present
Mud      34.3      62.8
Sand     65.7      37.2

Plots

Mosaic plot

mosaicplot(ct)

Barplot

ct_d<-data.frame(ct)   ## Var1 = substrate, Var2 = presence/absence, Freq = count

bc<-ggplot(ct_d,aes(x=Var2,y=Freq))+geom_bar(stat="identity")
bc + facet_wrap(~Var1)

Chisq-test

chisq.test(ct)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 7.4482, df = 1, p-value = 0.00635
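The chi-squared approximation is only reliable when the expected counts are reasonably large (a common rule of thumb is at least five per cell). The expected counts can be inspected from the fitted test object:

chisq.test(ct)$expected   ## expected counts under the null hypothesis of no association

If any expected count is very small, Fisher's exact test below is the safer option.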

Fisher’s exact test

fisher.test(ct)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ct
## p-value = 0.005719
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1287875 0.7387656
## sample estimates:
## odds ratio 
##  0.3132911