The crib sheets contain R code for running analyses. There is no accompanying text to explain the output nor advice on why to use the method. You must consult course material in order to decide whether it is sensible to apply the method.
In order to use the cribsheets you must first become completely familiar with the process of loading data into R’s memory by using either read.csv, for comma separated variable files or read_excel which can import data directly from an excel spreadsheet file. You need to know how to put together your own annotated markdown files, with embedded code chunks and annotated comments.
For each analysis an example data set is provided that is loaded from the /home/aqm/data folder on the server. The file is converted into a data table in the cribsheet. This data table can be exported and then used as the template for your own analysis.
To use the cribsheet, first look carefully at the format of the example data. Download this file and modify it in Excel, changing the values and headers to match your own data. Then build a markdown file using your own data as the input. Change any names of variables to match those used in your own data set. Providing you paste in chunks from the crib sheet in the right order you can then build your own bespoke analysis for your data that will reproduce the anaysis shown in the crobsheet. Order of the operations is very important, as some code chnuks are precursors to others. If you understand the logic of the analysis this will not be a problem.
This crib sheet shows how to run simple, introductory, statistical tests. These are very rarely the best way to analyse your data. Model based procedures are available that produce a more informative analysis in every case, including situations when a non-parametric test is often chosen.
d<-read.csv("/home/aqm/data/leaves.csv")
dt(d)
You may sometimes be given data in a wide format, with one column per group. This is not a standard data frame. The most consistent approach is to turn it into a data frame
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
## To long format
d2<-gather(d2,key=leaf_type,value=leaf_area)
dt(d2)
d2$id<-rep(1:20,times=2)
d2 %>% spread(key=leaf_type,value=leaf_area)
## id shade sun
## 1 1 24 25
## 2 2 24 33
## 3 3 25 25
## 4 4 36 23
## 5 5 36 24
## 6 6 35 13
## 7 7 32 25
## 8 8 44 20
## 9 9 22 26
## 10 10 32 24
## 11 11 26 17
## 12 12 28 20
## 13 13 30 19
## 14 14 26 24
## 15 15 23 35
## 16 16 29 25
## 17 17 39 27
## 18 18 35 25
## 19 19 29 32
## 20 20 35 27
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_box<- g0 +geom_boxplot()
g_box
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_conf <- g0 + stat_summary(fun.y=mean,geom="point") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_conf
g0<-ggplot(d,aes(x=leaf_type,y=leaf_area))
g_bar <- g0 + stat_summary(fun.y=mean,geom="bar") + stat_summary(fun.data=mean_cl_normal,geom="errorbar")
g_bar
t.test(d$leaf_area~d$leaf_type)
##
## Welch Two Sample t-test
##
## data: d$leaf_area by d$leaf_type
## t = 3.416, df = 37.343, p-value = 0.001546
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.462573 9.637427
## sample estimates:
## mean in group shade mean in group sun
## 30.50 24.45
d2<-read.csv("/home/aqm/data/leaves2.csv")
dt(d2)
wilcox.test(d2$shade,d2$sun)
## Warning in wilcox.test.default(d2$shade, d2$sun): cannot compute exact p-
## value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$shade and d2$sun
## W = 307.5, p-value = 0.003672
## alternative hypothesis: true location shift is not equal to 0
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
For the classic paired t-test function the long format is best changed to wide.
d<-read.csv("/home/aqm/data/paired1.csv")
dt(d)
## Change to wide
d %>% spread(time,val) %>% dt()
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
d %>% spread(time,val) -> d2
t.test(d2$After,d2$Before,paired=TRUE)
##
## Paired t-test
##
## data: d2$After and d2$Before
## t = 2.3941, df = 9, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3087225 10.8912775
## sample estimates:
## mean of the differences
## 5.6
The wide format seems the natural one to use in this case.
d2<-read.csv("/home/aqm/data/paired2.csv")
dt(d2)
wilcox.test(d2$Before,d2$After, paired=)
## Warning in wilcox.test.default(d2$Before, d2$After, paired = ): cannot
## compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: d2$Before and d2$After
## W = 26, p-value = 0.07522
## alternative hypothesis: true location shift is not equal to 0
Only the standard data frame format is sensible in this case. However a data frame may consist of many variables that can be correlated with each other. In this case more advanced methods avoid the need to correlate each pair in turn. Fitting regresion lines is included in other crib sheets.
d<-read.csv("/home/aqm/data/arne_pines_simple.csv")
dt(d)
g0<-ggplot(d,aes(x=twi,y=pine_density)) +geom_point()
g0
cor.test(d$pine_density,d$twi)
##
## Pearson's product-moment correlation
##
## data: d$pine_density and d$twi
## t = 0.53448, df = 38, p-value = 0.5961
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2313542 0.3874638
## sample estimates:
## cor
## 0.08638047
There are often ties, but this does not invalidate the test. R produces a warning in this case.
cor.test(d$pine_density,d$twi,method="spearman")
## Warning in cor.test.default(d$pine_density, d$twi, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: d$pine_density and d$twi
## S = 8988.5, p-value = 0.3339
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1567992
The data will originally have been collected though classifying each observation. So, if the data consists of mud cores that have been classified into two categories of substrate, mud or sand, and two categories depending whether ragworm are present or absent you will produces a csv file with the format as shown.
d<-read.csv("/home/aqm/data/HedisteCat.csv")
dt(d)
You might already have tabulated the data in Excel. Providing that the table is in the top cells of the first sheet of an Excel spreadsheet, this code will load the data.
library(readxl)
system ("cp contingency_table.xlsx /home/aqm/data/contingency_table.xlsx")
ct <-read_excel("/home/aqm/data/contingency_table.xlsx")
dt(ct)
ct<-table(d)
ct
## Cat
## Substrate Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
## These data are already in tabular format, but some slight rearrangement is needed to turn them into an R table.
ct <-read_excel("contingency_table.xlsx")
ct<-as.data.frame(ct)
row.names(ct) <- ct[,1]
ct<-ct[,-1]
ct<-as.table(as.matrix(ct))
ct
## Absent Present
## Mud 23 27
## Sand 44 16
### Formatted version for HTMl printing
ct %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 23 | 27 |
Sand | 44 | 16 |
pt<-round(prop.table(ct) *100,1)
pt
## Absent Present
## Mud 20.9 24.5
## Sand 40.0 14.5
### Formatted version for HTMl printing
pt %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 20.9 | 24.5 |
Sand | 40.0 | 14.5 |
ptr<-round(prop.table(ct,margin=1) *100,1)
ptr
## Absent Present
## Mud 46.0 54.0
## Sand 73.3 26.7
ptr %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 46.0 | 54.0 |
Sand | 73.3 | 26.7 |
ptc<-round(prop.table(table(d),margin=2) *100,1)
ptc
## Cat
## Substrate Absent Present
## Mud 34.3 62.8
## Sand 65.7 37.2
ptc %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Absent | Present | |
---|---|---|
Mud | 34.3 | 62.8 |
Sand | 65.7 | 37.2 |
mosaicplot(ct)
ct_d<-data.frame(ct)
bc<-ggplot(ct_d,aes(x=Var2,y=Freq))+geom_bar(stat="identity")
bc + facet_wrap(~Var1)
chisq.test(ct)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ct
## X-squared = 7.4482, df = 1, p-value = 0.00635
fisher.test(ct)
##
## Fisher's Exact Test for Count Data
##
## data: ct
## p-value = 0.005719
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1287875 0.7387656
## sample estimates:
## odds ratio
## 0.3132911