A researcher is interested in the differences in the diameters of trees at three different woodland sites in the New Forest. At each site there are several different species. To simplify the task we will consider only two types of tree: conifers and broadleaves. We will also simplify the exercise by assuming that the same number of trees (50) is sampled in each woodland.
Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.
The site column is simple. We just need to replicate the name of each site 50 times.
sites<-rep(c("site_1","site_2","site_3"),each=50)
The tree type column is more complex. If we just want an equal number of each type of tree we could do something like this.
tree<-rep(c("Bl","Con"),times=75)
d<-data.frame(sites,tree)
In reality we are more likely to have different numbers of each tree type per site, particularly if the trees are sampled randomly. We can come back to this, but let’s use equal numbers for the time being.
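As an aside, one possible way to obtain unequal numbers of each tree type per site is to draw the type of each tree at random, with a probability that depends on the site. This is only a sketch: the probabilities in `p_broadleaf`, the seed and the name `tree_unequal` are illustrative assumptions and are not used in the rest of the exercise.

```r
set.seed(42)  # arbitrary seed, only so that the sketch is reproducible

sites <- rep(c("site_1", "site_2", "site_3"), each = 50)

# Hypothetical probability that a sampled tree is a broadleaf at each site
p_broadleaf <- c(site_1 = 0.7, site_2 = 0.5, site_3 = 0.3)

# Look up the probability for each tree's own site, then draw its type
p_each <- unname(p_broadleaf[sites])
tree_unequal <- ifelse(runif(length(sites)) < p_each, "Bl", "Con")

# Counts of each tree type per site; each row still sums to 50
table(sites, tree_unequal)
```

Because the draw is random, the counts will differ between sites and between runs unless the seed is fixed.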
One way to go about the task is to assume that each site has its own overall mean diameter (diameter at breast height, DBH), which differs between sites.
site_dbh<-rep(c(20,30,50),each=50)
If the effects are additive, the effect of tree type can be simulated in the same way, as an amount that is subtracted from or added to the site mean.
tree_dbh<-rep(c(-5,5),times=75)
Add the two effects together, then add random “noise”.
dbh<-tree_dbh+site_dbh+rnorm(150,0,3)
d<-data.frame(d,dbh)
library(ggplot2)
g0<-ggplot(d,aes(y=dbh,x=tree))
g0+geom_boxplot() + facet_wrap(~sites)
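The exercise asks for measurements to the nearest cm, which the simulation above does not enforce, and the random draws will differ on every run. A minimal sketch of one way to handle both points is given here, assuming the same site means and tree effects as above. The seed value and the names `dbh_cm` and `d_cm` are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed so that the same "random" data are produced each run

sites    <- rep(c("site_1", "site_2", "site_3"), each = 50)
tree     <- rep(c("Bl", "Con"), times = 75)
site_dbh <- rep(c(20, 30, 50), each = 50)
tree_dbh <- rep(c(-5, 5), times = 75)

# round() with no digits argument rounds to whole numbers, i.e. the nearest cm
dbh_cm <- round(site_dbh + tree_dbh + rnorm(150, mean = 0, sd = 3))

d_cm <- data.frame(sites, tree, dbh = dbh_cm)

# Each group mean should be close to the site mean plus the tree effect,
# e.g. 20 - 5 = 15 for broadleaves at site_1
aggregate(dbh ~ sites + tree, data = d_cm, FUN = mean)
```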
There are many ways of making a more complex pattern of data using R. These do tend to involve using some additional coding tricks.
Let’s set up the data frame again, this time sampling the tree type at random so that the number of each type can differ between sites.
sites<-rep(c("site_1","site_2","site_3"),each=50)
tree<-sample(c("Bl","Con"),150,replace = TRUE)
d<-data.frame(sites,tree)
One approach is to build a function that takes each row of the data frame as an argument and returns the simulated DBH.
The most flexible approach produces a simulated DBH for each possible combination of site and species.
This is rather a “brute force” approach in the sense that every combination is identified separately rather than applying a single simple rule.
However it is easy to set up by simply cutting and pasting a set of lines and carefully changing the rule for each.
## d[1] is site and d[2] is species
## For each possible combination set up a mean and add a random component.
dbh_function<-function(d){
if (d[1]=="site_1" && d[2]=="Bl") dbh<-10 + rnorm(1,0,2)
if (d[1]=="site_1" && d[2]=="Con") dbh<-5 + rnorm(1,0,1)
if (d[1]=="site_2" && d[2]=="Bl") dbh<-20 + rnorm(1,0,3)
if (d[1]=="site_2" && d[2]=="Con") dbh<-40 + rnorm(1,0,4)
if (d[1]=="site_3" && d[2]=="Bl") dbh<-30 + rnorm(1,0,2)
if (d[1]=="site_3" && d[2]=="Con") dbh<-10 + rnorm(1,0,1)
round(dbh,1)
}
## Then use the apply command to apply the function to each row.
d$dbh<-apply(d,1,dbh_function)
g0<-ggplot(d,aes(y=dbh,x=tree))
g0+geom_boxplot() + facet_wrap(~sites)
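The same pattern can also be produced without a row-by-row function. The sketch below is an alternative, not the approach used above: it stores the mean and standard deviation for each combination in a small lookup table and relies on `rnorm()` being vectorised over its `mean` and `sd` arguments. The names `params`, `mu`, `sigma`, `idx` and `d2` are introduced here purely for illustration.

```r
set.seed(2)  # arbitrary seed, only so that the sketch is reproducible

sites <- rep(c("site_1", "site_2", "site_3"), each = 50)
tree  <- sample(c("Bl", "Con"), 150, replace = TRUE)
d2    <- data.frame(sites, tree)

# One row per combination, using the same means and standard deviations
# as in dbh_function
params <- data.frame(
  sites = rep(c("site_1", "site_2", "site_3"), each = 2),
  tree  = rep(c("Bl", "Con"), times = 3),
  mu    = c(10, 5, 20, 40, 30, 10),
  sigma = c(2, 1, 3, 4, 2, 1)
)

# For each observation, find the row of params with the same site and tree type
idx <- match(paste(d2$sites, d2$tree), paste(params$sites, params$tree))

# A single call to rnorm() then simulates all 150 trees at once
d2$dbh <- round(rnorm(nrow(d2), mean = params$mu[idx], sd = params$sigma[idx]), 1)

head(d2)
```

Changing the simulated pattern then only requires editing the numbers in `params`, rather than editing one `if` line per combination.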
g0<-ggplot(d,aes(y=dbh,x=tree))
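## Plot the group means as points (ggplot2 >= 3.3.0 uses fun; older versions used fun.y)
g0<-g0+stat_summary(fun=mean,geom="point") + facet_wrap(~sites)
## Add confidence intervals for the means (mean_cl_normal needs the Hmisc package installed)
g0<-g0+stat_summary(fun.data=mean_cl_normal,geom="errorbar")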
g0
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
d %>% group_by(sites,tree) %>% summarise(n=n(),mean_dbh=mean(dbh))
## # A tibble: 6 x 4
## # Groups:   sites [?]
##   sites  tree      n mean_dbh
##   <fct>  <fct> <int>    <dbl>
## 1 site_1 Bl       19     9.89
## 2 site_1 Con      31     5.28
## 3 site_2 Bl       25    20.3
## 4 site_2 Con      25    40.9
## 5 site_3 Bl       25    30.1
## 6 site_3 Con      25     9.80