library(tidyverse)
A researcher is interested in looking at the differences in diameters of trees in three different woodland sites in the New Forest. At each site there are several different species. In order to simplify the task we will consider only two types of trees … conifers and broadleaves. We will also simplify the exercise by assuming that the same number of trees (50) are sampled in each woodland.
Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.
This is simple. We just need to replicate the name of the site 50 times.
sites<-rep(c("site_1","site_2","site_3"),each=50)
This is more complex. If we just want an equal number of each type of tree we could do something like this. We’ll call the trees “species” although this refers simply to the distinction between broadleaves and conifers in this particular case.
species<-rep(c("Bl","Con"),times=75)
d<-data.frame(sites,species)
In reality we are more likely to have different numbers of trees types per site, particularly if the trees are sampled randomly. We can go back to this, but let’s use this for the time being.
Now one way to go about the task would be assume that there is an overall mean diameter for each site, which differs between sites.
site_dbh<-rep(c(20,30,50),each=50)
If the effects are additive this could be simulated very simply in the same way.
tree_dbh<-rep(c(-5,5),times=75)
Add the two effects together then add random “noise”. The noise in this case is uniform accross all combinations of sites and species. This is not realistic in practice, but this can serve as a “null model” for analysis. A diagnostic test for heterogeneity of variance should show that the variance is homogeneous.
dbh<-tree_dbh+site_dbh+rnorm(150,0,3)
d<-data.frame(d,dbh)
library(ggplot2)
g0<-ggplot(d,aes(y=dbh,x=species)) + facet_wrap(~sites)
g0+geom_boxplot()
The aqm package includes a quick way of forming a confidence interval plot.
library(aqm)
ci(g0)
## Warning: `fun.y` is deprecated. Use `fun` instead.
mod<-lm(data=d,dbh~sites*species)
summary(mod)
##
## Call:
## lm(formula = dbh ~ sites * species, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7128 -1.7015 0.0183 1.7090 8.3320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0680 0.5998 25.120 <2e-16 ***
## sitessite_2 10.1752 0.8483 11.995 <2e-16 ***
## sitessite_3 29.2013 0.8483 34.423 <2e-16 ***
## speciesCon 10.6475 0.8483 12.551 <2e-16 ***
## sitessite_2:speciesCon -1.1722 1.1997 -0.977 0.330
## sitessite_3:speciesCon 1.0532 1.1997 0.878 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.999 on 144 degrees of freedom
## Multiple R-squared: 0.9547, Adjusted R-squared: 0.9531
## F-statistic: 606.4 on 5 and 144 DF, p-value: < 2.2e-16
The model shows no significant interaction between species and sites, which is unrealistic in practice, but represents the assumptions made when setting up the dummy data. We will look at the meaning of statistical interactions later in the unit.
There are many ways of making a more complex pattern of data using R. These do tend to involve using some additional coding tricks.
Let’s set up the data frame again for reference.
sites<-rep(c("site_1","site_2","site_3"),each=50)
species<-sample(c("Bl","Con"),150,replace = TRUE)
d<-data.frame(sites,species)
str(d)
## 'data.frame': 150 obs. of 2 variables:
## $ sites : Factor w/ 3 levels "site_1","site_2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ species: Factor w/ 2 levels "Bl","Con": 1 1 1 2 2 1 1 1 2 1 ...
One way of adding in an expected diameter for each combination of site and species is to produce a data frame containing the expected values separately.
expected<-data.frame(sites=rep(c("site_1","site_2","site_3"),each=2),
species= rep(c("Bl","Con"),times=3), dbh=c(20,22,30,25,35,32))
expected
## sites species dbh
## 1 site_1 Bl 20
## 2 site_1 Con 22
## 3 site_2 Bl 30
## 4 site_2 Con 25
## 5 site_3 Bl 35
## 6 site_3 Con 32
Now we can use dplyr to join the two data frames. Joining data frames is a powerful way of working with more complex data sets. For a join the work smoothly using the default settings there must be some (but not all) columns in the two data frames being joined that exactly match both in terms of names and data type.
d %>% right_join(expected) -> d
head(d)
## sites species dbh
## 1 site_1 Bl 20
## 2 site_1 Bl 20
## 3 site_1 Bl 20
## 4 site_1 Con 22
## 5 site_1 Con 22
## 6 site_1 Bl 20
To add the randmom noise we will use another dplyr trick. By applying a function in a rowise fashion we can use rnorm to produce one random draw from a normal distribution with a mean determined by the dbh for the row and a given standrad deviation (3)
d %>% rowwise() %>% mutate(dbh=round(rnorm(1,dbh,3),1)) -> d
head(d)
## # A tibble: 6 x 3
## # Rowwise:
## sites species dbh
## <fct> <fct> <dbl>
## 1 site_1 Bl 17.9
## 2 site_1 Bl 18.6
## 3 site_1 Bl 15.2
## 4 site_1 Con 22.1
## 5 site_1 Con 18.2
## 6 site_1 Bl 23.5
library(ggplot2)
g0<-ggplot(d,aes(y=dbh,x=species)) + facet_wrap(~sites)
g0+geom_boxplot()
Plotting the confidence intervals.
ci(g0)
## Warning: `fun.y` is deprecated. Use `fun` instead.
In this example a data frame with a single value for each combination is set up intially. A column holding a specific standard deviation is also added. This allows heterogeneity of variance to be included.
diams<-data.frame(sites=rep(c("site_1","site_2","site_3"),each=2),
species= rep(c("Bl","Con"),times=3), mean=c(20,22,30,25,35,32),sd=c(1,3,2,4,3,5))
diams
## sites species mean sd
## 1 site_1 Bl 20 1
## 2 site_1 Con 22 3
## 3 site_2 Bl 30 2
## 4 site_2 Con 25 4
## 5 site_3 Bl 35 3
## 6 site_3 Con 32 5
Now this time we’ll just sample randomly from this data frame. This results in differening numbers of trees for each combination.
diams<-sample_n(diams,200,replace=TRUE)
head(diams)
## sites species mean sd
## 1 site_3 Bl 35 3
## 2 site_2 Con 25 4
## 3 site_3 Bl 35 3
## 4 site_3 Bl 35 3
## 5 site_3 Bl 35 3
## 6 site_1 Con 22 3
diams %>% rowwise() %>% mutate(dbh=round(rnorm(1,mean,sd),1)) -> diams
The relationship between height and diameter is likely to be non-linear. It is also going to be related to the type of tree (conifers are usually higher for a given diameter). We’ll ignore the non linearity, but condition on species.
h_func<-function(species, diam){
if (species == "Bl") height<-2+diam*0.4 +rnorm(1,0,3)
if (species == "Con") height<-5+diam*0.5 +rnorm(1,0,3)
round(height,1)
}
diams %>% rowwise() %>% mutate(height=h_func(species,dbh)) -> diams
diams
## # A tibble: 200 x 6
## # Rowwise:
## sites species mean sd dbh height
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 site_3 Bl 35 3 32.8 17.4
## 2 site_2 Con 25 4 14.6 17.7
## 3 site_3 Bl 35 3 35.6 15.7
## 4 site_3 Bl 35 3 31.7 8.4
## 5 site_3 Bl 35 3 33.7 15.1
## 6 site_1 Con 22 3 24.6 17.9
## 7 site_1 Con 22 3 21.8 8.6
## 8 site_1 Bl 20 1 19.2 9.8
## 9 site_1 Bl 20 1 19.3 7.3
## 10 site_3 Con 32 5 36.3 19.5
## # … with 190 more rows
ggplot(diams,aes(x=dbh,y=height,col=species)) + geom_point() + geom_smooth(method="lm")