library(tidyverse)

Exercise specification

A researcher is interested in looking at the differences in diameters of trees in three different woodland sites in the New Forest. At each site there are several different species. In order to simplify the task we will consider only two types of trees … conifers and broadleaves. We will also simplify the exercise by assuming that the same number of trees (50) are sampled in each woodland.

Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.

Solution using base R

Three different sites

This is simple. We just need to replicate the name of the site 50 times.

sites<-rep(c("site_1","site_2","site_3"),each=50)

Two types of tree

This is more complex. If we just want an equal number of each type of tree we could do something like this. We’ll call the trees “species” although this refers simply to the distinction between broadleaves and conifers in this particular case.

species<-rep(c("Bl","Con"),times=75)
d<-data.frame(sites,species)

In reality we are more likely to have different numbers of trees types per site, particularly if the trees are sampled randomly. We can go back to this, but let’s use this for the time being.

Differences in diameter per site.

Now one way to go about the task would be assume that there is an overall mean diameter for each site, which differs between sites.

site_dbh<-rep(c(20,30,50),each=50)

Differences in diameter per type

If the effects are additive this could be simulated very simply in the same way.

tree_dbh<-rep(c(-5,5),times=75)

Forming the underlying data

Add the two effects together then add random “noise”. The noise in this case is uniform accross all combinations of sites and species. This is not realistic in practice, but this can serve as a “null model” for analysis. A diagnostic test for heterogeneity of variance should show that the variance is homogeneous.

dbh<-tree_dbh+site_dbh+rnorm(150,0,3)
d<-data.frame(d,dbh)

Plot the results

library(ggplot2)
g0<-ggplot(d,aes(y=dbh,x=species))  + facet_wrap(~sites)
g0+geom_boxplot()

The aqm package includes a quick way of forming a confidence interval plot.

library(aqm)
ci(g0)
## Warning: `fun.y` is deprecated. Use `fun` instead.

Statistical analysis using simple two way anova

mod<-lm(data=d,dbh~sites*species)
summary(mod)
## 
## Call:
## lm(formula = dbh ~ sites * species, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7128 -1.7015  0.0183  1.7090  8.3320 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             15.0680     0.5998  25.120   <2e-16 ***
## sitessite_2             10.1752     0.8483  11.995   <2e-16 ***
## sitessite_3             29.2013     0.8483  34.423   <2e-16 ***
## speciesCon              10.6475     0.8483  12.551   <2e-16 ***
## sitessite_2:speciesCon  -1.1722     1.1997  -0.977    0.330    
## sitessite_3:speciesCon   1.0532     1.1997   0.878    0.381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.999 on 144 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.9531 
## F-statistic: 606.4 on 5 and 144 DF,  p-value: < 2.2e-16

The model shows no significant interaction between species and sites, which is unrealistic in practice, but represents the assumptions made when setting up the dummy data. We will look at the meaning of statistical interactions later in the unit.

More complex version with interactions

There are many ways of making a more complex pattern of data using R. These do tend to involve using some additional coding tricks.

Let’s set up the data frame again for reference.

sites<-rep(c("site_1","site_2","site_3"),each=50)
species<-sample(c("Bl","Con"),150,replace = TRUE)
d<-data.frame(sites,species)
str(d)
## 'data.frame':    150 obs. of  2 variables:
##  $ sites  : Factor w/ 3 levels "site_1","site_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ species: Factor w/ 2 levels "Bl","Con": 1 1 1 2 2 1 1 1 2 1 ...

One way of adding in an expected diameter for each combination of site and species is to produce a data frame containing the expected values separately.

expected<-data.frame(sites=rep(c("site_1","site_2","site_3"),each=2), 
                  species= rep(c("Bl","Con"),times=3), dbh=c(20,22,30,25,35,32))

expected
##    sites species dbh
## 1 site_1      Bl  20
## 2 site_1     Con  22
## 3 site_2      Bl  30
## 4 site_2     Con  25
## 5 site_3      Bl  35
## 6 site_3     Con  32

Now we can use dplyr to join the two data frames. Joining data frames is a powerful way of working with more complex data sets. For a join the work smoothly using the default settings there must be some (but not all) columns in the two data frames being joined that exactly match both in terms of names and data type.

d %>% right_join(expected) -> d
head(d)
##    sites species dbh
## 1 site_1      Bl  20
## 2 site_1      Bl  20
## 3 site_1      Bl  20
## 4 site_1     Con  22
## 5 site_1     Con  22
## 6 site_1      Bl  20

To add the randmom noise we will use another dplyr trick. By applying a function in a rowise fashion we can use rnorm to produce one random draw from a normal distribution with a mean determined by the dbh for the row and a given standrad deviation (3)

d %>% rowwise() %>% mutate(dbh=round(rnorm(1,dbh,3),1)) -> d
head(d)
## # A tibble: 6 x 3
## # Rowwise: 
##   sites  species   dbh
##   <fct>  <fct>   <dbl>
## 1 site_1 Bl       17.9
## 2 site_1 Bl       18.6
## 3 site_1 Bl       15.2
## 4 site_1 Con      22.1
## 5 site_1 Con      18.2
## 6 site_1 Bl       23.5
library(ggplot2)
g0<-ggplot(d,aes(y=dbh,x=species)) + facet_wrap(~sites)
g0+geom_boxplot() 

Plotting the confidence intervals.

ci(g0)
## Warning: `fun.y` is deprecated. Use `fun` instead.

A more flexible and extendable method using the tidyverse

In this example a data frame with a single value for each combination is set up intially. A column holding a specific standard deviation is also added. This allows heterogeneity of variance to be included.

diams<-data.frame(sites=rep(c("site_1","site_2","site_3"),each=2), 
                  species= rep(c("Bl","Con"),times=3), mean=c(20,22,30,25,35,32),sd=c(1,3,2,4,3,5)) 

diams
##    sites species mean sd
## 1 site_1      Bl   20  1
## 2 site_1     Con   22  3
## 3 site_2      Bl   30  2
## 4 site_2     Con   25  4
## 5 site_3      Bl   35  3
## 6 site_3     Con   32  5

Now this time we’ll just sample randomly from this data frame. This results in differening numbers of trees for each combination.

diams<-sample_n(diams,200,replace=TRUE)
head(diams)
##    sites species mean sd
## 1 site_3      Bl   35  3
## 2 site_2     Con   25  4
## 3 site_3      Bl   35  3
## 4 site_3      Bl   35  3
## 5 site_3      Bl   35  3
## 6 site_1     Con   22  3
diams %>%  rowwise() %>% mutate(dbh=round(rnorm(1,mean,sd),1)) -> diams

Simulating heights

The relationship between height and diameter is likely to be non-linear. It is also going to be related to the type of tree (conifers are usually higher for a given diameter). We’ll ignore the non linearity, but condition on species.

h_func<-function(species, diam){
  if (species == "Bl") height<-2+diam*0.4 +rnorm(1,0,3)
  if (species == "Con") height<-5+diam*0.5 +rnorm(1,0,3)
  round(height,1)
}
diams %>%  rowwise() %>% mutate(height=h_func(species,dbh)) -> diams
diams
## # A tibble: 200 x 6
## # Rowwise: 
##    sites  species  mean    sd   dbh height
##    <fct>  <fct>   <dbl> <dbl> <dbl>  <dbl>
##  1 site_3 Bl         35     3  32.8   17.4
##  2 site_2 Con        25     4  14.6   17.7
##  3 site_3 Bl         35     3  35.6   15.7
##  4 site_3 Bl         35     3  31.7    8.4
##  5 site_3 Bl         35     3  33.7   15.1
##  6 site_1 Con        22     3  24.6   17.9
##  7 site_1 Con        22     3  21.8    8.6
##  8 site_1 Bl         20     1  19.2    9.8
##  9 site_1 Bl         20     1  19.3    7.3
## 10 site_3 Con        32     5  36.3   19.5
## # … with 190 more rows
ggplot(diams,aes(x=dbh,y=height,col=species)) + geom_point() + geom_smooth(method="lm")