Introduction

One of the most common types of ecological studies involves the collection of data using quadrats, transects, mud cores or some other systematic measuring protocol that allows the abundance of species to be measured or estimated within a sampling unit. Note that although it is tempting to call a quadrat or transect a “sample” this is not formally correct. In statsistical termshe sample consists of all the quadrats taken together. These data can be combined with other data sources to allow a wide range of ecological questions to be addressed. Typically we are interested in the realtionships between the species and environmental variables measured at the sites where the species are found.

The vegan (from vegetation analysis) package in R has a wide range of functions that can be applied to such data, including a many for multivariate analysis of ecological community data.

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-4
library(aqm)
## 
## Attaching package: 'aqm'
## The following object is masked from 'package:stats':
## 
##     dt
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The first step towards applying these methods is to format the data in a conventional manner. This enables the results of field work can be input to the functions. The vegan package contains a classic data set consisting of 20 quadrats placed in Dutch dune meadows as an example. These data are included as they were first used by Ter Braak to demonstrate the Canoco software that the R functions are based on.

The sites by species matrix

Most analystical methods for multiple species abundances take a matrix as the direct input. Note: You can download the example tables as csv or Excel files in order to use them as templates for your own data collection

data(dune)
dt(dune)

Although such a matrix can be treated as a data frame in fact the formal data structure of a matrix is actually rather different. A correctly designed data frame has only one column per variable. A sites by species matrix will have as many columns as there are species. Each column contains measurements of the same type. This breaks the rule of using only one column per measurement type that should always be used when deciding how to collect and store your data.

If you tried to collect your data using the sites by species matrix format you would either have to design a wide spreadsheet or to draw up a wide table in a field notebook. You would have to know how many species you are going to find before starting and place their names as column headers.

Although this would work for a Dutch meadows with relatively few species it is rather unwieldy and impractical. This is particularly apparent for tropical studies with many hundreds of potential species. It is also a very awkward format to use if you are taking several different measures of abundance or dominance at the same time (e.g. count of numbers of individuals and total biomass). This would require starting a new data table for each measurement. Many species do not occur in many quadrats so the table will consist mainly of blank entries. If you decide to simplify the table to measures of abundance of genus, species of functional type there is no obvious way of automating the process as a function and you may have to carry out a lot of tedious calculations.

The long format

The simplest, and formally correct, way to collect and hold quadrat (or transect etc) data is to use a conventional long data frame format with one column per variable.

The aqm package has an example derived from the vegan dune data.

data("dune_long")
dt(dune_long)

This is much more compact. There are no spaces for missing data. Each record consists of an indicator of the quadrat, transect etc from which it was taken, a consistent species id and one or more abundance measures.

Long format to matrix format

There are many different ways to produce a species by sites matrix in R from the long format version. The tidyverse set of packages have provided at least three different approaches to the task. The most recent form of syntax is to use the pivot_wider function. Notice that it is often a good idea to arrange the columns and rows first by species and then by quadrat.

dune_long %>% 
  arrange(species) %>% 
  pivot_wider(id_cols= quad,names_from = species,values_from = abundance,   values_fill = list(abundance=0)) %>% 
  arrange(quad)-> dune_wide

dt(dune_wide)

This data frame has an explicit column for the quadrat identity rather than simply assuming that the rows are in quadrat order. This avoids errors when combining site by species matricies with site by environmental variable matrices.

Reverse pivot (long to wide)

The reverse procedure will return the long format of the data.

dune_wide %>%
  pivot_longer(cols=-quad,names_to="species",values_to = "abundance") %>% filter(abundance>0) %>% dt()

The environment matrix

The environmental variables recorded for each quadrat are held in a data frame with a common identifying column.

data("dune_env")
dt(dune_env)

Including a column with a unique quadrat identifier for both data frames makes it easy to safely merge the two sets of data without worrying that a change in the order of the rows may result in the two sets of data failing to line up correcly. The base R merge command defaults to looking for a column with the same name in each of the data frames to merge. The colummn must contain unique identifiers for rows in the data frames. It is very important to ensure that the identifiers are unique and that there are no other columns with shared names. If this is the case then the two data frames will be lines up with each other to form one wider data frame.

merge(dune_env,dune_wide) -> env_species
dt(env_species)

Forming matrices

After the merge operation we can cut out two matrices representing species by sites and environmental variables by sites.

str(env_species)
## 'data.frame':    20 obs. of  36 variables:
##  $ quad      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Manure    : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
##  $ Use       : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
##  $ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
##  $ Moisture  : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
##  $ A1        : num  2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
##  $ Achimill  : num  1 3 0 0 2 2 2 0 0 4 ...
##  $ Agrostol  : num  0 0 4 8 0 0 0 4 3 0 ...
##  $ Airaprae  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alopgeni  : num  0 2 7 2 0 0 0 5 3 0 ...
##  $ Anthodor  : num  0 0 0 0 4 3 2 0 0 4 ...
##  $ Bellpere  : num  0 3 2 2 2 0 0 0 0 2 ...
##  $ Bracruta  : num  0 0 2 2 2 6 2 2 2 2 ...
##  $ Bromhord  : num  0 4 0 3 2 0 2 0 0 4 ...
##  $ Callcusp  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Chenalbu  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cirsarve  : num  0 0 0 2 0 0 0 0 0 0 ...
##  $ Comapalu  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Eleopalu  : num  0 0 0 0 0 0 0 4 0 0 ...
##  $ Elymrepe  : num  4 4 4 4 4 0 0 0 6 0 ...
##  $ Empenigr  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyporadi  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Juncarti  : num  0 0 0 0 0 0 0 4 4 0 ...
##  $ Juncbufo  : num  0 0 0 0 0 0 2 0 4 0 ...
##  $ Lolipere  : num  7 5 6 5 2 6 6 4 2 6 ...
##  $ Planlanc  : num  0 0 0 0 5 5 5 0 0 3 ...
##  $ Poaprat   : num  4 4 5 4 2 3 4 4 4 4 ...
##  $ Poatriv   : num  2 7 6 5 6 4 5 4 5 4 ...
##  $ Ranuflam  : num  0 0 0 0 0 0 0 2 0 0 ...
##  $ Rumeacet  : num  0 0 0 0 5 6 3 0 2 0 ...
##  $ Sagiproc  : num  0 0 0 5 0 0 0 2 2 0 ...
##  $ Salirepe  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Scorautu  : num  0 5 2 2 3 3 3 3 2 3 ...
##  $ Trifprat  : num  0 0 0 0 2 5 2 0 0 0 ...
##  $ Trifrepe  : num  0 5 2 1 2 5 2 2 3 6 ...
##  $ Vicilath  : num  0 0 0 0 0 0 0 0 0 1 ...
env_mat<-env_species[,2:6]
sp_mat<-env_species[,7:36]

Now the matrices can be used in any of the analyses included in vegan. For example the following code runs an analysis of species composition similarities between usage types. The analysis can be intepreted as not providing evidence of a difference in composition attributable to usage class, but the usual caveats regarding sample size and heterogeneity in the explanatory variable must be taken into account.

dist <- vegdist(sp_mat)
ano <- anosim(dist, env_mat$Use)
summary(ano)
## 
## Call:
## anosim(x = dist, grouping = env_mat$Use) 
## Dissimilarity: bray 
## 
## ANOSIM statistic R: 0.05046 
##       Significance: 0.212 
## 
## Permutation: free
## Number of permutations: 999
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.101 0.146 0.180 0.222 
## 
## Dissimilarity ranks between and within classes:
##            0%    25%    50%     75% 100%   N
## Between   1.0 45.500  98.50 145.500  188 131
## Hayfield 15.0 68.000 107.00 134.000  181  21
## Haypastu  3.0 36.250  80.25 117.875  188  28
## Pasture  27.5 53.875  63.50 158.375  167  10
plot(ano)

Adding species traits and phyologenetic detials

The advantages of using the correct long data format are very apparent if you want to use additional details on species characteristics in your analysis. For example you may want to simplify the data to genus or family rather than species.

data(dune_taxon)
dt(dune_taxon)
merge(dune_long,dune_taxon) %>% arrange(quad) -> dune_long_taxon
dt(dune_long_taxon)
dune_long_taxon %>% arrange(family) %>% 
  pivot_wider(id_cols= quad,names_from = family,values_from = abundance,   values_fill = list(abundance=0), values_fn = list(abundance=sum)) %>% 
  arrange(quad) -> fam_mat

dt(fam_mat)

Functional diversity analysis

data(dune_traits)
dune_traits %>% arrange(species) -> dune_traits
dt(dune_traits)
traits_mat<-dune_traits[,-1]
row.names(traits_mat)<-dune_traits$species
names(sp_mat)%in%row.names(traits_mat)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
sp_mat<-sp_mat[,names(sp_mat)%in%row.names(traits_mat)]
library(SYNCSA)
rao.diversity(sp_mat, traits = traits_mat)
## $call
## rao.diversity(comm = sp_mat, traits = traits_mat)
## 
## $Simpson
##  [1] 0.7345679 0.8900227 0.8684211 0.8934559 0.9077930 0.8900227 0.9002770
##  [8] 0.9016620 0.9050000 0.8958953 0.8469388 0.8491155 0.8521579 0.8000000
## [15] 0.8254848 0.7840237 0.8355556 0.8526077 0.8571429 0.8229167
## 
## $FunRao
##  [1] 0.3599990 0.4016799 0.3810391 0.4162532 0.4433912 0.4111376 0.4298531
##  [8] 0.3818508 0.4539125 0.4031509 0.3973762 0.3701427 0.3571318 0.3437445
## [15] 0.3690043 0.3602473 0.4031852 0.4253193 0.4195349 0.3902108
## 
## $FunRedundancy
##  [1] 0.3745689 0.4883428 0.4873820 0.4772027 0.4644018 0.4788851 0.4704239
##  [8] 0.5198113 0.4510875 0.4927444 0.4495626 0.4789728 0.4950261 0.4562555
## [15] 0.4564805 0.4237764 0.4323704 0.4272884 0.4376079 0.4327059