Assignment data set

This data set is based on a real-life study in Poole Harbour. The data have been altered slightly to make it easier to apply the methods taught on the course and to test your understanding of statistical issues.

A key question for the PhD student conducting the study concerned the effect of the green algal mat that can be seen growing over the mudflats. The mat coverage may be an effect of eutrophication of the estuary, and so could have an impact on the population of waders in Poole Harbour through its knock-on effect on key food species within the trophic chain.

Your task is to investigate the relationship between algal mat cover, expressed as a percentage, and ragworm numbers expressed as a count. Ragworm are an important food source for many waders.

The main problem the researcher faced is that other factors also influence ragworm numbers. These factors may in turn be related to algal mat cover.

This is therefore a deceptively simple, but actually quite tricky, ecological data set. The question does not involve any deep theoretical concepts. The results are purely empirical. However, if these were the real data, the implications of the analysis could be used to influence policy with regard to the control of eutrophication in Poole Harbour.

The data are challenging to analyse: multicollinearity is involved, causality cannot really be established, and no direct experimental manipulation was used. There may be other complicating elements to take into account, such as the fact that the measurements are counts and that responses may be curvilinear rather than straight lines.
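One way to see how strongly the explanatory variables overlap is to look at their pairwise correlations and variance inflation factors. The sketch below is only an illustration: it uses simulated stand-in values (the data frame `sim` and the coefficients used to generate it are assumptions, not the assignment data) and computes the variance inflation factor by hand with base R rather than relying on any particular package.

```r
# Simulated stand-in data: NOT the assignment data set.
set.seed(1)
n <- 200
salinity <- round(runif(n, 10, 35))
# mat is generated to depend on salinity, so the predictors are collinear
mat <- pmin(100, pmax(0, round(120 - 3 * salinity + rnorm(n, sd = 15))))
grain <- round(200 + 2 * salinity + rnorm(n, sd = 15))
sim <- data.frame(mat, salinity, grain)

# Pairwise correlations among the explanatory variables
round(cor(sim), 2)

# Variance inflation factor for each variable: 1 / (1 - R^2), where R^2
# comes from regressing that variable on all of the others
vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  fit <- lm(reformulate(others, response = v), data = sim)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 2)
```

Values close to 1 suggest little overlap between a variable and the others. Larger values indicate that the variable's effect will be hard to separate from theirs.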

There is no single “right” way to address an observational data set like this. Assumptions of statistical models will always be violated to some extent. However, with care, it is possible to find a justifiable model.

I expect you to apply a range of the skills learned on the course, including data visualisations and model diagnostics, to tease apart the data and come up with the most defensible model, given the real-life challenges that data sets such as these present. The data set is similar in structure to those used in the exercises on the course and so should be fairly familiar. It should not be too time-consuming to actually run the analyses. The main challenge is to find how the variables are acting in combination.

The first steps will involve bivariate analyses in order to look at relationships between the variables. Note that there are no categorical variables involved, so unless a variable is pre-classified into discrete levels (e.g. High vs Low) or analysed using recursive partitioning (tree models) there is no possibility of analysing interactions. There are also no features in the data set that represent blocking variables or random effects. In the discussion, you may wish to suggest a more robust, alternative design that could be used to take potential lack of independence into account (see suggestions for topics below).
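A first bivariate look, and the pre-classification of a continuous variable into discrete levels, might be sketched as follows. The values are simulated purely for illustration (the data frame `sim`, the generating coefficients, the column name `sal_class` and the choice of the median as the cut point are all assumptions), so the output says nothing about the real data.

```r
# Simulated stand-in data: NOT the assignment data set.
set.seed(2)
n <- 200
sim <- data.frame(
  mat = round(runif(n, 0, 100)),
  salinity = round(runif(n, 10, 35))
)
sim$rag <- rpois(n, lambda = exp(0.5 + 0.01 * sim$mat))

# Rank-based correlation does not assume a straight-line relationship
cor(sim$mat, sim$rag, method = "spearman")

# Scatter plot with a lowess smoother to show the form of the trend
plot(rag ~ mat, data = sim)
lines(lowess(sim$mat, sim$rag), col = "red")

# Classify salinity into two levels, split at its median
sim$sal_class <- cut(sim$salinity,
                     breaks = c(-Inf, median(sim$salinity), Inf),
                     labels = c("Low", "High"))
table(sim$sal_class)
```

Once a variable has been classified in this way it can be used as a factor, for example to draw separate scatter plots for each level.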

Models can be built that include all the variables, or subsets of the variables. Be careful to find the right family to represent the stochastic variability (\(\epsilon\)) used in the models. There is no guarantee that relationships are linear (in the sense of being represented by straight lines plotted through scatter plots). Controlling for confounding effects is likely to be necessary, and this may influence both the strength and form of relationships.
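The points above about family, curvature and confounding can be illustrated with a short sketch. It is one possible route among several, not a recommended answer: the data are simulated (negative binomial counts with assumed coefficients), and the model names `m_pois`, `m_quasi` and `m_curve` are illustrative.

```r
# Simulated stand-in data: NOT the assignment data set.
set.seed(3)
n <- 200
sim <- data.frame(
  mat = round(runif(n, 0, 100)),
  salinity = round(runif(n, 10, 35))
)
# Overdispersed counts: negative binomial draws with a log-linear mean
mu <- exp(1.5 + 0.005 * sim$mat - 0.04 * sim$salinity)
sim$rag <- rnbinom(n, mu = mu, size = 1)

# Poisson GLM, with salinity included to adjust the mat effect
m_pois <- glm(rag ~ mat + salinity, family = poisson, data = sim)

# Rough overdispersion check: Pearson chi-square / residual degrees of freedom
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
dispersion

# Quasi-Poisson keeps the same coefficients but scales the standard errors
m_quasi <- glm(rag ~ mat + salinity, family = quasipoisson, data = sim)

# A quadratic term is one simple way to allow a curved response to mat
m_curve <- glm(rag ~ poly(mat, 2) + salinity, family = quasipoisson, data = sim)
anova(m_quasi, m_curve, test = "F")
```

A dispersion value well above 1 suggests that the Poisson family understates the variability. Diagnostic plots of each fitted model remain necessary whichever family is chosen.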

A small subset of the potential variables is provided, in order to simplify analysis.

  1. mat: Percent cover of algal mat
  2. salinity: Salinity of the water column measured in pppm.
  3. grain: Mean grain size in micrometers
  4. rag: Count of number of ragworm heads (Worms break into pieces so the number usually refers to heads and jaws)

Running the actual analysis should not take long, once the main underlying patterns have been determined. There are no prizes for finding significance. In fact, given the realistic level of variability and the potential for confounding, very few relationships are likely to be significant. Showing this clearly is just as useful as showing significance.

The discussion should include a critique of the design with the aim of suggesting a follow-up study that could, at least partially, overcome some of the statistical issues. Questions may include:

  1. How could a more robust, pseudo-experimental, approach have been taken?
  2. How could spatially explicit analysis help to address the issue more robustly?
  3. How could the response be measured more effectively in order to provide more informative data?
The data can be read in and inspected as follows:

d <- read.csv("/home/aqm/course/data/aqm2018_dataset.csv")
str(d)
## 'data.frame':    200 obs. of  4 variables:
##  $ mat     : int  100 36 47 75 56 12 72 13 1 99 ...
##  $ salinity: int  15 27 33 17 13 28 23 30 34 13 ...
##  $ grain   : int  210 265 273 236 225 265 244 240 260 239 ...
##  $ rag     : int  4 2 0 2 7 0 3 0 1 11 ...
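
As a quick illustration of the kind of check that is worth running on the full data, the sketch below uses only the first ten rows visible in the `str()` output above. The data frame name `head10` is illustrative, and ten rows are far too few to draw any conclusions from.

```r
# First ten rows as printed in the str() output; not the full 200 rows
head10 <- data.frame(
  mat      = c(100, 36, 47, 75, 56, 12, 72, 13, 1, 99),
  salinity = c(15, 27, 33, 17, 13, 28, 23, 30, 34, 13),
  grain    = c(210, 265, 273, 236, 225, 265, 244, 240, 260, 239),
  rag      = c(4, 2, 0, 2, 7, 0, 3, 0, 1, 11)
)

# For Poisson-distributed counts the variance should be close to the mean
rag_mean <- mean(head10$rag)
rag_var <- var(head10$rag)
c(mean = rag_mean, variance = rag_var)

# Rank correlations among all four variables
round(cor(head10, method = "spearman"), 2)
```

In these ten rows the mean count is 3 and the variance is about 12.7, which hints that the same comparison is worth making on all 200 observations before settling on a family.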