Assignment data set

This data set is based on a real-life study in Poole Harbour. The data have been altered slightly to make it easier to apply the methods taught on the course and to test your understanding of statistical issues.

A key question for the PhD student conducting the study concerned the effect of the green algal mat that can be seen growing over the mudflats. The mat coverage may be a consequence of eutrophication of the estuary, and so could affect the population of waders in Poole Harbour through its knock-on effect on key food species within the trophic chain.

Your task is to investigate the relationship between algal mat cover, expressed as a percentage, and ragworm numbers, expressed as a count. Ragworm are an important food source for many waders.

The main problem the researcher faced is that other factors also influence ragworm numbers. These factors may in turn be related to algal mat cover.

This is therefore a deceptively simple, but actually quite tricky, ecological data set. The question does not involve any deep theoretical concepts, and the results are purely empirical. However, if these were the real data, the results of the analysis could be used to influence policy on the control of eutrophication in Poole Harbour.

The data are challenging to analyse: multicollinearity is involved, causality cannot really be established, and no direct experimental manipulation was used. There may be other complications to take into account, such as the fact that the measurements are counts and that responses may be curvilinear rather than straight lines.
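As a rough illustration of how collinearity between explanatory variables can be quantified, the sketch below uses simulated data rather than the rag data. The names mat and mud are hypothetical stand-ins and are not the column names in rag. It computes a correlation and a variance inflation factor (VIF) by hand, using base R only.

```r
# Simulated stand-in data: the names are hypothetical, not the columns of rag
set.seed(1)
n <- 60
mud <- runif(n, 0, 100)   # a possible confounder, e.g. a sediment measure
# Mat cover (%) constructed to be related to the confounder, kept within 0-100
mat <- pmin(100, pmax(0, 0.6 * mud + rnorm(n, 20, 15)))

# Pairwise correlation between the two explanatory variables
cor_mat_mud <- cor(mat, mud)

# VIF for mat: 1 / (1 - R^2), where R^2 comes from regressing mat
# on the other explanatory variable(s)
r2 <- summary(lm(mat ~ mud))$r.squared
vif_mat <- 1 / (1 - r2)
```

A VIF close to 1 suggests little collinearity. Larger values indicate that the standard error of that variable's coefficient will be inflated when both variables are included in the same model.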

There is no single “right” way to address an observational data set like this. Assumptions of statistical models will always be violated to some extent. However, with care, it is possible to find a justifiable model.

I expect you to apply a range of the skills learned on the course, including data visualisations and model diagnostics, to tease apart the data and come up with the most defensible model, given the real-life challenges that data sets such as these represent. The data set is similar in structure to those used in the exercises on the course and so should be fairly familiar. It should not be too time-consuming to actually run the analyses. The main challenge is to find how the variables are acting in combination.

The first steps will involve bivariate analyses in order to look at relationships between the variables. Note that there are no categorical variables involved, so unless a variable is pre-classified into discrete levels (e.g. High vs Low) or analysed using recursive partitioning (tree models) there is no possibility of analysing interactions. In the discussion, you may wish to suggest a more robust alternative design that takes potential lack of independence into account (see suggestions for topics below).
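The pre-classification mentioned above can be sketched as follows. This is only one possible approach, shown on simulated data. The names d_sim, mat, mud and worms are hypothetical and are not the columns of rag, and splitting at the median is an arbitrary choice made for illustration.

```r
# Simulated stand-in data: the names are hypothetical, not the columns of rag
set.seed(2)
n <- 80
d_sim <- data.frame(mat = runif(n, 0, 100), mud = runif(n, 0, 100))
d_sim$worms <- rpois(n, lambda = exp(2.5 - 0.01 * d_sim$mat))

# Classify a continuous variable into two levels, split at its median
d_sim$mud_class <- cut(d_sim$mud,
                       breaks = c(-Inf, median(d_sim$mud), Inf),
                       labels = c("Low", "High"))

# Allow the slope for mat to differ between the two classes
m_int <- glm(worms ~ mat * mud_class, family = poisson, data = d_sim)
coef_names <- names(coef(m_int))
```

The coefficient for the mat:mud_class term then estimates how much the slope for mat differs between the Low and High classes. Bear in mind that classifying a continuous variable discards information, so the choice of break point needs justifying.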

Models can be built that include all of the variables or subsets of them. Be careful to choose the right family to represent the stochastic variability (ϵ) in the models. There is no guarantee that relationships are linear (in the sense of being represented by straight lines plotted through scatter plots). Controlling for confounding effects is likely to be necessary, and this may influence both the strength and the form of the relationships.
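One way of comparing candidate families for count data is sketched below, again on simulated data with hypothetical names (mat, worms) rather than the rag data. It fits a Poisson model, makes a rough check for overdispersion, and then fits two common alternatives. It is not the only defensible route.

```r
library(MASS)  # provides glm.nb; MASS is a recommended package shipped with R

# Simulated overdispersed counts: the names are hypothetical, not the columns of rag
set.seed(3)
n <- 100
mat <- runif(n, 0, 100)
worms <- rnbinom(n, mu = exp(2.5 - 0.01 * mat), size = 1.5)

# Poisson model: assumes the variance equals the mean
m_pois <- glm(worms ~ mat, family = poisson)

# Rough dispersion check: Pearson chi-square divided by residual degrees of freedom.
# Values well above 1 suggest overdispersion.
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

# Two common alternatives when counts are overdispersed
m_quasi <- glm(worms ~ mat, family = quasipoisson)  # same coefficients, wider standard errors
m_nb <- glm.nb(worms ~ mat)                         # negative binomial model
```

The quasi-Poisson model leaves the coefficient estimates unchanged but scales the standard errors by the square root of the dispersion, whereas the negative binomial model changes the assumed relationship between the variance and the mean.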

Only a small subset of the potential variables is provided, in order to simplify the analysis.

Running the actual analysis should not take long, once the main underlying patterns have been determined. There are no prizes for finding significance. In fact, given the realistic level of variability and the potential for confounding, very few relationships are likely to be significant. Showing this clearly is just as useful as showing significance.

Remember that the key question concerns the effect of the mat on ragworms. The other variables are essentially confounders.
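To see why the confounders matter for this question, the simulated sketch below may help. It does not use the rag data, and the names mat, mud and worms are hypothetical. In the simulation the counts depend only on the confounder, which is itself related to mat cover, so the coefficient for mat changes markedly once the confounder is included in the model.

```r
# Simulated stand-in data: the names are hypothetical, not the columns of rag
set.seed(4)
n <- 200
mud <- runif(n, 0, 100)                                   # confounder
mat <- pmin(100, pmax(0, 0.7 * mud + rnorm(n, 15, 10)))   # mat cover (%) related to the confounder
worms <- rpois(n, lambda = exp(3 - 0.02 * mud))           # counts depend on the confounder only

# Unadjusted model: mat appears to have an effect
m_unadj <- glm(worms ~ mat, family = poisson)

# Adjusted model: the coefficient for mat after controlling for the confounder
m_adj <- glm(worms ~ mat + mud, family = poisson)

b_unadj <- coef(m_unadj)["mat"]
b_adj <- coef(m_adj)["mat"]
```

Comparing b_unadj with b_adj, along with their standard errors from summary(), shows how much of the apparent effect of mat is attributable to the confounder in this artificial example.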

The discussion should include a critique of the design, with the aim of suggesting a follow-up study that could, at least partially, overcome some of the statistical issues. Questions may include.

The code below will load the data and form a data frame called rag.

library(aqm)
data("rag")

For simplicity, you might work with a data frame called d.

d <- rag