Peltophorum dubium: Climate niche model diagnosis

Distribution

Convex hulls have been suggested as a means of estimating a species' extent of occurrence for red listing purposes. However, convex hulls tend to suffer from artefacts due to outlying observations. They may also include large areas of completely unsuitable habitat, including water.
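
The sensitivity of a convex hull to a single outlying observation can be illustrated with a minimal sketch. The coordinates below are hypothetical values in projected kilometres, not data from this report, and the hull routine (Andrew's monotone chain) is a standard textbook algorithm rather than the method used to produce the map.

```python
def convex_hull(points):
    """Andrew's monotone chain. Returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a left (counter-clockwise) turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula. Area is in squared coordinate units."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical records: a compact cluster plus one outlier 90 km to the east.
cluster = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5)]
outlier = (100, 5)

area_cluster = polygon_area(convex_hull(cluster))
area_with_outlier = polygon_area(convex_hull(cluster + [outlier]))
print(area_cluster, area_with_outlier)
```

In this toy case one extra point increases the hull area more than fivefold, which is the kind of artefact that motivates the buffered alternative described next.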

A concave hull is likely to be a much better way to measure the species' absolute range size. This model can be fitted to the data points by extending a buffer (100 km) around them in order to unify all those falling within the range and then dissolving the buffer in order to remove it. This leaves polygons that fit the shape of the distribution and do not include non-terrestrial areas.

The analysis that has been run here only uses points within the Mesoamerican region. If the figure below shows points extending beyond this region, the species' niche will only be partly typified. In these cases the alternative analyses run at the continental scale should be consulted.

Background points for modelling have been extracted from the terrestrial area within a 100 km buffer around the convex hull.

plot of chunk range_map

Climate

The figure below shows the standard Walter and Lieth climate diagram for the mean values of precipitation and temperature extracted from the species presence points.

plot of chunk unnamed-chunk-2

The Walter and Lieth diagram assumes that the growing season occurs when monthly rainfall is over 100 mm.

Soil water and NDVI

A more refined method is to extract the values from a bucket model that keeps track of input to the soil profile through precipitation and reductions in soil moisture through evapotranspiration over the course of the year. This can be compared to changes in NDVI at the collection points.
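
The idea of a bucket model can be sketched in a few lines. This is an illustrative single-layer version under stated assumptions, not the model used for this report: the storage capacity of 150 mm, the rule that actual evapotranspiration equals potential evapotranspiration limited by the available water, and the monthly input values are all hypothetical.

```python
def bucket_model(precip, pet, capacity=150.0, spin_up_years=5):
    """Monthly single-layer soil water balance (illustrative assumptions only).

    precip, pet: 12 monthly totals in mm (precipitation, potential evapotranspiration).
    capacity: assumed maximum soil water storage in mm.
    Returns (soil_water, actual_et) for the final simulated year.
    """
    water = capacity
    soil, aet = [], []
    # Repeat the same year several times so the result does not depend
    # on the arbitrary starting value of the store.
    for _ in range(spin_up_years + 1):
        soil, aet = [], []
        for p, e in zip(precip, pet):
            water = min(capacity, water + p)  # surplus above capacity is lost
            loss = min(e, water)              # cannot evaporate more than is stored
            water -= loss
            soil.append(water)
            aet.append(loss)
    return soil, aet


# Hypothetical climate: six wet months followed by six dry months.
precip = [200] * 6 + [0] * 6
pet = [100] * 12
soil_water, actual_et = bucket_model(precip, pet)
print(soil_water)
print(sum(actual_et))
```

The annual sum of `actual_et` corresponds in spirit to the "total annual actual evapotranspiration" variable reported in the summary table, although the values there come from the original analysis and not from this sketch.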

In both cases the data are normalised to take values between zero and one. Soil moisture may fall below its maximum values without having an effect on NDVI.

An index of seasonality has been calculated as the percentage reduction in the value at the lowest point of the curve.
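
A minimal sketch of this index, assuming that normalisation means dividing each monthly value by the annual maximum, so that the peak of the curve equals one. The function names and example values are hypothetical.

```python
def normalise(values):
    """Scale a monthly series by its maximum (assumed normalisation)."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("the series needs a positive maximum")
    return [v / peak for v in values]


def seasonality_index(values):
    """Percentage reduction from the peak at the lowest point of the curve."""
    scaled = normalise(values)
    return 100.0 * (1.0 - min(scaled))


# Hypothetical monthly soil water values in mm.
monthly_soil_water = [80, 100, 100, 90, 70, 50, 30, 25, 30, 45, 60, 75]
print(seasonality_index(monthly_soil_water))
```

A series that never drops below its maximum gives an index of zero, and a series that falls to zero at some point in the year gives an index of 100.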

plot of chunk seasonal

In some cases median NDVI will remain fairly constant, even when the balance model shows that soil water content (SWC) is lowered for part of the year. Provided that SWC is above 50% of its maximum level, the vegetation would not experience a great deal of hydric stress. NDVI values can be highly variable as they are affected by land use. The general trend should be comparable to that shown by modelled soil moisture.

plot of chunk compare

Summary of the variable values at the collection points.

The table below shows the range of values measured for key climatic variables at the collection points. A wide range of values may suggest some erroneous points. Distribution modelling is not sensitive to the presence of a few outliers, but the results may be distorted if the data contain many erroneous points.

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Longitude	-92.00	-89.80	-87.60	-89.10	-87.60	-87.60
Latitude	15.70	18.50	21.20	19.40	21.20	21.20
Elevation	11.00	11.00	11.00	246.00	363.00	715.00
Annual precipitation	1150.00	1150.00	1150.00	1320.00	1410.00	1670.00
Mean annual temperature	24.20	24.70	25.10	24.80	25.10	25.10
Annual temperature range	16.90	16.90	16.90	17.80	18.20	19.60
Total annual actual evapotranspiration	1110.00	1110.00	1110.00	1110.00	1110.00	1110.00
Minimum proportion of available soil moisture	0.25	0.34	0.44	0.37	0.44	0.44

plot of chunk unnamed-chunk-3

Niche space with relation to annual precipitation and mean annual temperature.

It is easy to fit convincing models that apparently predict observed species distributions closely using machine learning algorithms such as RandomForest or Maxent. However, on closer inspection the response surfaces that are being used for prediction are often lacking in biological realism. This occurs due to overfitting of models to data derived from a partial exploration of a species' abiotic niche. This effect may be attributable to insufficient data leading to observed disjunction in the species' range when in reality the species is found across a wider area. In other cases the species' distribution may be limited by barriers to dispersal or stochastic effects that led to aggregation in some areas. Deforestation may also have removed habitat from the centre of the species' range. Any of these effects can be spotted as multimodality of the niche space.

The following diagrams show kernel densities on two synthetic climate axes (total annual rainfall and mean annual temperature). If there are signs of multimodality this may indicate that the species has not fully explored its climate niche, or that there are disjunct populations with differing characteristics. The method will not show clear results for species with few collection points.
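
Checking for multimodality can be sketched for a single climate axis. The example below is a one-dimensional simplification under stated assumptions: it uses a Gaussian kernel with a Silverman-type rule-of-thumb bandwidth, counts local maxima on a regular grid, and runs on hypothetical rainfall values. The diagrams in this report may have been produced with a different estimator.

```python
import math
import statistics


def rule_of_thumb_bandwidth(x):
    """Silverman-type bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    if spread <= 0:
        raise ValueError("all values are identical, no density can be estimated")
    return 0.9 * spread * len(x) ** -0.2


def gaussian_kde(x, grid, bw):
    """Evaluate a Gaussian kernel density estimate at each grid value."""
    norm = 1.0 / (len(x) * bw * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - xi) / bw) ** 2) for xi in x)
            for g in grid]


def count_modes(x, n_grid=256):
    """Count strict local maxima of the smoothed density."""
    bw = rule_of_thumb_bandwidth(x)
    lo, hi = min(x) - 3 * bw, max(x) + 3 * bw
    grid = [lo + i * (hi - lo) / (n_grid - 1) for i in range(n_grid)]
    d = gaussian_kde(x, grid, bw)
    return sum(1 for i in range(1, n_grid - 1) if d[i - 1] < d[i] > d[i + 1])


# Hypothetical annual rainfall (mm) at collection points.
one_group = [1000, 1100, 1150, 1200, 1250, 1300, 1400]
two_groups = [780, 790, 800, 810, 820, 1980, 1990, 2000, 2010, 2020]
print(count_modes(one_group), count_modes(two_groups))
```

Because the bandwidth grows as the number of points shrinks, small samples are smoothed heavily, which is consistent with the caution above about species with few collection points.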

plot of chunk kernel

Spatial clustering

The same analysis can be run to look at spatial clustering. The kernel densities are smoothed, so will only suggest multimodality if the points are very highly clustered.

plot of chunk spatial_kernel

Finding spatial clusters

The significance of any spatial clusters can be checked using the silhouette width method. The width is calculated for values of k between 2 and 5. If any are higher than 0.52 the analysis will produce a diagram showing the clusters.
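
The average silhouette width itself is straightforward to compute once a clustering is given. The sketch below assumes Euclidean distances between coordinate pairs and takes the cluster labels as input. The clustering step is omitted, and the coordinates and function name are hypothetical. The guard on the number of clusters mirrors the constraint that k must lie between 2 and n - 1, which cannot be met when there are very few distinct points.

```python
import math


def mean_silhouette_width(points, labels):
    """Average silhouette width for a given clustering (Euclidean distance)."""
    n = len(points)
    clusters = sorted(set(labels))
    if not 2 <= len(clusters) <= n - 1:
        raise ValueError("number of clusters must be between 2 and n - 1")

    def mean_distance(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # usual convention for a singleton cluster
            continue
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(                   # separation: nearest other cluster
            mean_distance(i, [j for j in range(n) if labels[j] == c])
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n


# Hypothetical coordinates forming two tight, well separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette_width(points, [0, 0, 1, 1])
poor = mean_silhouette_width(points, [0, 1, 0, 1])
print(good > 0.52, poor > 0.52)
```

Only the first labelling exceeds the 0.52 threshold used here. The second, which mixes the two groups, gives a negative width.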

## Error: Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2

## Error: error in evaluating the argument 'y' in selecting a method for function 'plot': Error: object 'kscores' not found

## Error: object 'kscores' not found

## Error: object 'kscores' not found

## [1] ""

Anosim

If there is evidence that the points fall into at least 2 groups, but fewer than 6, we can look at whether there are significant differences in variability between and within groups in the climatic conditions at the sites using Anosim. This is a sensitive test, as would be MANOVA, so there will often be significant differences. They should only be interpreted as important if R is much larger than 0.3.
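
The R statistic compares the mean rank of between-group dissimilarities with the mean rank of within-group dissimilarities, R = (r_B - r_W) / (M / 2), where M = n(n - 1) / 2 is the number of site pairs. The sketch below computes R only; the permutation test that gives the significance level is omitted. It assumes Euclidean distances on variables that have already been put on a common scale, and the example values are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(sites, groups):
    """ANOSIM R statistic from site-by-variable rows and group labels."""
    pairs = list(combinations(range(len(sites)), 2))
    ranks = average_ranks([math.dist(sites[i], sites[j]) for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    if not within or not between:
        raise ValueError("need both within-group and between-group pairs")
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


# Hypothetical scaled climate values for four sites in two groups.
sites = [(0, 0), (0, 1), (10, 0), (10, 3)]
print(anosim_r(sites, ["a", "a", "b", "b"]))
```

R approaches 1 when every between-group dissimilarity is larger than every within-group dissimilarity, and lies near 0 when group membership tells us nothing about climatic similarity.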

## Error: object 'clustering' not found

GAM model using simple environmental variables

A simple model uses mean temperature, temperature range and annual precipitation. The data are binary responses, so a GAM of the binomial family should be used. However, as there is no interest in evaluating the statistical significance of a model fit to pseudoabsence data (the sample size is arbitrary), a Gaussian model is used in order to simplify interpretation. Tests show that predictive output from the two models is usually indistinguishable.

plot of chunk unnamed-chunk-6

The output should consist of monotonic or unimodal responses. If these are not observed it indicates potential problems with the model.

ROC analysis using random subset of points

One of the reasons for high AUC values in the literature is the use of random subsets of data taken from within the species' known range. Model evaluation using this method always erroneously suggests good discrimination due to spatial autocorrelation. This effect is not removed by trying to control for spatial autocorrelation by taking fewer points. All available points should be used when fitting models. Instead, more rigorous tests should be used to evaluate model performance.

Mapping the results

The default colouring in R may de-emphasise differences in some cases.

plot of chunk unnamed-chunk-8

Evaluation using a spatial split

Ideally, truly independent data should be used for model evaluation. One simple way of testing a model without independent data is to split the values spatially. A model using only the Eastern side of a species' range is used to predict the Western side, and vice versa. This usually reveals more weaknesses in the model's predictive ability. AUC values over 0.8 show that the model is very useful as a tool for prediction. Values between 0.6 and 0.8 suggest that the model is using climatic variables to narrow the species' range to some degree. If values are below this then it may be better to use a purely spatial model to estimate the species' distribution.
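
The mechanics of a spatial split and of the AUC calculation can be sketched as follows. The split at the median longitude, the record layout with a "lon" key, and the scores are assumptions made for illustration; model fitting is left out. AUC is computed as the proportion of presence and background pairs in which the presence point receives the higher score, counting ties as one half.

```python
import statistics


def spatial_split(records):
    """Split records into western and eastern halves at the median longitude."""
    cut = statistics.median(r["lon"] for r in records)
    west = [r for r in records if r["lon"] < cut]
    east = [r for r in records if r["lon"] >= cut]
    return west, east


def auc(presence_scores, background_scores):
    """Probability that a presence point outscores a background point."""
    if not presence_scores or not background_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def interpret_auc(value):
    """Apply the thresholds given in the text above."""
    if value > 0.8:
        return "very useful as a tool for prediction"
    if value >= 0.6:
        return "climate narrows the range to some degree"
    return "consider a purely spatial model"


# Hypothetical records and hypothetical predicted suitability scores.
records = [{"lon": -92.0}, {"lon": -90.0}, {"lon": -88.0}, {"lon": -86.0}]
west, east = spatial_split(records)
value = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1])
print(len(west), len(east), round(value, 3), interpret_auc(value))
```

In a full evaluation the model would be fitted to the records in `east`, scored on the presence and background points in `west`, and then the roles would be swapped.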

plot of chunk unnamed-chunk-9

Bucket model GAM.

A GAM using temperature range, mean temperature and the annual soil moisture dynamic as input may be more reliable, as the “bucket model” of soil moisture changes should represent patterns of hydric stress throughout the year. Temperature effects can be simplified to mean temperature and the range.

ROC analysis using random data split.

Evaluation using a random split

plot of chunk unnamed-chunk-10

Evaluation using a spatial split

plot of chunk unnamed-chunk-11

Random Forest

In most cases the more complex machine learning algorithms show better discrimination than simpler models when tested against a random subset of the data. This is to be expected given that they are designed to do this job well. However ecological theory suggests that they are unlikely to be reliable predictors of unseen areas of a species range due to overfitting to a spatially defined subset of autocorrelated data. So in most cases the predictions produced are either no better than those from simpler models or worse. In order to predict a species distribution spatial elements should normally be taken into account explicitly.

Random forest evaluation using a random subset

plot of chunk unnamed-chunk-13

Random forest evaluation using a spatial split

plot of chunk unnamed-chunk-14

Conclusions for Peltophorum dubium

The diagnostic analysis is likely to have revealed some issues regarding the reliability of the model. These issues must be taken into account when applying model results to issues of conservation concern. The model is best regarded as a guide that may indicate the potential climatic limits to a species' distribution. The realised distribution may be determined by other factors including biotic interactions and limitation to movement.