Chapter 17 Practical advice on using regression

17.1 Introduction

Do not attempt to run through any of the code shown here yourself. This section is an illustration of the concepts involved in running and interpreting a regression. The code, as shown, hides some of the chunks that generated the illustrative data. So the code will not run without data.

The previous handout goes through the theory behind regression in some depth. However it not strictly necessary to understand every aspect underlying the theory of fitting a regression line in order to use the technique. There are three related reasons to use regression. The same analysis is used in all three cases, but different aspects of the output are emphasised in order to address the scientific question.

  1. To calibrate some form of predictive model. This is most effective when a large proportion of the variance is explained by the regression line. For this to be effective we require reasonably high values of R² and emphasis falls on the form of the model and confidence intervals for the key parameter, which is the slope of the fitted line. This is a measure of how much the dependent variable (plotted on the y axis) changes with respect to each unit change in the independent variable (plotted on the x axis).

  2. To investigate the proportion of variance “explained” by the fitted model. The R² value provides a measure of the variance explained. In the context of explanation this is the quantity that is emphasised. The R² value is a measure of how much scatter there is around the fitted relationship.

  3. To test whether there is any evidence at all of a relationship. In this context the statistical significance provided by the p-value is the useful part of the output, in the context of rejecting the null hypothesis if no relationship at all (slope=0). If the sample size is small the evidence for a relationship will be weaker than with a large sample size. Low R² values (i.e. high levels of scatter) also make it more difficult to detect any evidence of a trend.

17.2 Predictive modelling: Calibration

If everything goes well you could be looking a very clear, linear trend with narrow confidence intervals. This may occur with morphological data where the width of some feature is closely related to the length, or in the context of some calibration experiment.

17.2.1 Plotting the model

Assuming you have a dataframe with a variable called x and another called y.

g0<-ggplot(d,aes(x=x,y=y))
g0 + geom_point() +geom_smooth(method="lm")

17.2.2 Summarising the model

The model is fitted and summarised using this code.

mod<-lm(data=d,y~x)
summary(mod)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.495  -7.885  -1.291   7.528  22.340 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 102.34238    2.32948   43.93   <2e-16 ***
## x             1.94754    0.04005   48.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.56 on 98 degrees of freedom
## Multiple R-squared:  0.9602, Adjusted R-squared:  0.9598 
## F-statistic:  2365 on 1 and 98 DF,  p-value: < 2.2e-16

You should be able to look at this output and write a line that looks like this.

The estimated regression equation is y= 102.34 + 1.95x, R² = 0.96 P-value < 0.001

17.2.3 Diagnostic checks

The performance library provides a simple way of running visual diagnostic checks. (Lüdecke et al. 2021).

performance::check_model(mod)

The figure provides some textual advice on what to look for. The assumptions do not all need to be met precisely. When sample sizes are small it may be difficult to spot any violations of the assumptions as the figures will be not very well formed. It may be particularly hard to see if the distribution of the residuals is approximately normal. The QQplot may be the best tool for this.

performance::check_model(mod, check="qq")

Note that although the dots do not fall cleanly along the line they do fall within the confidence intervals. No major violation here. Also although the assumption of normally distributed residuals is a formal necessity for regression, minor violations are generally unimportant.

Do remember that the assumption of normality applies only to the residuals of the model,not to the data themselves. The rationale for looking at the distribution of the data is to check whether any transformations are required to give the data better statistical properties, not to check the assumptions used when fitting the statistical model itself (Zuur, Ieno, and Elphick 2009)

17.2.4 Advice on writing up

In this situation you should concentrate your write up on interpreting the implications of the slope of the line. You have found a very clear relationship between x and Y. State this. Mention that diagnostic checks showed no violations of the assumptions needed for linear regression. Include the (narrow) confidence intervals for the parameters of the model. Discuss how the quantitative relationship may be used in research.

Remember that the slope represents the change in the value of y for every unit change in x. This can be an important parameter and it does have confidence intervals.

confint(mod)
##                 2.5 %    97.5 %
## (Intercept) 97.719605 106.96516
## x            1.868064   2.02701

State the R² value and also make sure that you have clearly stated the sample size, defined the sample space and the methods used to measure the variables in the methods section. You should also state that the p-value was <0.001 but you don’t need to say very much about significance, as you are not interested in the probability that the data may be the result of chance when the scatterplot shows that the relationship is so clear cut. You can mention that diagnostic checks were conducted using the performance package (Lüdecke et al. 2021) in R (R Core Team 2020a). The important part of your discussion should revolve around the fitted line itself. You may suspect that the functional form for the relationship is not in fact a straight line. In future studies you might use nonlinear regression to model a response, but at this stage you do not yet know how to fit such models formally, so you should just discuss whether a straight line is appropriate. If the slope has a clear interpretation and scientific relevance then discuss the value, with mention of the confidence interval.

17.3 Explanatory modelling:

In many situations the scatter around the line may be much greater that the previous case. This will lead to much less of the variance being explained by the underlying relationship. However there is still a clear relationship visible, providing the sample size is large enough.

The estimated regression equation is Y= 121.9 + 1.59X, R² = 0.18 P-value < 0.001

Note that there are no major violations of the assumptions for running a regresssion. The issue is that the relationship is not very strong. There is comparatively small signal to noise ratio as shown by the \(R^2\) value.

17.3.1 Advice on writing up

In this situation the emphasis falls more on the R² value, which here is comparatively low. You should state the p-value in order to confirm that the relationship is in fact significant. You should think about the reasons behind the scatter. It may be that the measurements themselves included a large amount of random error. However in ecological settings the error is more likely to be either simply unexplainable random noise, or some form of process error. The random noise around the line may in fact be associated with additional relationships with other variables which are not included in the simple model. This can be handled by multiple regression which uses more than one variable for prediction, and by a range of other techniques including modern machine learning. However these are more advanced, so you should concentrate on discussing how the model could be improved rather than finding ways to improve it at this stage.

It is always very important to discuss the sample size used in the study in all cases. Small samples may cause type 2 error. Type 2 error is a failure to reject the null hypothesis of no relationship even though there is some relationship there. It occurs when the sample size is too small to detect a relationship, even though one does in fact exist. Notice that we always fail to reject a null hypothesis. We do not state that there is no relationship. Instead we state that the study failed to provide evidence of a relationship.

17.4 Testing for a relationship

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## x          1  67777   67777   4.798 0.05988 .
## Residuals  8 113009   14126                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A smaller sample taken from the same sample space as the previous data set now does lead to an inconclusive study.

The estimated regression equation is Y= 125.83 + 2.87X, R² = 0.37 P-value = 0.06

Note that there are no major violations of the assumptions for running a regression. The issue is that the relationship is not clear.

17.4.1 Advice on writing up

In the case of a p-value over 0.05 you may want to note that there may be some indication of a relationship but that the results are statistically inconclusive. There is insufficient evidence provided by this particular study to confirm any relationship. The write up should emphasis the p-value and the study size. Include the F ratio 4.8 and the degrees of freedom of 1 on 8 . You should make suggestions regarding how future studies may be improved.

17.5 Diagnostics show assumptions not met

There are many ways in which the assumptions of regression are not met. This is why diagnostics are needed. You should look through the handout for the last class in order to understand this more deeply.

What do we do when assumptions are not met?

There is no simple answer to this. A large number of alternative techniques have been designed to analyse data that do not meet the assumptions used in regression. In this course we cannot look at all of them. So we need some simple rules that will help to make the best of the data without introducing a large number of new concepts.

17.6 Skewed independent variable

Sometimes the independent variable is heavily right skewed.

The estimated regression equation is Y= 115.04 + 2.01X, R² = 0.46 P-value = 0.001

Although you can still run a regression on data such as these, the diagnostic plots will show that many of the assumptions are not met.

A common solution is to transform the independent variable by taking either the square root or the logarithm and then re-run the analysis.

d$logx<-log10(x)

Now the regression looks somewhat better and diagnostic plots will confirm that more assumptions are met.

The estimated regression equation is Y= 101.51 + 42.9log(X), R² = 0.54 P-value < 0.001

17.6.1 Advice on writing up

After a transformation you can write up the results in the same way that you would for any other regression model. However be careful how you interpret the slope and intercept given that you are now using a logarithmic scale.

17.7 Spearman’s rank correlation

A very simple non-parametric analysis is very commonly used when regression assumptions are not met. This is Spearman’s rank correlation.

Although Spearman’s rank correlation can occasionally be useful, it is not a parametric procedure, so would be only used if nothing else is available. It can be very useful as an input to some multivariate models. It can also be the “go to” technique for very small data sets when the only question of interest is whether some relationship can be detected. The best feature of taking ranks is that it down weights data points with high leverage (outliers at the extremes)

17.7.1 Taking ranks

Spearman’s correlation essentially takes the ranks of both of the variables and uses these in a regression. However the procedure does not provide an equation. There are some rather limited circumstances when an actual regression model based on ranks would be useful. This is essentially a non parametric procedure and should only be used if nothing else is possible (Johnson 1995)

d$rankx<-rank(d$x)
d$ranky<-rank(y)

R² = 0.82 P-value < 0.001

17.7.2 Advice on writing up

Although the relationship between the ranked data can be shown on a scatter-plot with a fitted line, the slope and intercept of such a line usually has little meaning. You therefore just emphasise the R² value after noting that it refers to Spearmans rank correlation, not regression, and the p-value. A simple correlation test in R provides the p-value and the correlation coefficient (rho)

cor.test(d$y,d$x,method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  d$y and d$x
## S = 1942, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9067467

So the value of rho is 0.91 and the R squared is 0.82

A better approach may be to use a non linear smoother. See below for an example of how to do this.

17.8 Non linear response

In many data sets the response may not follow a strictly linear trend (as above). Take for example the yearly mean temperature data we looked at previously.

You can fit a linear model to these data and report the results.

## 
## Call:
## lm(formula = temp ~ Year, data = TMean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30245 -0.36770  0.00061  0.38383  1.19188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.390378   3.436410  -2.442   0.0164 *  
## Year         0.008992   0.001752   5.132 1.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 101 degrees of freedom
## Multiple R-squared:  0.2068, Adjusted R-squared:  0.199 
## F-statistic: 26.34 on 1 and 101 DF,  p-value: 1.394e-06

The estimated regression equation is y= -8.39 + 0.009x, R² = 0.21 P-value < 0.001

Note that the slope of the line corresponds to a temperature increase of 0.9 degrees C per century.

However if you look carefully at the scatter-plot and model diagnostics you will see that many of the points fall in clusters on either side of the line. This known as serial autocorrelation in the residuals. It is an indication that the model may be ill formed, in other words that a straight line may not be the best way to describe the underlying relationship. A diagnostic check helps to establish this.

One solution may be to use a non-linear smoother (note .. this is not the same as a non-linear model for a functional response). There are many forms of smoothers available. The theory behind their use can be quite complex, but in general terms they will tend to follow the empirical pattern in the response.

library(mgcv)
g0<-ggplot(TMean,aes(x=Year,y=temp))
g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x)) + geom_smooth(method ="lm", colour="red", se=FALSE)

The smoother used here is taken from the mgcv package and is a generalised additive model (GAM). Note that the R code includes se=FALSE which removes the confidence interval for the linear model. This is used here to avoid clutter when the two models are superimposed, but if only the linear model were used the confidence interval would be added as usual. However it would not be strictly valid due to the important violation of the assumption.

Smoothed models do not use an equation that can be written down. Therefore there is no slope to report. They do produce an R squared value. It is also possible to extract the residuals from the model in order to look at the differences between observations and the fitted smmoth trend.

mod<-gam(data=TMean,temp~s(Year))
summary(mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## temp ~ s(Year)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.243      0.047   196.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## s(Year) 5.885  7.044 8.14  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.348   Deviance explained = 38.6%
## GCV = 0.24382  Scale est. = 0.22753   n = 103

17.8.1 Advice on writing up

In this case you need to discuss the pattern of the response as shown in the scatter-plot with fitted spline. The significance of the overall model can be quoted, but the emphasis falls on the shape of the response and “local” features such a break points and changes in slope.

You might note that there was a very clear linear trend between 1975 and 2010. There was a short period of slight cooling between 1940 and 1975. In a data set with very interpretable single values such as this one you may want to mention extremely cool and warm years explicitly.

It is possible to cautiously report the linear trend derived from a regression, but with the caveat that the model does not provide a good fit to the data throughout the range.

References

Johnson, DH. 1995. “Stastical Sirens - the Allure of Non Parametrics.” ECOLOGY 76 (6): 1998–2000. https://doi.org/10.2307/1940733.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.
R Core Team. 2020a. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2009. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14. https://doi.org/10.1111/j.2041-210x.2009.00001.x.