When analysing time series it can be very important to take serial autocorrelation into account. However, if serial autocorrelation is not particularly strong, simple regression analysis can provide reliable insight into trends.
The following data are taken from the UK Meteorological Office. They represent monthly precipitation totals and mean monthly temperatures averaged over climate stations in England. Is there any evidence of trends?
library(ggplot2)
library(dplyr)
library(reshape2)
# Monthly temperature and precipitation series for England
Temp<-read.csv("https://tinyurl.com/aqm-data/Temp.csv")
Prec<-read.csv("https://tinyurl.com/aqm-data/Prec.csv")
library(DT)
datatable(Temp)
Notice that the data are in a “wide” format, with one column per month. This is often the way climate data are presented but, as you know by now, it is not a classic data frame format. We’ll see how to deal with this later.
datatable(Prec)
Choose one of the months for analysis. Start a new markdown document and carefully copy and paste the relevant chunks from this document into it. Now, using the code from the regression report, design a regression analysis to look at linear trends in temperature and precipitation over the last 100 years. A sketch of a possible starting point is shown below.
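As a minimal sketch for a single month, assuming the wide Temp data frame contains Year and JAN columns (check names(Temp) first):
jan_mod<-lm(JAN~Year,data=Temp) # linear trend in January temperature
summary(jan_mod)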
There are now many different ways in R to reshape data frames into a consistent format. The newer tidyr and dplyr packages are becoming increasingly popular. A simple way of stacking many columns into two (variable and value) is provided by the reshape2 package. The melt function takes the original data frame along with arguments defining the “id” variables and the “measurement” variables.
# Stack the twelve monthly columns into variable/value pairs, keeping Year as the id
Temp2<-melt(Temp[,3:15],id="Year")
Prec2<-melt(Prec[,3:15],id="Year")
str(Temp2)
## 'data.frame': 1236 obs. of 3 variables:
## $ Year : int 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 ...
## $ variable: Factor w/ 12 levels "JAN","FEB","MAR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ value : num 3.2 3.4 3.3 3.9 2.9 3.5 6.9 1.2 3.3 2.5 ...
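The same stacking can be done with tidyr; a sketch, assuming a version recent enough to provide pivot_longer:
library(tidyr)
# Stack every column except Year; equivalent to the melt call above
Temp2_tidy<-pivot_longer(Temp[,3:15],cols=-Year,names_to="variable",values_to="value")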
All the months can now be plotted on a single figure using ggplot2.
g0<-ggplot(Temp2,aes(x=Year,y=value))
g0+geom_point()+geom_smooth(method="lm")+facet_wrap("variable")
# Annual mean temperature and annual precipitation total
Temp2 %>% group_by(Year) %>% summarise(mean=mean(value)) -> TMean
Prec2 %>% group_by(Year) %>% summarise(sum=sum(value)) -> PTot
Now look at the trends for the yearly data. You may want to subset the data to look only at the last fifty years.
TMean67<-subset(TMean,TMean$Year>1967)
PTot67<-subset(PTot,PTot$Year>1967)
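A minimal sketch of a trend model for the recent annual data (the column names mean and Year come from the summarise call above), together with a quick check on the serial autocorrelation issue raised at the start:
recent_mod<-lm(mean~Year,data=TMean67) # linear trend in annual mean temperature since 1967
summary(recent_mod)
acf(residuals(recent_mod)) # weak residual autocorrelation would support using simple regression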
Load the mussels data and analyse them using linear regression.
mussels<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
Can you work out how to run the regression for each site?
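One approach is to split the data frame by site and fit a model to each piece. The variable names used below (Site, Lshell and BTVolume) are assumptions; check names(mussels) before running:
site_mods<-lapply(split(mussels,mussels$Site),function(df) lm(BTVolume~Lshell,data=df)) # one model per site
lapply(site_mods,summary)
Fitting a model is, however, only half the job. Anscombe’s famous quartet, a data set built into R, shows why the fit always needs checking.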
data(anscombe)
str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
d<-anscombe
We can fit a separate model to each pair of variables: y1 against x1, y2 against x2, and so on.
mod1<-lm(y1~x1,data=d)
mod2<-lm(y2~x2,data=d)
mod3<-lm(y3~x3,data=d)
mod4<-lm(y4~x4,data=d)
The summaries of the models look very similar.
summary(mod1)
##
## Call:
## lm(formula = y1 ~ x1, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
summary(mod2)
##
## Call:
## lm(formula = y2 ~ x2, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
summary(mod3)
##
## Call:
## lm(formula = y3 ~ x3, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
summary(mod4)
##
## Call:
## lm(formula = y4 ~ x4, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
However, only one of these data sets is really suitable for linear regression. Run the appropriate diagnostics to find out which one.
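As a sketch, the standard diagnostic plots for the first model can be produced with base R; repeat for mod2 to mod4:
par(mfrow=c(2,2)) # arrange the four diagnostic plots in a grid
plot(mod1) # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow=c(1,1)) # reset the plotting layout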