library(ggplot2)
library(dplyr)
library(mgcv)
library(DT)
theme_set(theme_bw())
dt<-function(x) DT::datatable(x, 
    filter = "top",                         
    extensions = c('Buttons'), options = list(
    dom = 'Blfrtip',
    buttons = c('copy', 'csv', 'excel'), colReorder = TRUE
  ))

Introduction

This markdown document can be adapted to run a regression analysis on any pair of variables in the Arne data set. You can also use it to run analyses on sub sets of the data.

Loading the data

d<-read.csv("/home/rstudio/webpages/Quantitative_and_Spatial_Analysis/Quantitative_and_Spatial_2018/Arne_project/gis_data/study_points.csv")

Subsetting

If you want to run an analysis for a single site (factor level) at a time, you can get a dataframe for just factor level.

sA<-subset(d,d$site == "A")
sA<-droplevels(sA)
sB<-subset(d,d$site == "B")
sB<-droplevels(sB)
lasty<-subset(d,d$site == "last_year")

Choose your data and variables

Here you can choose whether you are working with all the data (d) or with one of the subsets. Just remove the options you don’t want leaving the one you are working with.

df<-sA
df<-sB
df <- lasty
df<-d ## By default we will use all the data

Choose which is your x and y variable

This example will look at pine density against insolation. You can change the variables in this chunk and the lablels when running any other analyses.

df$x<-df$sol
df$y<-df$pine_density
xlabel<-"Daily insolation (kWh / m²)"
ylabel <-"Density of pine saplings (n/m²)"

Data summaries for individual variables

Change the name of the variable to match a numerical variable in your own data set. The command removes NAs just in case you have them

summary(df$y,na.rm=TRUE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.557   1.273   1.417   2.308   4.138

Confidence interval for the mean of y

confint(lm(df$y~1))
##                2.5 %   97.5 %
## (Intercept) 1.268132 1.566388

Simple boxplot of one variable

Useful for your own quick visualisation.

boxplot(df$y)

histogram of one variable

This uses ggplot. Change the binwidth if you want to use this.

g0<-ggplot(d,aes(x=df$y))
g0+geom_histogram(color="grey",binwidth = 0.5)+xlab(ylabel)

Regression

In this data set there are two numerical variables. So we can run a linear regresion.

Scatterplot without fitted line

g0<-ggplot(df,aes(x=x,y=y))
g0+geom_point() +xlab(xlabel) +ylab(ylabel)

Scatterplot with fitted line and labels

g0<-ggplot(df,aes(x=x,y=y))
g1<-g0+geom_point() + geom_smooth(method="lm") 
g1 + xlab(xlabel) + ylab(ylabel)

Fitting a model

Change the names of the variables in the first line.

mod<-lm(data= df, y~x)

Model summary

summary(mod)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44012 -0.86329 -0.09709  0.86738  2.65232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2.1513     0.7383   2.914  0.00408 **
## x            -0.4150     0.4153  -0.999  0.31910   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9642 on 161 degrees of freedom
## Multiple R-squared:  0.006165,   Adjusted R-squared:  -7.39e-06 
## F-statistic: 0.9988 on 1 and 161 DF,  p-value: 0.3191

Model anova table

anova(mod)
## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq Mean Sq F value Pr(>F)
## x           1   0.928 0.92850  0.9988 0.3191
## Residuals 161 149.667 0.92961

Confidence intervals for the model parameters

confint(mod)
##                  2.5 %    97.5 %
## (Intercept)  0.6932133 3.6093523
## x           -1.2351364 0.4050692

Model diagnostics

Look at the regression handout to understand these plots.

plot(mod,which=1)

plot(mod,which=2)

plot(mod,which=3)

plot(mod,which=4)

plot(mod,which=5)

Spearman’s rank correlation

Used if all else fails. Not needed with these data, but included for reference.

g0<-ggplot(df,aes(x=rank(x),y=rank(y)))
g0+geom_point() + geom_smooth(method="lm") +xlab(paste("Rank",xlabel)) +ylab(paste("Rank",ylabel))

cor.test(df$x,df$y,method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  df$x and df$y
## S = 813720, p-value = 0.1051
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1274083

Fitting a spline

Only use if you suspect that the relationship is not well described by a straight line.

library(mgcv)

g0<-ggplot(df,aes(x=x,y=y))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab(xlabel) + ylab(ylabel)

In this case the line is the same as the linear model. Get a summary using this code.

mod<-gam(data=df, y~s(x))
summary(mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.41726    0.07507   18.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value
## s(x) 1.749  2.188 1.245   0.284
## 
## R-sq.(adj) =  0.0119   Deviance explained = 2.26%
## GCV = 0.93426  Scale est. = 0.9185    n = 163

If you do use this model remember that its only needed if you can’t use linear regression. Report the ajusted R squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). If the spline is significant you must include the figure in your report, as that is the only way to show the shape of the response.