library(ggplot2)
library(dplyr)
library(mgcv)
library(DT)
theme_set(theme_bw())
# Helper function to display a data frame as an interactive table
# with copy/csv/excel buttons and reorderable columns
dt <- function(x) DT::datatable(x,
  filter = "top",
  extensions = c('Buttons', 'ColReorder'),  # ColReorder extension is needed for colReorder = TRUE
  options = list(
    dom = 'Blfrtip',
    buttons = c('copy', 'csv', 'excel'),
    colReorder = TRUE
  ))
This markdown document can be adapted to run a regression analysis on any pair of variables in the Arne data set. You can also use it to run analyses on subsets of the data.
d<-read.csv("/home/rstudio/webpages/Quantitative_and_Spatial_Analysis/Quantitative_and_Spatial_2018/Arne_project/gis_data/study_points.csv")
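Before doing anything else it is worth checking that the file has loaded as expected. These are just generic checks (str and head are base R, and dt is the helper defined at the top of this document):
# Check the structure and the first few rows of the imported data
str(d)
head(d)
dt(d)  # interactive, searchable table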
If you want to run an analysis for a single site (factor level) at a time, you can get a data frame for just that factor level.
sA<-subset(d,d$site == "A")
sA<-droplevels(sA)
sB<-subset(d,d$site == "B")
sB<-droplevels(sB)
lasty<-subset(d,d$site == "last_year")
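If you are not sure which site codes are present in your own data, you can count the rows per level before subsetting. A minimal sketch, assuming the grouping column is called site as it is here:
# Number of study points recorded at each site
table(d$site)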
Here you can choose whether you are working with all the data (d) or with one of the subsets. Just remove the options you don't want, leaving the one you are working with.
df<-sA
df<-sB
df <- lasty
df<-d ## By default we will use all the data
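A quick check that df now holds the rows you intended (the site column name is the one used above):
# How many rows, and which sites, are in the working data frame?
nrow(df)
table(df$site)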
Choose your x and y variables
This example will look at pine density against insolation. You can change the variables in this chunk and the labels when running any other analyses.
df$x<-df$sol
df$y<-df$pine_density
xlabel<-"Daily insolation (kWh / m²)"
ylabel <-"Density of pine saplings (n/m²)"
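Before fitting anything it is sensible to confirm that both chosen columns really are numeric and to count any missing values. A sketch that should work with any pair of variables:
# Both variables must be numeric for a regression
str(df[, c("x", "y")])
sum(is.na(df$x))
sum(is.na(df$y))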
Change the name of the variable to match a numerical variable in your own data set. The command removes NAs just in case you have them.
summary(df$y,na.rm=TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.557 1.273 1.417 2.308 4.138
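Because dplyr is loaded at the top of the document, you could also build your own small summary table. A sketch (the column names are just suggestions):
df %>%
  summarise(n = sum(!is.na(y)),
            mean = mean(y, na.rm = TRUE),
            sd = sd(y, na.rm = TRUE),
            median = median(y, na.rm = TRUE))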
confint(lm(df$y~1))
## 2.5 % 97.5 %
## (Intercept) 1.268132 1.566388
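The intercept-only model above gives a 95% confidence interval for the mean of y. If it helps to see the connection, the same interval can also be obtained from a one-sample t-test:
# The same 95% CI for the mean, via a one-sample t-test
t.test(df$y)$conf.int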
Useful for your own quick visualisation.
boxplot(df$y)
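If you prefer a ggplot version of the boxplot (for example, so you can reuse the axis label), something along these lines should work:
ggplot(df, aes(y = y)) +
  geom_boxplot() +
  ylab(ylabel)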
This uses ggplot. Change the binwidth if you want to use this.
g0<-ggplot(df,aes(x=y))
g0+geom_histogram(color="grey",binwidth = 0.5)+xlab(ylabel)
In this data set there are two numerical variables, so we can run a linear regression.
g0<-ggplot(df,aes(x=x,y=y))
g0+geom_point() +xlab(xlabel) +ylab(ylabel)
g0<-ggplot(df,aes(x=x,y=y))
g1<-g0+geom_point() + geom_smooth(method="lm")
g1 + xlab(xlabel) + ylab(ylabel)
Change the names of the variables in the first line.
mod<-lm(data= df, y~x)
summary(mod)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44012 -0.86329 -0.09709 0.86738 2.65232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1513 0.7383 2.914 0.00408 **
## x -0.4150 0.4153 -0.999 0.31910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9642 on 161 degrees of freedom
## Multiple R-squared: 0.006165, Adjusted R-squared: -7.39e-06
## F-statistic: 0.9988 on 1 and 161 DF, p-value: 0.3191
anova(mod)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 0.928 0.92850 0.9988 0.3191
## Residuals 161 149.667 0.92961
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 0.6932133 3.6093523
## x -1.2351364 0.4050692
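When writing up, you usually need the slope, its confidence interval, the adjusted R-squared and the p-value. Rather than retyping them from the printout, they can be pulled out of the fitted model directly; a sketch:
# Values commonly reported for a simple linear regression
coef(mod)                           # intercept and slope
confint(mod)["x", ]                 # 95% CI for the slope
summary(mod)$adj.r.squared          # adjusted R-squared
summary(mod)$coefficients["x", 4]   # p-value for the slope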
Look at the regression handout to understand these plots.
plot(mod,which=1)
plot(mod,which=2)
plot(mod,which=3)
plot(mod,which=4)
plot(mod,which=5)
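To see the standard diagnostic plots together on one page, you can temporarily change the base graphics layout:
# Four diagnostic panels on one page
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))  # reset the layout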
Use this if all else fails. It is not needed with these data, but it is included for reference.
g0<-ggplot(df,aes(x=rank(x),y=rank(y)))
g0+geom_point() + geom_smooth(method="lm") +xlab(paste("Rank",xlabel)) +ylab(paste("Rank",ylabel))
cor.test(df$x,df$y,method="spearman")
##
## Spearman's rank correlation rho
##
## data: df$x and df$y
## S = 813720, p-value = 0.1051
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1274083
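As with the linear model, you can store the test and pull out just the numbers you need to report; a minimal sketch:
# Store the test so the statistics can be quoted directly
sp <- cor.test(df$x, df$y, method = "spearman")
sp$estimate   # rho
sp$p.value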
Only use this if you suspect that the relationship is not well described by a straight line.
library(mgcv)
g0<-ggplot(df,aes(x=x,y=y))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab(xlabel) + ylab(ylabel)
In this case the fitted line is very similar to the linear model. Get a summary using this code.
mod<-gam(data=df, y~s(x))
summary(mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41726 0.07507 18.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x) 1.749 2.188 1.245 0.284
##
## R-sq.(adj) = 0.0119 Deviance explained = 2.26%
## GCV = 0.93426 Scale est. = 0.9185 n = 163
If you do use this model, remember that it is only needed if you can't use linear regression. Report the adjusted R-squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). If the spline is significant you must include the figure in your report, as that is the only way to show the shape of the response.
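If the smooth term does turn out to be significant, the fitted spline can also be plotted straight from the model using mgcv's own plotting method. This is just a sketch, and shows essentially the same information as the geom_smooth figure above:
# Fitted smooth with partial residuals and a confidence band
plot(mod, residuals = TRUE, shade = TRUE, xlab = xlabel, ylab = ylabel)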