Chapter 18 Regression
18.1 Data
In this data set there are two numerical variables. So we can run a linear regresion.
d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)
18.2 Scatterplot without fitted line
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g0+geom_point()
18.3 Scatterplot with fitted line and labels
Type the text you want for the x and y axes to replace the variable names
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0+geom_point() + geom_smooth(method="lm")
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")
18.4 Fitting a model
Change the names of the variables in the first line.
mod<-lm(data= d, BTVolume~Lshell)
18.4.1 Model summary
summary(mod)
##
## Call:
## lm(formula = BTVolume ~ Lshell, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.828 -2.672 0.147 2.235 17.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.02385 3.33917 -10.79 <2e-16 ***
## Lshell 0.59754 0.03096 19.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.864 on 111 degrees of freedom
## Multiple R-squared: 0.7704, Adjusted R-squared: 0.7684
## F-statistic: 372.5 on 1 and 111 DF, p-value: < 2.2e-16
18.4.2 Model anova table
anova(mod)
## Analysis of Variance Table
##
## Response: BTVolume
## Df Sum Sq Mean Sq F value Pr(>F)
## Lshell 1 8811.4 8811.4 372.49 < 2.2e-16 ***
## Residuals 111 2625.7 23.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
18.4.3 Confidence intervals for the model parameters
confint(mod)
## 2.5 % 97.5 %
## (Intercept) -42.6406346 -29.4070662
## Lshell 0.5361881 0.6588891
18.4.4 Extracting residuals
d$residuals<-residuals(mod)
18.4.5 Model diagnostics
Look at the regression handout to understand these plots.
plot(mod,which=1)
plot(mod,which=2)
plot(mod,which=3)
plot(mod,which=4)
plot(mod,which=5)
18.4.6 Spearman’s rank correlation
Used if all else fails. Not needed with these data, but included for reference.
g0<-ggplot(d,aes(x=rank(Lshell),y=rank(BTVolume)))
g0+geom_point() + geom_smooth(method="lm")
cor.test(d$Lshell,d$BTVolume,method="spearman")
##
## Spearman's rank correlation rho
##
## data: d$Lshell and d$BTVolume
## S = 26143, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8912809
18.5 Fitting a spline
Only use if you suspect that the relationship is not well described by a straight line.
library(mgcv)
g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")
In this case the line is the same as the linear model. Get a summary using this code.
mod<-gam(data=d, BTVolume~s(Lshell))
summary(mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## BTVolume ~ s(Lshell)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.8142 0.4557 61.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Lshell) 1.493 1.847 198.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.77 Deviance explained = 77.3%
## GCV = 23.993 Scale est. = 23.463 n = 113
If you do use this model remember that its only needed if you can’t use linear regression. Report the ajusted R squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). You must include the figure in your report, as that is the only way to show the shape of the response.