Chapter 4 Regression

4.1 Data

In this data set there are two numerical variables. So we can run a linear regresion.

d<-read.csv("https://tinyurl.com/aqm-data/mussels.csv")
dt(d)

4.2 Scatterplot without fitted line

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g0+geom_point()

4.3 Scatterplot with fitted line and labels

Type the text you want for the x and y axes to replace the variable names

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0+geom_point() + geom_smooth(method="lm") 
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")

4.4 Fitting a model

Change the names of the variables in the first line.

mod<-lm(data= d, BTVolume~Lshell)

4.4.1 Model summary

summary(mod)
## 
## Call:
## lm(formula = BTVolume ~ Lshell, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.828  -2.672   0.147   2.235  17.404 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.02385    3.33917  -10.79   <2e-16 ***
## Lshell        0.59754    0.03096   19.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.864 on 111 degrees of freedom
## Multiple R-squared:  0.7704, Adjusted R-squared:  0.7684 
## F-statistic: 372.5 on 1 and 111 DF,  p-value: < 2.2e-16

4.4.2 Model anova table

anova(mod)
## Analysis of Variance Table
## 
## Response: BTVolume
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Lshell      1 8811.4  8811.4  372.49 < 2.2e-16 ***
## Residuals 111 2625.7    23.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.4.3 Confidence intervals for the model parameters

confint(mod)
##                   2.5 %      97.5 %
## (Intercept) -42.6406346 -29.4070662
## Lshell        0.5361881   0.6588891

4.4.4 Extracting residuals

d$residuals<-residuals(mod)

4.4.5 Model diagnostics

Look at the regression handout to understand these plots.

plot(mod,which=1)

plot(mod,which=2)

plot(mod,which=3)

plot(mod,which=4)

plot(mod,which=5)

4.4.6 Spearman’s rank correlation

Used if all else fails. Not needed with these data, but included for reference.

g0<-ggplot(d,aes(x=rank(Lshell),y=rank(BTVolume)))
g0+geom_point() + geom_smooth(method="lm") 

cor.test(d$Lshell,d$BTVolume,method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  d$Lshell and d$BTVolume
## S = 26143, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8912809

4.5 Fitting a spline

Only use if you suspect that the relationship is not well described by a straight line.

library(mgcv)

g0<-ggplot(d,aes(x=Lshell,y=BTVolume))
g1<-g0 + geom_point() + geom_smooth(method="gam", formula =y~s(x))
g1 + xlab("Some text for the x asis") + ylab("Some text for the y axis")

In this case the line is the same as the linear model. Get a summary using this code.

mod<-gam(data=d, BTVolume~s(Lshell))
summary(mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## BTVolume ~ s(Lshell)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.8142     0.4557   61.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(Lshell) 1.493  1.847 198.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.77   Deviance explained = 77.3%
## GCV = 23.993  Scale est. = 23.463    n = 113

If you do use this model remember that its only needed if you can’t use linear regression. Report the ajusted R squared value, the estimated degrees of freedom and the p-value for the smooth term (not the intercept). You must include the figure in your report, as that is the only way to show the shape of the response.