library(tidyverse)
library(aqm)
library(plotly)

Read in the data

data(sleep)
d<-sleep

If you read in the data using read_csv you would find that the Species and Diet variables are considered to be of the class “character”. This behaviour differs from the older read.csv function.

For most analyses it is preferable to set the variables to be factors. The code below does this.

d$Diet<- as.factor(d$Diet)
d$Species<- as.factor(d$Species)

Now we already know that these variables need log transforming.

d$LogBrainWt <- log10(d$BrainWt)
d$LogBodyWt <- log10(d$BodyWt)

Using ggplot for visualising regression

g0 <- ggplot(d, aes(x=LogBodyWt, d$LogBrainWt, label=Species)) # set up the aesthetics
g1 <- g0 + geom_point() + geom_smooth(method="lm") # Add the geometries
g1

We can label the points, but this is going to look very messy.

g1 + geom_label()

Using plotly for interactive data exploration

Although this cannot be used in a publication, interactive data exploration using plotly is the easiest way to solve this problem when you are exploring the data yourself.

ggplotly(g1)

Fitting the regression

We can fit the regression in the usual manner.

mod <- lm(data=d, LogBrainWt~LogBodyWt)

Note the warning about not being interested in elements such as p-values. Although we can ask R for a summary of the model we have no interest in the significance. The \(R^2\) may be of some limited interest. However the real reason for fitting the model is to obtain the residuals in this case.

summary(mod)
## 
## Call:
## lm(formula = LogBrainWt ~ LogBodyWt, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74953 -0.22133 -0.02063  0.18531  0.82617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92489    0.04670   19.80   <2e-16 ***
## LogBodyWt    0.76407    0.03153   24.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3168 on 51 degrees of freedom
## Multiple R-squared:  0.9201, Adjusted R-squared:  0.9185 
## F-statistic: 587.2 on 1 and 51 DF,  p-value: < 2.2e-16

Obtaining the residuals

The residuals can be used in additional analyses as an indicator of brain weight after controlling for body weight. Note that they remain on a logarithmic scale.

d$brain_index <- residuals(mod)

Some tricks to plot the residuals

The fct_reorder function that is part of the tidyverse can be handy for reordering a factor.

d$Species <- fct_reorder(d$Species, d$brain_index, min)

Now we can get a rather “busy” but useful visualisation of the residuals. Notice the use of geom_col to produce columns. I.e. we get a barplot. The bars can be coloured by diet for added effect. The default colours are not very nice and can be changed quite easily, but I will leave it there for now.

g1<-ggplot(d, aes(x=Species,y=brain_index, label=Species, fill=Diet)) + geom_col() +  coord_flip()
g1