Grammar of Graphics plots (ggplots) were designed by Hadley Wickam, who also programmed dplyr. The two packages share the same aproach to high level declarative programming. The syntax differs from base R graphics syntax in various ways and can take some time to get used to. However ggplots provides an extremely elegant framework for building really nice looking figures with comparatively few lines of code. Like the older lattice plots, ggplots are quite prescriptive and will make a lot of the decisions for you, although most of the default settings can be changed.
A very useful resource is provided by the R cookbook pages.
http://www.cookbook-r.com/Graphs/
A detailed tutorial of gggplots is provided in chapter 3 of R for data science
https://r4ds.had.co.nz/data-visualisation.html
Some additional ideas are available here
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
A flip book approach is taken here.
https://evamaerey.github.io/ggplot_flipbook/ggplot_flipbook_xaringan.html#23
Typical histograms use only one variable at a time, although they may be “conditioned” by some grouping variable. The aim of a histogram is to show the distribution of the variable clearly. Let’s try ggplot histograms in ggplot2 using some simple data on mussel shell length at six sites.
library(ggplot2)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
# devtools::install_github("dgolicher/aqm") To install aqm package If not working on the server
library(aqm)
##
## Attaching package: 'aqm'
## The following object is masked from 'package:stats':
##
## dt
data(mussels)
d<-mussels
str(d)
## 'data.frame': 113 obs. of 3 variables:
## $ Lshell : num 122.1 100.1 100.7 102.3 94.9 ...
## $ BTVolume: int 39 21 23 22 20 22 21 18 21 15 ...
## $ Site : Factor w/ 6 levels "Site_1","Site_2",..: 6 6 6 6 6 6 6 6 6 6 ...
The first step when building a ggplot is to decide how the data will be mapped onto the elements that make up the plot. The term for this in ggplot speak is “aesthetics”- Personally I find the term rather odd and potentially misleading. I would instinctively assume that aesthetics refers to the colour scheme or other visual aspect of the final plot. In fact the aesthetics are the first thing to decide on, rather than the last.
The easiest way to build a plot is by first forming an invisible object which represents the mapping of data onto the page. The way these data are presented can then be changed. The only aesthetics (mappings) that you need to know about for basic usage are x,y, colour, fill and group. The x and y mappings coincide with axes, so are simple enough. Remember that a histogram maps onto the x axis. The y axis shows either frequency or density so is not mapped directly as a variable in the data. So to produce a histogram we need to provide ggplot2 with the name of the variable we are going to use. We do that like this.
g0 <- ggplot(data=d,aes(x=Lshell))
Once the mapping is established producing a histogram, or any other figure, involves deciding on the geometry used to depict the aesthetics. The default is very simple.
g0 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are several things to notice here. One is that the default theme has a grey background and looks rather like some of the figures in Excel. For some purposes this can be useful. However you may prefer a more traditional black and white theme. This is easy to change. You are also warned that the default binwidth may not be ideal.
My own default settings for a histogram would therefore look more like this.
g1<-g0 +geom_histogram(fill="grey",colour="black",binwidth=10) + theme_bw()
g1
This should be self explanatory. The colour refers to the lines. It is usually a good idea to set the binwidth manually anyway. Notice that this time I have assigned the results to another object. We can then work with this to produce conditional histograms.
You can set the theme to black and white for all subsequent plots with a command.
theme_set(theme_bw())
Conditioning the data on one grouping variable is very simple using a facet_wrap. Facets are the term used in ggplots for the panels in a lattice plot. There are two facet functions. Facet_wrap simply wraps a one dimensional set of panels into what should be a convenient number of columns and rows. You can set the number of columns and rows if the results are not as you want.
g_sites<-g1+facet_wrap(~Site)
g_sites
Density plots are produced in similar manner to histograms.
g1<-g0 +geom_density(fill="grey",colour="black")
g1
g_sites<-g1+facet_wrap(~Site)
g_sites
Adding a grouping aesthetic allow subgroups to be plotted on the same figure.
g_group<-ggplot(d, aes(Lshell, group=Site)) + geom_density()
g_group
Colour and fill are also grouping aesthetics. So a nicer way of showing this would be use them instead.
g_group<-ggplot(d, aes(Lshell, colour = Site,fill=Site)) + geom_density(alpha = 0.2)
g_group
A grouped boxplot uses the grouping variable on the x axis. So we need to change the aesthetic mapping to reflect this.
g0 <- ggplot(d,aes(x=Site,y=Lshell))
g_box<-g0 + geom_boxplot(fill="grey",colour="black")+theme_bw()
g_box
You should be able to work out that sets of boxplots could be conditioned on a third variable using faceting.
Species<-as.factor(rep(c("Sp_1","Sp_2"),each=20))
Site<-rep(c("Site_1","Site_2"),times=20)
resp<-rnorm(40,10,4)
d2<-data.frame(Species,resp,Site)
g_box<-ggplot(d2,aes(y=resp,x=Species))+geom_boxplot(fill="grey",colour="black")
g_box
g_box+facet_wrap(~Site)
One important rule that you should try to follow when presenting data and carrying out any statistical test is to show confidence intervals for key parameters. Remember that boxplots show the actual data. Parameters extracted from the data are means when the data are grouped by a factor. When two or more numerical variables are combined the parameters refer to the statistical model, as in the case of regression.
Grammar of graphics provides a convenient way of adding statistical summaries to the figures. We can show the position of the mean mussel shell length for each site simply by asking to plot the mean for each y like this.
g0 <- ggplot(d,aes(x=Site,y=Lshell))
g_mean<-g0+stat_summary(fun.y=mean,geom="point")
g_mean
This is not very useful. However you can easily add confidence intervals
g_mean+stat_summary(fun.data=mean_cl_normal,geom="errorbar")
If you want more robust confidence intervals with no assumption of normality of the residulas then you can use bootstrapping. In this case the result should be just about identical, as the errors are approximately normally distributed.
g_mean+stat_summary(fun.data=mean_cl_boot,geom="errorbar")
The traditional “dynamite” plots with a confidence interval over a bar can be formed in the same way.
g0 <- ggplot(d,aes(x=Site,y=Lshell))
g_mean<-g0+stat_summary(fun.y=mean,geom="bar")
g_mean+stat_summary(fun.data=mean_cl_normal,geom="errorbar")
Most statisticians prefer that the means are shown as points rather than bars. You may want to look at the discussion on this provided by Ben Bolker.
One way to infer differences between medians is to plot boxplots with notches.
g0 <- ggplot(d,aes(x=Site,y=Lshell))
g_box<-g0 + geom_boxplot(fill="grey",colour="black", notch=TRUE)+theme_bw()
g_box
# Function included in aqm package
# median_cl_boot <- function(x, conf = 0.95) {
# lconf <- (1 - conf)/2
# uconf <- 1 - lconf
# require(boot)
# bmedian <- function(x, ind) median(x[ind])
# bt <- boot(x, bmedian, 1000)
# bb <- boot.ci(bt, type = "perc")
# data.frame(y = median(x), ymin = quantile(bt$t, lconf), ymax = quantile(bt$t,
# uconf))
# }
g0+ stat_summary(fun.data = aqm::median_cl_boot, geom = "errorbar") + stat_summary(fun.y = median, geom = "point")
Scatterplots can be built up in a similar manner. We first need to define the aesthetics. In this case there are clearly a and y coordinates that need to be mapped to the names of the variables.
g0 <- ggplot(d,aes(x=Lshell,y=BTVolume))
g0+geom_point()
It is very easy to add a regression line with confidence intervals to the plot.
g0+geom_point()+geom_smooth(method = "lm", se = TRUE)
Although the syntax reads “se=TRUE” this refers to 95% confidence intervals.
There are various ways of plotting regressions for each site. We could define a colour aesthetic that will automatically group the data and then plot all the lines as one figure.
g0 <- ggplot(d,aes(x=Lshell,y=BTVolume,colour=Site))
g1<-g0+geom_point()+geom_smooth(method = "lm", se = TRUE)
g2<-g1+geom_point(aes(col=factor(Site)))
g2
This can be split into panels using facet wrapping.
g2+facet_wrap(~Site)
Many models involving two variables can be visualised using ggplot.
d<-read.csv(system.file("extdata", "marineinverts.csv", package = "aqm"))
str(d)
## 'data.frame': 45 obs. of 4 variables:
## $ richness: int 0 2 8 13 17 10 10 9 19 8 ...
## $ grain : num 450 370 192 194 197 ...
## $ height : num 2.255 0.865 1.19 -1.336 -1.334 ...
## $ salinity: num 27.1 27.1 29.6 29.4 29.6 29.4 29.4 29.6 29.6 29.6 ...
g0<-ggplot(d,aes(x=grain,y=richness))
g1<-g0+geom_point()+geom_smooth(method="lm", se=TRUE)
g1
g2<-g0+geom_point()+geom_smooth(method="lm",formula=y~x+I(x^2), se=TRUE)
g2
g3<-g0+geom_point()+geom_smooth(method="loess", se=TRUE)
g3
You can also use a gam directly.
g4<-g0+geom_point()+stat_smooth(method = "gam", formula = y ~ s(x))
g4
The plots can be arranged on a single page using the multiplot function taken from a cookbook for R.
multiplot(g1,g2,g3,g4,cols=2)
Ggplots conveniently show the results of prediction from a generalised linear model on the response scale.
glm1<-g0+geom_point()+geom_smooth(method="glm", method.args=list(family="poisson"), se=TRUE) +ggtitle("Poisson")
glm1
glm2<-g0+geom_point()+geom_smooth(method="glm", method.args=list(family="quasipoisson"), se=TRUE) + ggtitle("Quasipoisson")
glm2
library(MASS)
glm3<-g0+geom_point()+geom_smooth(method="glm.nb", se=TRUE) +ggtitle("Negative binomial")
glm3
multiplot(glm1,glm2,glm3,cols=2)
The same approach can be taken to binomial data.
ragworm<-read.csv(system.file("extdata", "ragworm_test3.csv", package = "aqm"))
str(ragworm)
## 'data.frame': 100 obs. of 2 variables:
## $ presence: int 1 1 0 0 1 1 1 0 1 1 ...
## $ salinity: num 2.55 2.21 3.39 2.96 1.88 ...
g0 <- ggplot(ragworm,aes(x=salinity,y=presence))
g1<-g0+geom_point()+stat_smooth(method="glm",formula = y~ x, method.args=list(family="binomial"))+ggtitle("Linear")
g2<-g0+geom_point()+stat_smooth(method="glm",formula = y~ x+I(x^2), method.args=list(family="binomial"))+ggtitle("Polynomial")
g3<-g0+geom_point()+stat_smooth(method = "gam", formula = y ~ s(x), method.args=list(family="binomial"))+ggtitle("GAM")
multiplot(g1,g2,g3,cols=2)