Chapter 9 Introduction to grammar of graphics plots
To produce figures that can be polished to publication quality we will use the package ggplot2.
Grammar of Graphics plots (ggplots) were designed by Hadley Wickham, who also wrote dplyr. This is a “tidy” approach to programming that is more powerful than base R graphics. The syntax differs from base R syntax in various ways, and it can take some time to get used to. However, ggplot2 provides an extremely elegant framework for building really nice looking figures with comparatively few lines of code.
9.1 Simple bar charts
The simplest figures of all are probably barcharts.
Barcharts are used when the data consist of counts or percentages. Sometimes barcharts have been used to show means. This sort of barchart with confidence intervals is inferential and is also known as a “dynamite chart.” Although they are still used in some publications, dynamite plots are best avoided. We will return to this issue, but you may want to have a quick look at http://emdbolker.wikidot.com/blog:dynamite
Genuine barcharts are non inferential. In other words no statistical tests are associated with them directly. They are simply used to display information.
Let’s look at the simplest case possible where barcharts may be used. At the end of September 2019 a group of Bournemouth University students monitored the traffic on the roads around the campus. They provided data on the counts of different types of vehicles passing each hour. Here is their data as they summarised it.
These very simple data have no replication at all. Therefore there is no possibility of conducting any form of inferential statistical analysis. Sometimes a chi-squared goodness of fit test can be used to test whether there is a significant difference between counts in each group, but this would add nothing in this case as the differences are very apparent. However, we can show the results to the reader more clearly through a figure than a table.
To form a standard barchart in ggplot2 we first need to decide how the data will be mapped onto the elements that make up the plot. The term for this in ggplot speak is “aesthetics”. Personally I find the term aesthetics a bit misleading as a name for mapping. I would instinctively assume that aesthetics refers to the colour scheme or other visual aspect of the final plot. In fact the aesthetics are the first thing to decide on, rather than the last.
The way to build a ggplot is by first forming an invisible object which represents the mapping of data onto the page. The way these data are presented can then be changed through adding different geometries. The only aesthetics (mappings) that you need to know about for basic usage are x, y, colour, fill, group and label. The x and y mappings coincide with axes, so are simple enough. Remember that a histogram maps a variable onto the x axis only. The y axis shows either frequency or density, so it is not mapped directly to a variable in the data.
The variable on the x axis in this case would be the Vehicle type. The count of the number of vehicles would be placed on the y axis. If we want to label the figure with the actual number counted (which is good practice for barcharts if feasible) we could add a label aesthetic as well.
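As a minimal sketch of this mapping step, the code below builds the “invisible” plot object. The data frame name d, the object name g0 and the counts are placeholders invented for illustration (they are not the students’ survey data); only the column names Vehicle and Count follow the text.

```r
library(ggplot2)

# Placeholder counts invented for illustration; NOT the students' survey data
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

# Map vehicle type to x, count to y, and reuse the count as a text label.
# No geometry has been added yet, so nothing is drawn apart from empty axes.
g0 <- ggplot(d, aes(x = Vehicle, y = Count, label = Count))
```

Nothing about colours or themes has been decided at this point; the object only records which column goes where.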
Now let’s plot out the barplot. We do that by adding a geometry. There are two geometries that could be used here. The simplest in this case is to use geom_col and add the label. We’ll assign the result to g1, then type the name g1 to plot it. That way if we want to continue modifying our plot we just add to g1.
Note that if geom_bar is used with a table of counts, you need to tell R to use the “identity” stat. The default stat for geom_bar is count, i.e. geom_bar itself will form a count table from raw data.
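A sketch of how this step might look, assuming a data frame d with columns Vehicle and Count. The counts are placeholders invented for illustration, and using geom_label for the labels is an assumption (geom_text would also work).

```r
library(ggplot2)

# Placeholder counts invented for illustration; NOT the students' survey data
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

# geom_col draws one bar per row with height taken from y;
# geom_label prints the label aesthetic at the top of each bar
g1 <- ggplot(d, aes(x = Vehicle, y = Count, label = Count)) +
  geom_col() +
  geom_label()

g1  # typing the name draws the plot
```

Because the plot is stored in g1, later modifications can be written as g1 + something rather than repeating the whole call.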
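The difference between the two stats can be sketched as follows. The data frame d holds a table of counts (placeholder values invented for illustration), while raw is a hypothetical raw data set with one row per vehicle observed, built here only to show the default behaviour.

```r
library(ggplot2)

# Placeholder table of counts invented for illustration
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

# Table of counts: the bar heights are already in the data, so use stat = "identity"
g_table <- ggplot(d, aes(x = Vehicle, y = Count)) +
  geom_bar(stat = "identity")

# Hypothetical raw data: one row per vehicle seen, so no y aesthetic is needed
raw <- data.frame(Vehicle = rep(d$Vehicle, times = d$Count))

# Default stat ("count"): geom_bar tabulates the rows itself
g_raw <- ggplot(raw, aes(x = Vehicle)) +
  geom_bar()
```

Both objects should draw the same bars; geom_col is simply a shortcut for the first form.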
This figure looks OK, but it would be much clearer to have the bars in a ranked order.
To do this in R we use the following code. First tell R to arrange the data frame according to Count. By default arrange sorts in ascending order; to place the rows in descending order use -Count. Then a little trick is used to relevel the Vehicle factor levels to match the arrangement.
We are going to have to set up the aesthetics again to use this, as the data have changed.
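One way this could be written is sketched below, assuming dplyr is installed and using placeholder counts invented for illustration. The releveling trick shown (passing the sorted column as its own levels) is one common approach and may differ from the original chunk.

```r
library(ggplot2)
library(dplyr)

# Placeholder counts invented for illustration; NOT the students' survey data
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

d <- d %>%
  arrange(-Count) %>%                                  # largest count first
  mutate(Vehicle = factor(Vehicle, levels = Vehicle))  # levels follow the new row order

# The data have changed, so the aesthetics are set up again
g1 <- ggplot(d, aes(x = Vehicle, y = Count, label = Count)) +
  geom_col() +
  geom_label()

g1
```

Without the releveling step ggplot would place the bars in alphabetical order regardless of how the rows are sorted.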
By default the background used for the first figure was grey. We can also change the look and feel of subsequent plots by setting the theme. A black and white theme might be better for printing.
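A minimal sketch of setting the theme for all subsequent plots. The data frame and its counts are placeholders invented for illustration.

```r
library(ggplot2)

# theme_set changes the default for every plot drawn afterwards
# and returns the previous theme so that it can be restored later
old_theme <- theme_set(theme_bw())

# Placeholder counts invented for illustration
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

g1 <- ggplot(d, aes(x = Vehicle, y = Count, label = Count)) +
  geom_col() +
  geom_label()

g1  # now drawn on a white background with a black frame
```

To change a single plot only, add the theme to that plot instead, for example g1 + theme_bw().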
To show the results as percentages we can calculate them using another line of dplyr code.
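A sketch of the percentage calculation, assuming dplyr is installed. The column name Percent and the counts are assumptions made for illustration.

```r
library(ggplot2)
library(dplyr)

# Placeholder counts invented for illustration; NOT the students' survey data
d <- data.frame(
  Vehicle = c("Car", "Bus", "Bicycle", "Lorry"),
  Count   = c(140, 20, 30, 10)
)

# Each count as a percentage of the total, rounded to one decimal place
d <- d %>%
  mutate(Percent = round(100 * Count / sum(Count), 1))

# Same plot as before, with Percent mapped to y and used as the label
g2 <- ggplot(d, aes(x = Vehicle, y = Percent, label = Percent)) +
  geom_col() +
  geom_label()

g2
```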
9.1.1 Exercise
- The following code chunk produces the number of votes cast for each party in Bournemouth West.
Form barcharts using both counts and percentages.
- The following code simulates responses to a question on the Likert scale
Form a barchart using these data.
Note that analysing a full questionnaire would involve looking at the responses to many questions simultaneously. R has some additional graphical tools for this.
9.2 Simple Line charts
Do not confuse genuine line charts with scatterplots with a fitted line. Line charts are usually non inferential figures (i.e. they do not show confidence intervals). A common use for a line chart is to show a time series.
We’ll extract a portion of data on rainfall from a larger data set. The code below produces a small data frame of annual rainfall measured at the Hurn meteorological station.
We’ll set up the aesthetics. The x axis is the year, the rainfall goes on the y axis. We might want to use the number as a label.
Now, adding geom_line produces a basic line chart.
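The two steps above can be sketched as follows. The data frame name rain, its column names Year and Rainfall, and the values are all placeholders invented for illustration; they are not measurements from the Hurn station.

```r
library(ggplot2)

# Placeholder annual totals invented for illustration; NOT the Hurn measurements
rain <- data.frame(
  Year     = 2010:2015,
  Rainfall = c(800, 650, 1000, 900, 1050, 750)
)

# Year on the x axis, rainfall on the y axis, rainfall also available as a label
g1 <- ggplot(rain, aes(x = Year, y = Rainfall, label = Rainfall)) +
  geom_line()

g1
```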
This might look better if the actual numbers are shown. This only works for short time series with few values. If there are more values the plot would look too cluttered.
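A sketch of adding the numbers to the line, again with placeholder data invented for illustration. Using geom_label is an assumption; geom_text would draw the numbers without a box.

```r
library(ggplot2)

# Placeholder annual totals invented for illustration; NOT the Hurn measurements
rain <- data.frame(
  Year     = 2010:2015,
  Rainfall = c(800, 650, 1000, 900, 1050, 750)
)

# geom_label uses the label aesthetic to print each value at its point
g1 <- ggplot(rain, aes(x = Year, y = Rainfall, label = Rainfall)) +
  geom_line() +
  geom_label()

g1
```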
Now we might want to add a title and some text for the x and y axes.
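Titles and axis text can be added with labs, as in this sketch. The data are placeholders invented for illustration, and the wording of the labels (including the assumption that rainfall is recorded in mm) is for illustration only.

```r
library(ggplot2)

# Placeholder annual totals invented for illustration; NOT the Hurn measurements
rain <- data.frame(
  Year     = 2010:2015,
  Rainfall = c(800, 650, 1000, 900, 1050, 750)
)

g1 <- ggplot(rain, aes(x = Year, y = Rainfall, label = Rainfall)) +
  geom_line() +
  geom_label()

# labs sets the title and the axis text in one call (units are an assumption)
g1 <- g1 +
  labs(
    title = "Annual rainfall at Hurn",
    x     = "Year",
    y     = "Total rainfall (mm)"
  )

g1
```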
This looks better, but it’s a bit difficult to see which year each number applies to. We can set the breaks on the continuous x axis with scale_x_continuous(breaks = )
A final touch might be to rotate the text on the x axis.
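A sketch of setting one break per year, with placeholder data invented for illustration. Passing the Year column itself as the breaks is one simple choice; any numeric vector of positions would work.

```r
library(ggplot2)

# Placeholder annual totals invented for illustration; NOT the Hurn measurements
rain <- data.frame(
  Year     = 2010:2015,
  Rainfall = c(800, 650, 1000, 900, 1050, 750)
)

g1 <- ggplot(rain, aes(x = Year, y = Rainfall, label = Rainfall)) +
  geom_line() +
  geom_label() +
  scale_x_continuous(breaks = rain$Year)  # one tick and label for every year

g1
```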
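Rotating the axis text is done through theme, as in this sketch. The data are placeholders invented for illustration, and the 90 degree angle with the vjust and hjust values is just one reasonable choice.

```r
library(ggplot2)

# Placeholder annual totals invented for illustration; NOT the Hurn measurements
rain <- data.frame(
  Year     = 2010:2015,
  Rainfall = c(800, 650, 1000, 900, 1050, 750)
)

g1 <- ggplot(rain, aes(x = Year, y = Rainfall)) +
  geom_line() +
  scale_x_continuous(breaks = rain$Year) +
  # Rotate the x axis text by 90 degrees and keep it aligned with the ticks
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

g1
```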
9.2.1 Exercise
These lines of code produce a data frame with the mean annual temperature.
Form a line chart of the mean annual temperatures at Hurn.
9.3 Dynamic graphs
There is a growing number of packages for producing dynamic graphs in R for web pages. One that is very useful for time series is the dygraphs package. This can produce figures with rolling averages, which smooth a time series by averaging the last n observations. So if the rolling average is set to 12 on a monthly series, each point on the chart shows the mean of the previous 12 months.
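A minimal sketch of a rolling average, assuming the dygraphs package is installed. The monthly series is simulated purely for illustration and the object names rain_ts and dg are hypothetical.

```r
library(dygraphs)

# Simulated monthly values for illustration only; NOT real rainfall data
set.seed(1)
rain_ts <- ts(round(runif(60, min = 20, max = 150)),
              start = c(2010, 1), frequency = 12)

# dyRoller adds a rolling average; with rollPeriod = 12 on monthly data
# each plotted point is the mean of the previous 12 observations
dg <- dygraph(rain_ts, main = "Monthly rainfall (simulated)") %>%
  dyRoller(rollPeriod = 12)

# Typing dg displays the interactive chart in the viewer or a web page
```

The rolling period can also be changed interactively through the small box that dyRoller adds to the bottom left of the chart.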