Chapter 4 Simple plots and figures for exploratory analysis

library(tidyverse)

There are many different ways of making figures in R. The ggplot2 package has become the standard for presentations and publications, as it produces modern, professional-looking figures. For simple data exploration, the original base graphics that form part of the underlying R language are still useful.

These sorts of figures use syntax that is easy to remember and quick to type. They tend to work on only one or two variables at once. The results are not usually very elegant, but they do help you to explore the data quickly.

4.0.1 Histograms and boxplots

4.0.2 Video 9: Forming a simple histogram

The code for forming a simple exploratory histogram is very short.

Notice the way the column that you want to plot is specified by the name of the data frame (d) followed by a dollar sign then the name of the column. This is standard R syntax.
To build your own analysis in your own markdown document you can either carefully copy and paste code from other sources (these handouts for example) or you can write your own code. Be very careful to always run code chunks in order. If, for example, you had not loaded the data frame earlier in the analysis and then tried to run the code, you would get an error of the form Error in hist(x) : object ‘x’ not found, where x stands for the missing object (here it would be d).

d<-read_csv("sleep.csv")
hist(d$BodyWt)
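If you do not have sleep.csv to hand, the same ideas can be tried on a small made-up data frame. The sketch below is only an illustration: the data frame toy, its values and the name not_loaded are all invented for this example and are not part of the sleep data.

```r
# Hypothetical stand-in for the sleep data: these values are made up.
toy <- data.frame(
  Species = c("a", "b", "c", "d", "e"),
  BodyWt  = c(0.5, 2, 35, 60, 600)
)

# The dollar sign pulls one column out of the data frame as a plain vector.
wt <- toy$BodyWt
is.numeric(wt)

# The same histogram command as above, applied to the toy column.
hist(toy$BodyWt)

# Asking for a data frame that was never created gives an error.
# tryCatch is used here only so that we can capture and read the message.
msg <- tryCatch(
  hist(not_loaded$BodyWt),
  error = function(e) conditionMessage(e)
)
msg
```

The message stored in msg names the object that R could not find, which is usually the quickest clue that an earlier chunk has not been run.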

4.0.3 Video 9: Copying and pasting code into a markdown document.


4.0.4 Video 10: Running a plot command in the console


4.0.5 Video 11: Producing a boxplot and spotting skew

boxplot(d$BodyWt)

These plots show up a very clear statistical feature of this variable, which you could also have seen in the summary statistics. The variable is very strongly right skewed.
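One quick way to see right skew in the summary statistics is to compare the mean with the median: a long right tail pulls the mean well above the median. The sketch below uses a made-up vector of weights (not the real BodyWt column) to show the pattern.

```r
# Made-up, strongly right-skewed weights (not taken from sleep.csv).
wt <- c(0.02, 0.1, 0.5, 1, 3, 10, 60, 500, 2500, 6600)

# summary() reports the median and the mean side by side.
summary(wt)

# With a long right tail the mean is larger than the median.
skewed <- mean(wt) > median(wt)
skewed

# In the boxplot the box is squashed at the bottom, with a few extreme points above it.
boxplot(wt)
```

Running summary(d$BodyWt) on the real data lets you make the same comparison.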


4.0.6 Video 11: Log transforming a variable

When we are working with data we often want to produce new columns containing new variables. R has many thousands of functions for working with data. If you want to carry out any mathematical operation at all on your data you can do it in R. The difficulty may be knowing which operation you require and finding the right syntax. In the case of a logarithmic transform it is fairly simple. However, R uses log to refer to natural logarithms (other software tends to use ln). If we want logarithms to the base 10 we need log10.
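The difference between log and log10 is easy to check on a few round numbers. This is just a small illustration; the vector x is invented for the example.

```r
# Powers of ten make the base-10 logarithm easy to read.
x <- c(1, 10, 100, 1000)

# log() gives natural logarithms (base e).
natural <- log(x)
natural

# log10() gives logarithms to the base 10.
base10 <- log10(x)
base10

# log() can also be given an explicit base, which matches log10().
explicit <- log(x, base = 10)
all.equal(base10, explicit)

# Natural and base-10 logs differ only by a constant factor, log(10).
all.equal(natural / log(10), base10)
```

Because the two differ only by a constant factor, the shape of a histogram is the same whichever base you use. Base 10 is often chosen simply because the axis is easier to read.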

The assignment operator in R is an arrow, typed as a less-than sign followed by a hyphen (<-). You can think of this as taking the values on the right and sending them into the object on the left. If the variable does not yet exist it will be created.

d$logBodyWt<-log10(d$BodyWt)

The histogram of the new variable shows the result of taking logarithms. The new variable has much better statistical properties: the strong right skew is greatly reduced.

hist(d$logBodyWt)
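The way assignment creates a new column can be seen on a toy data frame. The sketch below assumes nothing about the real file: the data frame toy and its values are made up.

```r
# Made-up weights standing in for d$BodyWt.
toy <- data.frame(BodyWt = c(0.02, 0.1, 0.5, 1, 3, 10, 60, 500, 2500, 6600))

# Before the assignment there is only one column.
names(toy)

# The arrow sends the values on the right into toy$logBodyWt.
# The column does not exist yet, so R creates it.
toy$logBodyWt <- log10(toy$BodyWt)

# Now there are two columns.
names(toy)

# Compare the summaries of the raw and the transformed variable.
summary(toy$BodyWt)
summary(toy$logBodyWt)

hist(toy$logBodyWt)
```

Note that the original column is left untouched, so you can always go back to the raw values.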


4.1 Plotting two variables

Base R is able to adapt a single command to plot two variables according to the nature of the variables involved. So if we plot a (perhaps suitably transformed) variable against a factor we will get boxplots.

When we read the data in using read_csv and asked for the structure (using str), R told us that Diet was a character vector. In order to carry out analysis on such data we have to coerce it to a factor. This is a single line of code in R.

d$Diet<-as.factor(d$Diet)
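You can check what the coercion has done by looking at the class of the column before and after. In this sketch the data frame toy and the diet categories are invented for illustration; the real Diet column may use different labels.

```r
# Made-up diet labels stored as text, as read_csv would return them.
toy <- data.frame(
  Diet = c("herbivore", "carnivore", "herbivore", "omnivore"),
  stringsAsFactors = FALSE
)

# Before coercion the column is a character vector.
class(toy$Diet)

# Coerce to a factor, exactly as in the line above.
toy$Diet <- as.factor(toy$Diet)

# After coercion it is a factor with one level per distinct label.
class(toy$Diet)
levels(toy$Diet)

# table() counts the rows in each level.
table(toy$Diet)
```

By default the levels are placed in alphabetical order, which is also the order in which the boxplots will appear.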

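The older, base R command (not run here) uses a dot instead of an underscore. In versions of R before 4.0.0 it automatically carried out this step for you. From R 4.0.0 onwards it also leaves text columns as character vectors unless you add the argument stringsAsFactors = TRUE.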

# Base R uses a dot not an underscore. read.csv not read_csv
# d<-read.csv("sleep.csv")
# Note that code that has a # before it is not run. This can be used to deactivate code that you still want to include in a document for reference purposes

However, the change implemented in the readr package was made because this default behaviour can be undesirable when more complex data are read in. As RStudio prompts you to use read_csv, this had to be pointed out. It can cause confusion! Sorry about this. So do always check your data using str.
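What base R's read.csv does with text columns is controlled by its stringsAsFactors argument, so checking with str is the safest habit. The sketch below writes a tiny made-up file to a temporary location (the file contents and object names are invented for this example) and reads it back both ways using base R only.

```r
# Write a small, made-up csv file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Species,Diet,BodyWt",
    "a,herbivore,1.2",
    "b,carnivore,30",
    "c,herbivore,450"),
  tmp
)

# Read the file keeping text columns as character vectors.
with_text <- read.csv(tmp, stringsAsFactors = FALSE)

# Read the same file converting text columns to factors.
with_factors <- read.csv(tmp, stringsAsFactors = TRUE)

# str() shows the type of every column, so the difference is easy to spot.
str(with_text)
str(with_factors)

# Tidy up the temporary file.
unlink(tmp)
```

Notice that with stringsAsFactors = TRUE every text column becomes a factor, including Species, which is one reason why automatic conversion is not always what you want.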

plot(d$logBodyWt ~ d$Diet)

Plotting a numerical variable against another numerical variable produces a scatterplot.

plot(d$BodyWt~d$BrainWt)

This shows up the problem with using the untransformed variables in this case. In order to analyse these data you will need to log transform more than one variable.
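The effect of transforming both variables can be tried out on simulated data. Everything in this sketch is an assumption made for illustration: the data frame toy, the range of weights and the relationship between body and brain weight are invented, and are not estimates from the sleep data.

```r
# Simulated data only: body weights spread over several orders of magnitude.
set.seed(1)
toy <- data.frame(BodyWt = 10^runif(30, min = -2, max = 3))

# An invented relationship with some multiplicative noise.
toy$BrainWt <- 10 * toy$BodyWt^0.75 * 10^rnorm(30, sd = 0.2)

# On the raw scale most points are crowded into one corner of the plot.
plot(toy$BodyWt ~ toy$BrainWt)

# Log transform both variables, creating two new columns.
toy$logBodyWt  <- log10(toy$BodyWt)
toy$logBrainWt <- log10(toy$BrainWt)

# On the log scale the points are spread out and the pattern is easier to see.
plot(toy$logBodyWt ~ toy$logBrainWt)
```

The same two assignment lines, with d in place of toy, would add the transformed columns to the real data frame.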

4.1.1 Exercise

Make histograms of all the other variables in the data set. What do you find? Log transform the variables which may require transformation. Replot the data. Write up the exercise using annotated R code as a markdown document.