During the course of this unit you will develop the skills to be able to apply R code to the analysis of a range of data sets. Although crib sheets will be provided with model code for all the analytical techniques shown on the course, in order to manipulate data effectively and apply novel techniques it is important to begin to become more comfortable with basic R programming techniques and concepts. This week we will revise some of the fundamentals of R programming.

Note that dplyr and the “tidyverse” set of functions now make working with large data sets much more agile and elegant than using the original base R language alone. However, many of the functions that you will find in packages are written using base R. In order to understand other people’s code you need some familiarity with the standard R language.

Using base R to simulate data

If you understand data structures you can usually choose and run the right analysis. If you do not consider the basic properties of your data then not only will you not know which analysis to use, but you will probably not even be able to import the data into any software in order to run it!

Some very simple R commands

If you are still completely new to R try running some simple commands by stepping through them in code chunks. When working in RStudio notice that the results are shown under the code chunk when you click on the green arrow. If you are comfortable with basic R then skip this section.

1+1
## [1] 2

Experiment some more. For example:

5*10
## [1] 50
12/4
## [1] 3
3^2
## [1] 9

There is a lot more maths functionality built into the R language that you can find out about as you go on. However to follow the primer you do not need to learn any more commands than are shown in this document.

Note that when you type maths in the console the output is just printed to the screen. This is not stored anywhere in the computer’s memory. So we need to bring in the concept of data objects. A data object “holds” the numbers and can be manipulated by commands. This is how R manages to run analyses on your data. When you import a file containing data it will be stored into a data object.

The simplest data object is a variable. If a variable only contains one number it is known formally as a scalar. A variable with many numbers is a vector. So let’s assign a single value to a variable to form a scalar. We do that in R using the <- operator which is typed as a combination of the less than sign < with a horizontal bar -.

x<-5

Nothing appears to have happened! However you now have your first data object, a scalar called x. This is stored in memory. So try this.

x*2
## [1] 10

Notice that this would not have worked if you had typed X*2 as R is case sensitive.

So, how do we form a vector with multiple values? When you are analysing your own data the answer is that you usually won’t need to: you import all the values from a file. But if you wish to form a vector in the console you must use the concatenation function c(). So this gives the variable x a series of values.

x<-c(1,4,5,9,10,11,12,34,56,67,100,123,45)

Now see what happens if you type

x*2
##  [1]   2   8  10  18  20  22  24  68 112 134 200 246  90

You can carry out any sort of mathematical operation that involves x and all the values in the vector will be used. Notice that the results are just printed out to the console and lost.
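Arithmetic also works element by element between two vectors. A quick sketch:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b  # element-wise addition: 11 22 33
a * b  # element-wise multiplication: 10 40 90
```

Note that if the two vectors have different lengths R “recycles” the shorter one, which can silently produce unintended results, so it is best to keep the lengths equal.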

If you want to assign the results to a new variable you use the “<-” operator again. This is a very common practice when analysing data. So, say you are intending to work with the natural logarithm of x you might write.

logx<-log(x)

You can see that this has worked by writing the new variable name so that R prints out the contents to the console.

logx
##  [1] 0.000000 1.386294 1.609438 2.197225 2.302585 2.397895 2.484907 3.526361
##  [9] 4.025352 4.204693 4.605170 4.812184 3.806662

This time you can see more clearly the purpose of the indices in the output. The second line starts with the 9th number in the vector.

You can find the names of the data objects that are held in memory by typing ls(). In RStudio it is easier to look in the Environment tab in order to see the objects in working memory.

ls()
## [1] "logx" "x"

Data structures

Now we can begin looking at data structures. You can ask R to tell you about the structure of any object that it has in memory by typing str().

str(x)
##  num [1:13] 1 4 5 9 10 11 12 34 56 67 ...
str(logx)
##  num [1:13] 0 1.39 1.61 2.2 2.3 ...

So R has told us that both x and logx are numerical vectors. If you have a lot of data you probably do not want to look at all of it at once.

How does this relate to choosing an analysis? We have seen that this particular variable contains numbers. However in statistical analysis we also use variables that don’t consist of numbers. They vary in another respect. If you look at any introductory statistics textbook you will see a chapter that defines types of variables: numerical variables measured on interval or ratio (scale) levels, and categorical variables measured on nominal or ordinal levels.

More vector functions

As we have seen, functions act on the whole vector at once. This applies also to functions that return only one number as a result.

x<-c(1,3,5,7,4,2,7)
sum(x)
## [1] 29
mean(x)
## [1] 4.142857
length(x)
## [1] 7
var(x)
## [1] 5.47619
sd(x)
## [1] 2.340126
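A few more of these single-number summary functions from base R are worth knowing:

```r
x <- c(1, 3, 5, 7, 4, 2, 7)
median(x)  # middle value: 4
range(x)   # smallest and largest values: 1 7
max(x)     # 7
min(x)     # 1
```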

It is very important to realise that a vector with NA values can cause problems for these sorts of functions. You need to tell R explicitly to remove the NAs. If not, the result itself will be an NA. For example:

x<-c(1,2,3,4,5,6)
mean(x)
## [1] 3.5
x<-c(1,2,3,4,5,NA)
mean(x)
## [1] NA
mean(x,na.rm=T)
## [1] 3

This is a very common pitfall for beginners. In general, if something does not work as you expect look for NAs!
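Base R also lets you find the NAs directly, which is useful for tracking down this sort of problem:

```r
x <- c(1, 2, 3, 4, 5, NA)
is.na(x)              # logical vector: TRUE marks the missing value
sum(is.na(x))         # how many NAs there are: 1
which(is.na(x))       # where the NA is: position 6
sum(x, na.rm = TRUE)  # sum ignoring the NA: 15
```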

Generating sequences of numbers in R

One of the most useful features of R is the ease with which you can generate sequences of numbers and simulated data sets. To produce a sequence you can use the : syntax.

x<-0:100
0:100
##   [1]   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [19]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
##  [37]  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
##  [55]  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
##  [73]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
##  [91]  90  91  92  93  94  95  96  97  98  99 100
x<-30:10
x
##  [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10

A longer but more flexible way of producing a sequence is with seq. For example to produce even numbers between 0 and 100.

x<-seq(0,100,by=2)
x
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
## [20]  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74
## [39]  76  78  80  82  84  86  88  90  92  94  96  98 100

Say we know the length of the sequence we want but have not worked out the intervals.

x<-seq(0,100,length=23) 
x
##  [1]   0.000000   4.545455   9.090909  13.636364  18.181818  22.727273
##  [7]  27.272727  31.818182  36.363636  40.909091  45.454545  50.000000
## [13]  54.545455  59.090909  63.636364  68.181818  72.727273  77.272727
## [19]  81.818182  86.363636  90.909091  95.454545 100.000000
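Two related helpers, seq_len and seq_along, are safer than the : syntax when a length might be zero, since 1:n gives the two numbers 1 and 0 when n is 0 rather than an empty sequence:

```r
seq_len(5)        # 1 2 3 4 5
x <- c(10, 20, 30)
seq_along(x)      # one index per element of x: 1 2 3
seq_len(0)        # an empty sequence, whereas 1:0 gives 1 0
```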

Using rep to replicate a vector

If we want ten copies of the same vector one after the other we can use.

x<-rep(c(1,4,9,23),times=10) 
x
##  [1]  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23  1
## [26]  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23

However we might want each number in the vector replicated ten times before moving to the next. In this case we use each instead of times.

x<-rep(c(1,4,9,23),each=10)
x
##  [1]  1  1  1  1  1  1  1  1  1  1  4  4  4  4  4  4  4  4  4  4  9  9  9  9  9
## [26]  9  9  9  9  9 23 23 23 23 23 23 23 23 23 23
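The times argument can also take a vector, giving a different number of copies of each element:

```r
x <- rep(c(1, 4, 9), times = c(3, 1, 2))
x  # 1 1 1 4 9 9
```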

Logical vectors and subsetting

One of the keys to using base R efficiently is the concept of logical vectors, indices and subsetting. However the concept does take a little effort to get used to. Let’s take it a step at a time. Say we have a vector x which we have set up like this.

x<-seq(-4,10,by=2)
x<-rep(x,times=3)
x
##  [1] -4 -2  0  2  4  6  8 10 -4 -2  0  2  4  6  8 10 -4 -2  0  2  4  6  8 10

Now we can ask a question that will be answered as true or false for each of the numbers. Is the element of x greater than zero?

x>0
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

R replies with a vector stating whether the result is true or false. We can also ask which of the numbers is greater than zero.

which(x>0)
##  [1]  4  5  6  7  8 12 13 14 15 16 20 21 22 23 24

Now R replies with the indices of the elements of the vector.

Subsetting the vector involves the use of the square brackets [ and ]. If you include either a logical vector with TRUE and FALSE values or numeric indices within the square brackets R will subset the vector. Either way works.

x[x>0]
##  [1]  2  4  6  8 10  2  4  6  8 10  2  4  6  8 10
x[which(x>0)]
##  [1]  2  4  6  8 10  2  4  6  8 10  2  4  6  8 10
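Two more subsetting tricks are worth knowing: negative indices drop elements, and logical conditions can be combined with & (and) and | (or). A small standalone sketch:

```r
x <- c(-4, -2, 0, 2, 4, 6)
x[-1]             # drop the first element: -2 0 2 4 6
x[-c(1, 2)]       # drop the first two: 0 2 4 6
x[x > 0 & x < 6]  # both conditions must hold: 2 4
```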

When we move on to handling more than one variable at a time using data frames we will see that the same concept can be used to subset whole blocks of data. It is a very powerful and fundamentally simple way of manipulating data.

A more complex example is given here. This takes every second member of x using a sequence of indices as a subset.

x[seq(2,length(x),by=2)]
##  [1] -2  2  6 10 -2  2  6 10 -2  2  6 10

Could you follow how this worked? Working outwards:

  • length(x) gives the number of elements in x. Let’s call this n.
  • seq(2,length(x),by=2) gives the vector of indices 2, 4, 6, …, n.
  • A subset of x is found using the square brackets.

With experience it is common to wrap up several steps in a single line. There is nothing wrong with explicitly writing

n<-length(x)
ind<-seq(2,n,by=2)
x[ind]
##  [1] -2  2  6 10 -2  2  6 10 -2  2  6 10

Sorting and ordering

We can also sort vectors. There are two ways of doing this. The first is very direct, but I really do not recommend it. It uses the sort command with the argument decreasing=TRUE or FALSE

sort(x,decreasing=T)
##  [1] 10 10 10  8  8  8  6  6  6  4  4  4  2  2  2  0  0  0 -2 -2 -2 -4 -4 -4

That’s simple enough. However it is much better practice in the long run to use order. This needs a little explaining, again we will take it step by step.

order(x,decreasing=T)
##  [1]  8 16 24  7 15 23  6 14 22  5 13 21  4 12 20  3 11 19  2 10 18  1  9 17

Order has not given us the numbers in order! But of course it should not, as sort does that. Instead order has given us the indices in order. Notice that if there are ties, as in this case, R respects the original order for the tied numbers. So how can we use order to sort the vector? If you have followed the logic of using indices to refer to elements of a vector you might have guessed.

x[order(x,decreasing=T)]
##  [1] 10 10 10  8  8  8  6  6  6  4  4  4  2  2  2  0  0  0 -2 -2 -2 -4 -4 -4

Although this involves more typing than simply writing sort, it is more powerful. The power comes from the way indices can be used with many variables at once. When we move on to see data frames this will be clearer. For the moment look at this simple example to see the logic.

x<-c(1,3,2,4)
y<-c("A","B","C","D")
y[order(x,decreasing=F)]
## [1] "A" "C" "B" "D"

Ranks

Finding ranks is a very common procedure in statistics. This is a slightly different problem. We need a way to deal with ties, and there is more than one solution to the problem.

x<-c(1,1,2,3,4,5,5)
rank(x,ties="average")
## [1] 1.5 1.5 3.0 4.0 5.0 6.5 6.5
rank(x,ties="max")
## [1] 2 2 3 4 5 7 7
rank(x,ties="min")
## [1] 1 1 3 4 5 6 6

The last example coincides with the familiar ranks given in sports (joint gold medal followed by the bronze). Notice that there is no decreasing argument to rank. The lowest numbers take the lowest ranks. If you really want to rank performance scores you must reverse them first by, say, subtracting all the scores from the maximum possible.

x<-c(1,1,2,3,4,5,5)
rank(max(x)-x,ties="min") 
## [1] 6 6 5 4 3 1 1
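Incidentally, the full name of the argument is ties.method; the shortened ties used above works through R’s partial matching of argument names. A further option, “first”, breaks ties by their original position in the vector:

```r
x <- c(1, 1, 2, 3, 4, 5, 5)
rank(x, ties.method = "first")  # ties broken by position: 1 2 3 4 5 6 7
```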

Replicating text

When designing a format to hold the results from a planned experiment or field survey it can be very useful to generate replicated sequences of text for the factor levels or grouping variables. This is also very easy.

x<-rep(c("Control","Treatment1","Treatment2"),each=10)
x
##  [1] "Control"    "Control"    "Control"    "Control"    "Control"   
##  [6] "Control"    "Control"    "Control"    "Control"    "Control"   
## [11] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [16] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [21] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"
## [26] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"

Or of course

x<-rep(c("Control","Treatment1","Treatment2"),times=10)
x
##  [1] "Control"    "Treatment1" "Treatment2" "Control"    "Treatment1"
##  [6] "Treatment2" "Control"    "Treatment1" "Treatment2" "Control"   
## [11] "Treatment1" "Treatment2" "Control"    "Treatment1" "Treatment2"
## [16] "Control"    "Treatment1" "Treatment2" "Control"    "Treatment1"
## [21] "Treatment2" "Control"    "Treatment1" "Treatment2" "Control"   
## [26] "Treatment1" "Treatment2" "Control"    "Treatment1" "Treatment2"

Simulating from known distributions

R can simulate data from a very wide range of statistical distributions. For example, to obtain data from a normal distribution with known mean and standard deviation

x<-rnorm(100,mean=100,sd=5)

The base R histogram is a quick way of visualising this.

hist(x)

Count data can be simulated using a Poisson distribution.

x<-rpois(100,lambda=2)
hist(x)

Right skewed data can be simulated using a log normal distribution.

x<-rlnorm(10000,2,0.5)
hist(x)

Data bounded between 0 and 1, such as proportions, can be simulated using a beta distribution.

hist(rbeta(100,5, 5))
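Binary (yes/no) outcomes can be simulated with a binomial distribution. A quick sketch, using an assumed probability of 0.3 for illustration:

```r
set.seed(1)  # make the simulation reproducible
x <- rbinom(100, size = 1, prob = 0.3)  # 100 zero/one outcomes
table(x)  # counts of 0s and 1s, roughly in a 70:30 ratio
```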

A simple simulated data set

Let’s simulate a simple measurement-based survey. Say we have decided to look at the difference between heights of male and female students at BU. Our simple survey will be based on a random sample of ten male and ten female students. Let’s make up some data that will have properties that are similar to the empirical data that we might obtain.

We will first set up a categorical variable.

gender<-rep(c("Male","Female"),each=10)
gender
##  [1] "Male"   "Male"   "Male"   "Male"   "Male"   "Male"   "Male"   "Male"  
##  [9] "Male"   "Male"   "Female" "Female" "Female" "Female" "Female" "Female"
## [17] "Female" "Female" "Female" "Female"

If we ask R about the data structure it will tell us that we have a character vector.

str(gender)
##  chr [1:20] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" ...

In statistics categorical variables are referred to as factors. Factors are special character vectors with numbered levels. In older versions of R (before 4.0) any column in a data file containing non-numeric values was automatically converted to a factor on import; current versions keep such columns as plain character vectors. Either way, in this case we need to tell R to turn the character vector into a factor.

gender<-as.factor(gender)
str(gender)
##  Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...

Now R tells us that gender is a factor with two levels. There is no intrinsic order to the levels. R places them in alphabetical order by default. The important element to be aware of is that gender is a variable. It varies in a known and specific way (it has two levels), but it is a variable all the same. You cannot calculate means, medians, standard deviations or any similar summary statistics using a factor. However you can, and will, use them together with numerical variables in many types of analysis.
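Two useful functions for inspecting a factor are levels and table:

```r
gender <- as.factor(rep(c("Male", "Female"), each = 10))
levels(gender)  # "Female" "Male" -- alphabetical order by default
table(gender)   # number of observations at each level: 10 and 10
```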

Let’s produce a numerical variable to go alongside this. In the UK the mean height of men is around 176 cm and women 164 cm. So we could produce a vector with these values using this code. What we need to do is to replicate the “expected value” for each gender to be placed alongside the factor level.

height<-rep(c(176,164),each=10)
height
##  [1] 176 176 176 176 176 176 176 176 176 176 164 164 164 164 164 164 164 164 164
## [20] 164
str(height)
##  num [1:20] 176 176 176 176 176 176 176 176 176 176 ...

So now we have another numerical variable. However if we really carried out a survey of people’s heights we would be absolutely amazed if we got data like this. Although the numbers represent an estimate of the expected value for each of the ten men and women, we know from experience that people’s heights vary around this value. In fact they vary quite a lot: the standard deviation is around 6 cm. So we need to add some variability. This demonstrates clearly how statistical procedures work. Data consist of two components. There is an underlying structure that we typically want to find out about; in this case a difference in heights which is related to gender. There is also random variability, sometimes called the stochastic component. We are usually less interested in this directly, but we need to be very careful to make justifiable assumptions about this variability in order to correctly analyse the data.

Knowing this we can now make our variable more realistic by adding in some simulated values taken from a normal distribution with this standard deviation.

set.seed(1)
height<-height+rnorm(20,sd=6)
height
##  [1] 172.2413 177.1019 170.9862 185.5717 177.9770 171.0772 178.9246 180.4299
##  [9] 179.4547 174.1677 173.0707 166.3391 160.2726 150.7118 170.7496 163.7304
## [17] 163.9029 169.6630 168.9273 167.5634

We may want to round the values to one decimal place to make them equivalent to the sort of measurements we might make in practice.

height<-round(height,1)

Now we have two variables that are held in R’s memory. We are assuming that they both form part of a simulated data set that we could have obtained if we had measured a stratified random sample of twenty students consisting of ten men and ten women. Let’s call the survey “hsurvey” and make what is known as a data frame to hold the results.

hsurvey<-data.frame(gender,height)
str(hsurvey)
## 'data.frame':    20 obs. of  2 variables:
##  $ gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ height: num  172 177 171 186 178 ...

So we have now made up data that has the same structure and similar properties to the results we would get from carrying out a simple survey of people’s heights. The basic concepts used to do this can be extended to more complex situations and used to help the design of experiments.

A data frame is the basis of almost all statistical analysis. It consists of two or more columns that line up in such a way that the measurements or observations recorded have been taken from the same individual member of a sample. This is often equivalent to a single sheet in a spreadsheet. R tells us that we have one factor and one numerical variable in this case. We might have measured many other variables at the same time and produced a much wider set of data. These may have been either categorical or numerical. Providing that they all “line up” and correspond to the same individual we have our data frame. Many standard statistical techniques use two variables together. Later on in the course you will see how we can analyse data consisting of more than two variables using multivariate analysis.
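A few quick ways of inspecting any data frame, sketched here on a small made-up frame with the same structure as hsurvey:

```r
d <- data.frame(gender = as.factor(rep(c("Male", "Female"), each = 10)),
                height = round(rnorm(20, mean = 170, sd = 6), 1))
head(d)     # the first six rows
dim(d)      # number of rows and columns: 20 2
summary(d)  # counts for the factor, quartiles for the numbers
```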

The complete data frame is shown below.

library(dplyr)
library(aqm)
dt(hsurvey)

If you want to remove the variables you set up along the way from memory you could type the following. Notice that a few other objects created earlier in the session (ind, n and y) still remain.

remove(x,logx,height,gender)
ls()
## [1] "hsurvey" "ind"     "n"       "y"

Now if we ask R to print height or gender it will not find them. They have been placed within the data frame.

height
## Error in eval(expr, envir, enclos): object 'height' not found
gender
## Error in eval(expr, envir, enclos): object 'gender' not found

We can refer to the variables by writing out the full name of the data frame and the variable we want, separated by “$”

hsurvey$height
##  [1] 172.2 177.1 171.0 185.6 178.0 171.1 178.9 180.4 179.5 174.2 173.1 166.3
## [13] 160.3 150.7 170.7 163.7 163.9 169.7 168.9 167.6
hsurvey$gender
##  [1] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
## [11] Female Female Female Female Female Female Female Female Female Female
## Levels: Female Male
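The same square-bracket subsetting we used on vectors works on whole data frames, with a comma separating the row and column indices. A small standalone sketch:

```r
d <- data.frame(gender = rep(c("Male", "Female"), each = 3),
                height = c(176, 180, 172, 164, 160, 166))
d[d$gender == "Female", ]           # only the female rows
mean(d$height[d$gender == "Male"])  # mean male height: 176
```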

Saving and loading data frames

Now that we have made up some data we might want to save it in the format that we will eventually use to capture our real data. The simplest, most portable data format is a CSV (Comma Separated Values) file. Such a file can be easily read by all software.

First we must find a place to store the data. The standard practice is to make a separate data directory (folder) for each analysis that you carry out. You then set this as the working directory using the menu options in the R console. Once you have set your working directory (which could be on a USB stick for portability) you can save the data that is held in memory using an R command.

#write.csv(hsurvey,file="hsurvey.csv",row.names=FALSE)

The command has three elements. The first is the name of the data frame that you wish to save. This is not quoted. The second is the name of the file into which you are going to save the data. The third tells R not to add row names to the file (these are not usually necessary).

Data frames are the standard input to all statistical analysis and are of a consistent form. Many different sorts of data summaries and tables can be generated from a data frame quickly by statistical software.

You can check the working directory using this command.

getwd()
## [1] "/home/rstudio/webpages/books/AQM_2024"

To see a list of files in the directory type dir()

dir()
## [1] "hsurvey.csv"

If we remove the data frame we formed, it too disappears from R’s memory.

remove(hsurvey)
ls()

The easy way to remove all objects from memory is to use the Environment tab in RStudio.

To load the data back from the file type

hsurvey<-read.csv("hsurvey.csv")
str(hsurvey)

Histogram of simulated data

library(ggplot2)
g0<-ggplot(hsurvey,aes(x=height))
g0+geom_histogram(fill="grey",colour="black",binwidth=3)+facet_wrap(~gender,nrow=2)

So we have simulated some data, saved it, and reloaded the data into R. We have then looked at it as a histogram.

Boxplot using ggplot

bplot<-ggplot(hsurvey,aes(x=gender,y=height))
bplot +geom_boxplot()

Confidence interval plot

bplot + 
  stat_summary(fun=mean,geom="point") + 
  stat_summary(fun.data=mean_cl_normal,geom="errorbar")

Summarising using dplyr

hsurvey %>% 
  group_by(gender) %>% 
  summarise(n=n(),mean=round(mean(height),2),sd=round(sd(height),2)) %>% 
  dt()

Statistical test

The simplest possible statistical test we could use on these data would be an unpaired t-test.

t.test(hsurvey$height~hsurvey$gender)
## 
##  Welch Two Sample t-test
## 
## data:  hsurvey$height by hsurvey$gender
## t = -4.4996, df = 16.471, p-value = 0.00034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.62611  -5.99389
## sample estimates:
## mean in group Female   mean in group Male 
##               165.49               176.80
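The result of t.test is itself a data object, so it can be stored and its components extracted with $. A sketch using freshly simulated heights:

```r
set.seed(1)
male <- rnorm(10, mean = 176, sd = 6)
female <- rnorm(10, mean = 164, sd = 6)
tt <- t.test(male, female)
tt$p.value    # the p-value on its own
tt$conf.int   # the 95% confidence interval for the difference
tt$estimate   # the two sample means
```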

Simulating a regression

It is very easy to simulate data with the structure of a regression equation.

\(y=a+bx+\epsilon\) where \(\epsilon\sim N(0,\sigma^{2})\)

x<-1:10
a<-5
b<-2
sig<-4

y<-a+b*x+rnorm(length(x),sd=sig)
d<-data.frame(x,y)
ggplot(d,aes(x=x,y=y))+ geom_point() +geom_smooth(method="lm")
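Fitting a linear regression to data simulated in this way should recover estimates close to the values of a and b that we set:

```r
set.seed(1)
x <- 1:10
y <- 5 + 2 * x + rnorm(length(x), sd = 4)  # true intercept 5, true slope 2
mod <- lm(y ~ x)
coef(mod)  # estimated intercept and slope, roughly 5 and 2
```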

Simulating Likert responses

I have added a simple function to the aqm package that will simulate some Likert scale data.

q1<-aqm::rand_likert(n=100,p=c(1,1,1,2,4))
q2<-aqm::rand_likert(n=100,p=c(4,2,1,1,1))
d<-data.frame(id=1:100,q1,q2) ## Note the wide format

## Make a data frame

df<-tidyr::pivot_longer(d, cols=-id,names_to="question",values_to = "response") 
## Long format
str(d$q1)
##  Ord.factor w/ 5 levels "Strongly disagree"<..: 1 5 4 5 4 5 4 3 5 2 ...
df %>% group_by(question, response) %>% summarise (n=n())
## # A tibble: 10 × 3
## # Groups:   question [2]
##    question response              n
##    <chr>    <ord>             <int>
##  1 q1       Strongly disagree     9
##  2 q1       Disagree              8
##  3 q1       Neutral              11
##  4 q1       Agree                31
##  5 q1       Strongly agree       41
##  6 q2       Strongly disagree    44
##  7 q2       Disagree             22
##  8 q2       Neutral              10
##  9 q2       Agree                14
## 10 q2       Strongly agree       10

Simulating spatially explicit data

library(giscourse)
library(sf)
library(mapview)
conn<-connect()
arne<-sssi(dist=1000)
points<-st_sample(arne,100)
mapview(arne) + mapview(points)
disconnect()
## [1] TRUE

Exercises

To practise putting these commands together to simulate some data, try the following exercises.

  1. A researcher is interested in looking at the differences in diameters of trees in three different woodland sites in the New Forest. At each site there are several different species. In order to simplify the task we will consider only two types of trees … conifers and broadleaves. We will also simplify the exercise by assuming that the same number of trees (50) are sampled in each woodland.

Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.

  2. Simulate a questionnaire in which participants’ responses differ by gender.