Using R to simulate data

If you understand data structures you can usually choose and run the right analysis. If you do not consider the basic properties of your data then not only will you not know which analysis to use, but you will probably not even be able to import the data into any software in order to run it!

Simple R programming

Try running some simple commands by stepping through them in code chunks.

1+1
## [1] 2

Experiment some more. For example ..

5*10
## [1] 50
12/4
## [1] 3
3^2
## [1] 9

There is a lot more maths functionality built into the R language that you can find out about as you go on. However to follow the primer you do not need to learn any more commands than are shown in this document.

Note that when you type maths in the console the output is just printed to the screen. This is not stored anywhere in the computer’s memory. So we need to bring in the concept of data objects. A data object “holds” the numbers and can be manipulated by commands. This is how R manages to run analyses on your data. When you import a file containing data it will be stored into a data object.

The simplest data object is a variable. If a variable only contains one number it is known formally as a scalar. A variable with many numbers is a vector. So let’s assign a single value to a variable to form a scalar. We do that in R using the <- operator which is typed as a combination of the less than sign < with a horizontal bar -.

x<-5

Nothing appears to have happened! However you now have your first data object, a scalar called x. This is stored in memory. So try this.

x*2
## [1] 10

Notice that this would not have worked if you had typed X*2 as R is case sensitive.

So, how do we form a vector with multiple values? When you are analysing your own data the answer is that you usually won’t need to. You import all the values from a file. But if you wish to form a vector in the console you must use the concatenation operator “c”. So this gives the variable x a series of values.

x<-c(1,4,5,9,10,11,12,34,56,67,100,123,45)

Now see what happens if you type

x*2
##  [1]   2   8  10  18  20  22  24  68 112 134 200 246  90

You can carry out any sort of mathematical operation that involves x and all the values in the vector will be used. Notice that the results are just printed out to the console and lost.

If you want to assign the results to a new variable you use the “<-” operator again. This is a very common practice when analysing data. So, say you are intending to work with the natural logarithm of x you might write.

logx<-log(x)

You can see that this has worked by writing the new variable name so that R prints out the contents to the console.

logx
##  [1] 0.000000 1.386294 1.609438 2.197225 2.302585 2.397895 2.484907
##  [8] 3.526361 4.025352 4.204693 4.605170 4.812184 3.806662

This time you can see more clearly the purpose of the indices in the output. The second line starts with the 12th number in the vector.

You can find the names of the data objects that are held in memory by typing ls().

ls()
## [1] "logx" "x"

Data structures

Now we can begin looking at data structures. You can ask R to tell you about the structure of any object that it has in memory by typing str().

str(x)
##  num [1:13] 1 4 5 9 10 11 12 34 56 67 ...
str(logx)
##  num [1:13] 0 1.39 1.61 2.2 2.3 ...

So R has told us that both x and logx are numerical vectors. If you have a lot of data you probably do not want to look at all of it at once.

How does this relate to choosing an analysis? We have seen that this particular variable contains numbers. However in statistical analysis we also use “variables”" that don’t consist of numbers. They vary in another respect. If you look at any introductory statistical textbook you will see a chapter that defines types of variables in terms of interval, ordinal and scale variables and nominal and ordinal variables.

The sort of analysis that you will be able to run is determined by the sort of variable, or combination of variables, that you are working with.

Let’s set up a categorical variable. Experienced users of R often want to replicate values in order to set up some test data as Dytham suggests. So R has a range of functions for making data. The rep function replicates the values.

gender<-rep(c("Male","Female"),each=10)
gender
##  [1] "Male"   "Male"   "Male"   "Male"   "Male"   "Male"   "Male"  
##  [8] "Male"   "Male"   "Male"   "Female" "Female" "Female" "Female"
## [15] "Female" "Female" "Female" "Female" "Female" "Female"

Now if we ask R about the data structure it will tell us that we have a character vector.

str(gender)
##  chr [1:20] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" ...

In statistics categorical variables are referred to as factors. Factors are special character vectors with numbered levels. R automatically assumes that any column in a data file that contains non numeric values is a factor and converts the data to that form. In this case we need to tell R to turn the character vector into a factor.

gender<-as.factor(gender)
str(gender)
##  Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...

Now R tells us that gender is a factor with two levels. There is no intrinsic order to the levels. R places them in alphabetical order by default. The important element to be aware of is that gender is a variable. It varies in a known and specific way (it has two levels), but it is a variable all the same. You cannot calculate means, medians, standard deviations or any similar summary statistics using a factor. However you can, and will, use them together with numerical variables in many types of analysis.

Let’s produce a numerical variable to go alongside this. In the UK the mean height of men is around 176 cm and women 164 cm. So we could produce a vector with these values using this code. What we need to do is to replicate the “expected value” for each gender to be placed alongside the factor level.

height<-rep(c(176,164),each=10)
height
##  [1] 176 176 176 176 176 176 176 176 176 176 164 164 164 164 164 164 164
## [18] 164 164 164
str(height)
##  num [1:20] 176 176 176 176 176 176 176 176 176 176 ...

So now we have another numerical variable. However if we really carried out a survey of people’s heights we would be absolutely amazed if we got data like this. Although the numbers represent an estimate of the expected value for each of the ten men and women we know from experience that people’s heights vary around this value. In fact they vary quite a lot. The standard deviation is around 6cm. So we need to add some variability. This demonstrates clearly how statistical procedures work. Data consists of two components. It has an underlying structure that we typically want to find out about. In this case there is a difference in heights which is related to gender. There is also random variability, somtimes called the stochastic component. We are usually less interested in this directly, but we need to be very careful to make justifiable assumptions about this variability in order to correctly analyse the data.

Knowing this we can now make our variable more realistic by adding in some simulated values taken from a normal distribution with this standard deviation.

set.seed(1)
height<-height+rnorm(20,sd=6)
height
##  [1] 172.2413 177.1019 170.9862 185.5717 177.9770 171.0772 178.9246
##  [8] 180.4299 179.4547 174.1677 173.0707 166.3391 160.2726 150.7118
## [15] 170.7496 163.7304 163.9029 169.6630 168.9273 167.5634

We may want to round the values to one decimal place to make them equivalent to the sort of measurements we might make in practice.

height<-round(height,1)

Now we have two variables that are held in R’s memory. We are assuming that they both form part of a simulated data set that we could have obtained if we had measured a stratified random sample of twenty students consisting of ten men and ten women. Let’s call the survey “hsurvey”" and make what is known as a data frame to hold the results.

hsurvey<-data.frame(gender,height)
str(hsurvey)
## 'data.frame':    20 obs. of  2 variables:
##  $ gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ height: num  172 177 171 186 178 ...

So we have now made up data that has the same structure and similar properties to the results we would get from carrying out a simple survey of people’s heights. The basic concepts used to do this can be extended to more complex situations and used to help the design of experiments.

A data frame is the basis of almost all statistical analysis. It consists of two or more columns that line up in such a way that the measurements or observations recorded have been taken from the same individual member of a sample. This is often equivalent to a single sheet in a spreadsheet. R tells us that we have one factor and one numerical variable in this case. We might have measured many other variables at the same time and produced a much wider set of data. These may have been either categorical or numerical. Providing that they all “line up”" and correspond to the same individual we have our data frame. Many standard statistical techniques use two variables together. Later on in the course you will see how we can analyse data consisting of more than two variables using multivariate analysis.

The complete data frame is shown below.

hsurvey
##    gender height
## 1    Male  172.2
## 2    Male  177.1
## 3    Male  171.0
## 4    Male  185.6
## 5    Male  178.0
## 6    Male  171.1
## 7    Male  178.9
## 8    Male  180.4
## 9    Male  179.5
## 10   Male  174.2
## 11 Female  173.1
## 12 Female  166.3
## 13 Female  160.3
## 14 Female  150.7
## 15 Female  170.7
## 16 Female  163.7
## 17 Female  163.9
## 18 Female  169.7
## 19 Female  168.9
## 20 Female  167.6

If you want to remove all the other variables you set up from memory and leave just the data frame you could type.

remove(x,logx,height,gender)
ls()
## [1] "hsurvey"

Now if we ask R to print height or gender it will not find them. They have been placed within the data frame.

height
## Error in eval(expr, envir, enclos): object 'height' not found
gender
## Error in eval(expr, envir, enclos): object 'gender' not found

We can refer to the variables by writing out the full name of the data frame and the variable we want, separated by “$”

hsurvey$height
##  [1] 172.2 177.1 171.0 185.6 178.0 171.1 178.9 180.4 179.5 174.2 173.1
## [12] 166.3 160.3 150.7 170.7 163.7 163.9 169.7 168.9 167.6
hsurvey$gender
##  [1] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
## [11] Female Female Female Female Female Female Female Female Female Female
## Levels: Female Male

Or we can attach the data frame. This makes the variables available directly. Notice that if we have several data frames containing variables with the same name this could cause confusion (so I don’t usually do this)

attach(hsurvey)
height
##  [1] 172.2 177.1 171.0 185.6 178.0 171.1 178.9 180.4 179.5 174.2 173.1
## [12] 166.3 160.3 150.7 170.7 163.7 163.9 169.7 168.9 167.6
gender
##  [1] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
## [11] Female Female Female Female Female Female Female Female Female Female
## Levels: Female Male

Saving and loading data frames

Now that we have made up some data we might want to save it in the format that we will eventually use to capture our real data. The simplest, most portable data format is a CSV (Comma Separated Variable) file. Such a file can be easily read by all software.

First we must find a place to store the data. The standard practice is to make a separate data directory (folder) for each analysis that you carry out. You then set this as the working directory using the menu options in the R console. Once you have set your working directory (which could be on a USB stick for portability) you can save the data that is held in memory using an R command.

write.csv(hsurvey,file="hsurvey.csv",row.names=FALSE)

The command has three elements. The first is the name of the data frame that you wish to save. This is not quoted. The second is the name of the file into which you are going to save the data. The third tells R not to add row names to the file (these are not usually necessary).

Data frames are the standard input to all statistical analysis and are of a consistent form. Many different sorts of data summaries and tables can be generated from a data frame quickly by statistical software.

You can check the working directory using this command.

getwd()
## [1] "/home/rstudio/webpages/Quantitative_and_Spatial_Analysis/Week_8"
#In my case I am running R on Linux so the paths look different to Windows

To see a list of files in the directory type dir()

dir()
##[1] hsurvey.csv

If we remove the dataframe we formed we will be left with nothing in R’s memory

remove(hsurvey)
ls()

To load it back from the file type

hsurvey<-read.csv("hsurvey.csv")
str(hsurvey)

Histogram of simulated data

library(ggplot2)
g0<-ggplot(hsurvey,aes(x=height))
g0+geom_histogram(fill="grey",colour="black",binwidth=3)+facet_wrap(~gender,nrow=2)

So we have simulated some data, saved it, and reloaded the data into R. We have then looked at it as a histogram.

Saving a workspace

You can save the

save(list=ls(),file="height_study.rda")

Installing packages

Many of the more advanced features of R have been implemented within add on packages. These are held online at numerous mirror sites throughout the world. Packages are what makes R the important research tool that it is. Notice that the top articles from Methods in Ecology and Evolution present recently built R packages.

In order to use packages you have to install them to your hard disk. You do this once, and need to be online in order to download them. Once installed you make a package available at the start of a session using the command ``library’’. All the most useful packages have been installed already on the PCs in the lab. You may want to add some packages to your laptop. This can be done either using the menu options at the top of the console or by writing a command (if you know the names of the packages you need).

For example the following line will install the package vegan. This provides access to a large range of useful methods for analysing biodiversity and community data.

install.packages("vegan")

Once installed you can use vegan by typing

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-2

During the course I will tell you which packages are most useful, and what they are useful for. I will also show you how to get help on functions in order to work independently. For now, just be aware that if R tells you that “there is no package called vegan”" if you type “library(vegan)” it means that the package has not yet been downloaded. You would then need to install it first.

More vector functions

As we have seen functions act on the whole vector at once. This applies also to functions that return only one number as a result.

x<-c(1,3,5,7,4,2,7)
sum(x)
## [1] 29
mean(x)
## [1] 4.142857
length(x)
## [1] 7
var(x)
## [1] 5.47619
sd(x)
## [1] 2.340126

It is very important to realise that a vector with NA values can cause problems for these sorts of functions. You need to tell R explicitly to remove the NAs. If not the result itself will be an NA. For example.

x<-c(1,2,3,4,5,6)
mean(x)
## [1] 3.5
x<-c(1,2,3,4,5,NA)
mean(x)
## [1] NA
mean(x,na.rm=T)
## [1] 3

This is a very common pitfall for beginners. In general, if something does not work as you expect look for NAs!

Generating sequences of numbers in R

One of the most useful features of R is the ease with which you can generate sequences of numbers and simulated data sets. To produce a sequence you can use the : syntax.

x<-0:100
0:100
##   [1]   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
##  [18]  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33
##  [35]  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50
##  [52]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
##  [69]  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
##  [86]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
x<-30:10
x
##  [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10

A longer but more flexible way of producing a sequence is with seq. For example to produce even numbers between 0 and 100.

x<-seq(0,100,by=2)
x
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32
## [18]  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66
## [35]  68  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100

Say we know the length of the sequence we want but have not worked out the intervals.

x<-seq(0,100,length=23) 
x
##  [1]   0.000000   4.545455   9.090909  13.636364  18.181818  22.727273
##  [7]  27.272727  31.818182  36.363636  40.909091  45.454545  50.000000
## [13]  54.545455  59.090909  63.636364  68.181818  72.727273  77.272727
## [19]  81.818182  86.363636  90.909091  95.454545 100.000000

Using rep to replicate a vector

If we want ten copies of the same vector one after the other we can use.

x<-rep(c(1,4,9,23),times=10) 
x
##  [1]  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9
## [24] 23  1  4  9 23  1  4  9 23  1  4  9 23  1  4  9 23

However we might want each number in the vector replicated ten times before moving to the next. In this case we use each instead of times.

x<-rep(c(1,4,9,23),each=10)
x
##  [1]  1  1  1  1  1  1  1  1  1  1  4  4  4  4  4  4  4  4  4  4  9  9  9
## [24]  9  9  9  9  9  9  9 23 23 23 23 23 23 23 23 23 23

Replicating text

When designing a format to hold the results from a planned experiment or field survey it can be very useful to generate replicated sequences of text for the factor levels or grouping variables. This is also very easy.

x<-rep(c("Control","Treatment1","Treatment2"),each=10)
x
##  [1] "Control"    "Control"    "Control"    "Control"    "Control"   
##  [6] "Control"    "Control"    "Control"    "Control"    "Control"   
## [11] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [16] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [21] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"
## [26] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"

Or of course

x<-rep(c("Control","Treatment1","Treatment2"),times=10)
x
##  [1] "Control"    "Treatment1" "Treatment2" "Control"    "Treatment1"
##  [6] "Treatment2" "Control"    "Treatment1" "Treatment2" "Control"   
## [11] "Treatment1" "Treatment2" "Control"    "Treatment1" "Treatment2"
## [16] "Control"    "Treatment1" "Treatment2" "Control"    "Treatment1"
## [21] "Treatment2" "Control"    "Treatment1" "Treatment2" "Control"   
## [26] "Treatment1" "Treatment2" "Control"    "Treatment1" "Treatment2"

Logical vectors and subsetting

One of the keys to using R efficiently is the concept of logical vectors, indices and subsetting. However the concept does take a little effort to get used to. Let’s take it a step at a time. Say we have a vector x which we have setup like this.

x<-seq(-4,10,by=2)
x<-rep(x,times=3)
x
##  [1] -4 -2  0  2  4  6  8 10 -4 -2  0  2  4  6  8 10 -4 -2  0  2  4  6  8
## [24] 10

Now we can ask a question that will be answered as true or false for each of the numbers. Is the element of x greater than zero?

x>0
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE

R replies with a vector stating whether the result is true or false. We can also ask which of the numbers is greater than zero.

which(x>0)
##  [1]  4  5  6  7  8 12 13 14 15 16 20 21 22 23 24

Now R replies with the indices of the elements of the vector.

Subsetting the vector involves the use of the square brackets {[} and {]}. If you include either a logical vector with TRUE and FALSE values or numeric indices within the square brackets R will subset the vector. Either way works.

x[x>0]
##  [1]  2  4  6  8 10  2  4  6  8 10  2  4  6  8 10
x[which(x>0)]
##  [1]  2  4  6  8 10  2  4  6  8 10  2  4  6  8 10

When we move on to handling more than one variable at a time using data frames we will see that the same concept can be used to subset whole blocks of data.It is a very powerful and fundamentally simple way of manipulating data.

A more complex example is given here. This takes every second member of x using a sequence of indices as a subset.

x[seq(2,length(x),by=2)]
##  [1] -2  2  6 10 -2  2  6 10 -2  2  6 10

Could you follow how this worked? Working outwards

  • length(x) gives the number of elements in x. Lets call this n
  • seq(0,length(x),by=2) gives the vector of indices 2,4,6,….n.
  • A subset of x is found using the square brackets.

With experience it is common to wrap up several steps in a single line. There is nothing wrong with explicitly writing

n<-length(x)
ind<-seq(2,n,by=2)
x[ind]
##  [1] -2  2  6 10 -2  2  6 10 -2  2  6 10

Sorting and ordering

We can also sort vectors. There are two ways of doing this. The first is very direct, but I really do not recommend it. It uses the sort command with the argument decreasing=TRUE or FALSE

sort(x,decreasing=T)
##  [1] 10 10 10  8  8  8  6  6  6  4  4  4  2  2  2  0  0  0 -2 -2 -2 -4 -4
## [24] -4

That’s simple enough. However it is much better practice in the long run to use order. This needs a little explaining, again we will take it step by step.

order(x,decreasing=T)
##  [1]  8 16 24  7 15 23  6 14 22  5 13 21  4 12 20  3 11 19  2 10 18  1  9
## [24] 17

Order has not given us the numbers in order! But of course it should not, as sort does that. Instead order has given us the the indices in order. Notice that if there are ties, as in this case, R respects the original order for the tied numbers. So how can we use order to sort the vector? If you have followed the logic of using indices to refer to elements of a vector you might have guessed.

x[order(x,decreasing=T)]
##  [1] 10 10 10  8  8  8  6  6  6  4  4  4  2  2  2  0  0  0 -2 -2 -2 -4 -4
## [24] -4

Although this involves more typing than simply writing sort, it is more powerful. The power comes from the way indices can be used with many variables at once. When we move on to see data frames this will be clearer. For the moment look at this simple example to see the logic.

x<-c(1,3,2,4)
y<-c("A","B","C","D")
y[order(x,decreasing=F)]
## [1] "A" "C" "B" "D"

Ranks

Finding ranks is a very common procedure in statistics. This is a slightly different problem. We need a way to deal with ties, and there is more than one solution to the problem.

x<-c(1,1,2,3,4,5,5)
rank(x,ties="average")
## [1] 1.5 1.5 3.0 4.0 5.0 6.5 6.5
rank(x,ties="max")
## [1] 2 2 3 4 5 7 7
rank(x,ties="min")
## [1] 1 1 3 4 5 6 6

The last example coincides with the familiar ranks given in sports (joint gold medal followed by the bronze). Notice that there is no decreasing argument to ranks. The lowest numbers take the lowest ranks. If you really want to rank performance scores you must reverse them first by, say, subtracting all the scores from the maximum possible.

x<-c(1,1,2,3,4,5,5)
rank(max(x)-x,ties="min") 
## [1] 6 6 5 4 3 1 1

Exercise

In order to practice putting together these commands in order to simulate some data try this.

A researcher is interested in looking at the differences in diameters of trees in three different woodland sites in the New Forest. At each site there are several different species. In order to simplify the task we will consider only two types of trees … conifers and broadleaves. We will also simplify the exercise by assuming that the same number of trees (50) are sampled in each woodland.

Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.