During the course of this unit you will develop the skills to be able to apply R code to the analysis of a range of data sets. Although crib sheets will be provided with model code for all the analytical techniques shown on the course, in order to manipulate data effectively and apply novel techniques it is important to begin to become more comfortable with basic R programming techniques and concepts. This week we will revise some of the fundamentals of R programming.
Note that dplyr and the “tidyverse” set of functions now make programming large data sets much more agile and elegant than would be the case if we only used the original base R language. However many of the functions that you will find in packages are written using base R. In order to understand other people’s code you need some familiarity with the standard R language.
If you understand data structures you can usually choose and run the right analysis. If you do not consider the basic properties of your data then not only will you not know which analysis to use, but you will probably not even be able to import the data into any software in order to run it!
If you are still completely new to R try running some simple commands by stepping through them in code chunks. When working in RStudio notice that the results are shown under the code chunk when you click on the green arrrow. If you are comfortable with basic R then skip this section.
1+1
## [1] 2
Experiment some more. For example ..
5*10
## [1] 50
12/4
## [1] 3
3^2
## [1] 9
There is a lot more maths functionality built into the R language that you can find out about as you go on. However to follow the primer you do not need to learn any more commands than are shown in this document.
Note that when you type maths in the console the output is just printed to the screen. This is not stored anywhere in the computer’s memory. So we need to bring in the concept of data objects. A data object “holds” the numbers and can be manipulated by commands. This is how R manages to run analyses on your data. When you import a file containing data it will be stored into a data object.
The simplest data object is a variable. If a variable only contains one number it is known formally as a scalar. A variable with many numbers is a vector. So let’s assign a single value to a variable to form a scalar. We do that in R using the <- operator which is typed as a combination of the less than sign < with a horizontal bar -.
x<-5
Nothing appears to have happened! However you now have your first data object, a scalar called x. This is stored in memory. So try this.
x*2
## [1] 10
Notice that this would not have worked if you had typed X*2 as R is case sensitive.
So, how do we form a vector with multiple values? When you are analysing your own data the answer is that you usually won’t need to. You import all the values from a file. But if you wish to form a vector in the console you must use the concatenation operator “c”. So this gives the variable x a series of values.
x<-c(1,4,5,9,10,11,12,34,56,67,100,123,45)
Now see what happens if you type
x*2
## [1] 2 8 10 18 20 22 24 68 112 134 200 246 90
You can carry out any sort of mathematical operation that involves x and all the values in the vector will be used. Notice that the results are just printed out to the console and lost.
If you want to assign the results to a new variable you use the “<-” operator again. This is a very common practice when analysing data. So, say you are intending to work with the natural logarithm of x you might write.
logx<-log(x)
log
## function (x, base = exp(1)) .Primitive("log")
You can see that this has worked by writing the new variable name so that R prints out the contents to the console.
logx
## [1] 0.000000 1.386294 1.609438 2.197225 2.302585 2.397895 2.484907
## [8] 3.526361 4.025352 4.204693 4.605170 4.812184 3.806662
This time you can see more clearly the purpose of the indices in the output. The second line starts with the 12th number in the vector.
You can find the names of the data objects that are held in memory by typing ls(). In Rstudio it is easier to look in the Environment tab in order to see the objects in working memory.
ls()
## [1] "logx" "x"
Now we can begin looking at data structures. You can ask R to tell you about the structure of any object that it has in memory by typing str().
str(x)
## num [1:13] 1 4 5 9 10 11 12 34 56 67 ...
str(logx)
## num [1:13] 0 1.39 1.61 2.2 2.3 ...
So R has told us that both x and logx are numerical vectors. If you have a lot of data you probably do not want to look at all of it at once.
How does this relate to choosing an analysis? We have seen that this particular variable contains numbers. However in statistical analysis we also use “variables”" that don’t consist of numbers. They vary in another respect. If you look at any introductory statistical textbook you will see a chapter that defines types of variables in terms of interval, ordinal and scale variables and nominal and ordinal variables.
As we have seen functions act on the whole vector at once. This applies also to functions that return only one number as a result.
x<-c(1,3,5,7,4,2,7)
sum(x)
## [1] 29
mean(x)
## [1] 4.142857
length(x)
## [1] 7
var(x)
## [1] 5.47619
sd(x)
## [1] 2.340126
se<-function(x)sd(x)/(sqrt(length(x)))
se(x)
## [1] 0.8844846
It is very important to realise that a vector with NA values can cause problems for these sorts of functions. You need to tell R explicitly to remove the NAs. If not the result itself will be an NA. For example.
x<-c(1,2,3,4,5,6)
mean(x)
## [1] 3.5
x<-c(1,2,3,4,5,NA)
mean(x)
## [1] NA
mean(x,na.rm=T)
## [1] 3
This is a very common pitfall for beginners. In general, if something does not work as you expect look for NAs!
One of the most useful features of R is the ease with which you can generate sequences of numbers and simulated data sets. To produce a sequence you can use the : syntax.
x<-0:100
x
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## [18] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## [35] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [52] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
## [69] 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## [86] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
x<-30:10
x
## [1] 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
A longer but more flexible way of producing a sequence is with seq. For example to produce even numbers between 0 and 100.
x<-seq(0,100,by=5)
x
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
## [18] 85 90 95 100
Say we know the length of the sequence we want but have not worked out the intervals.
x<-seq(0,100,length=23)
x
## [1] 0.000000 4.545455 9.090909 13.636364 18.181818 22.727273
## [7] 27.272727 31.818182 36.363636 40.909091 45.454545 50.000000
## [13] 54.545455 59.090909 63.636364 68.181818 72.727273 77.272727
## [19] 81.818182 86.363636 90.909091 95.454545 100.000000
If we want ten copies of the same vector one after the other we can use.
x<-rep(c(1,4,9,23),times=10)
x
## [1] 1 4 9 23 1 4 9 23 1 4 9 23 1 4 9 23 1 4 9 23 1 4 9
## [24] 23 1 4 9 23 1 4 9 23 1 4 9 23 1 4 9 23
However we might want each number in the vector replicated ten times before moving to the next. In this case we use each instead of times.
x<-rep(c(1,4,9,23),each=10)
x
## [1] 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 9 9 9
## [24] 9 9 9 9 9 9 9 23 23 23 23 23 23 23 23 23 23
One of the keys to using base R efficiently is the concept of logical vectors, indices and subsetting. However the concept does take a little effort to get used to. Let’s take it a step at a time. Say we have a vector x which we have setup like this.
x<-seq(-4,10,by=2)
x<-rep(x,times=3)
x
## [1] -4 -2 0 2 4 6 8 10 -4 -2 0 2 4 6 8 10 -4 -2 0 2 4 6 8
## [24] 10
Now we can ask a question that will be answered as true or false for each of the numbers. Is the element of x greater than zero?
x>0
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [12] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [23] TRUE TRUE
R replies with a vector stating whether the result is true or false. We can also ask which of the numbers is greater than zero.
which(x>0)
## [1] 4 5 6 7 8 12 13 14 15 16 20 21 22 23 24
Now R replies with the indices of the elements of the vector.
Subsetting the vector involves the use of the square brackets {[} and {]}. If you include either a logical vector with TRUE and FALSE values or numeric indices within the square brackets R will subset the vector. Either way works.
x[x>0]
## [1] 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
x[which(x>0)]
## [1] 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
When we move on to handling more than one variable at a time using data frames we will see that the same concept can be used to subset whole blocks of data.It is a very powerful and fundamentally simple way of manipulating data.
A more complex example is given here. This takes every second member of x using a sequence of indices as a subset.
x[seq(2,length(x),by=2)]
## [1] -2 2 6 10 -2 2 6 10 -2 2 6 10
Could you follow how this worked? Working outwards
With experience it is common to wrap up several steps in a single line. There is nothing wrong with explicitly writing
n<-length(x)
ind<-seq(2,n,by=2)
x[ind]
## [1] -2 2 6 10 -2 2 6 10 -2 2 6 10
We can also sort vectors. There are two ways of doing this. The first is very direct, but I really do not recommend it. It uses the sort command with the argument decreasing=TRUE or FALSE
sort(x,decreasing=T)
## [1] 10 10 10 8 8 8 6 6 6 4 4 4 2 2 2 0 0 0 -2 -2 -2 -4 -4
## [24] -4
That’s simple enough. However it is much better practice in the long run to use order. This needs a little explaining, again we will take it step by step.
order(x,decreasing=T)
## [1] 8 16 24 7 15 23 6 14 22 5 13 21 4 12 20 3 11 19 2 10 18 1 9
## [24] 17
Order has not given us the numbers in order! But of course it should not, as sort does that. Instead order has given us the the indices in order. Notice that if there are ties, as in this case, R respects the original order for the tied numbers. So how can we use order to sort the vector? If you have followed the logic of using indices to refer to elements of a vector you might have guessed.
x[order(x,decreasing=T)]
## [1] 10 10 10 8 8 8 6 6 6 4 4 4 2 2 2 0 0 0 -2 -2 -2 -4 -4
## [24] -4
Although this involves more typing than simply writing sort, it is more powerful. The power comes from the way indices can be used with many variables at once. When we move on to see data frames this will be clearer. For the moment look at this simple example to see the logic.
x<-c(1,3,2,4)
y<-c("A","B","C","D")
y[order(x,decreasing=F)]
## [1] "A" "C" "B" "D"
Finding ranks is a very common procedure in statistics. This is a slightly different problem. We need a way to deal with ties, and there is more than one solution to the problem.
x<-c(1,1,2,3,4,5,5)
rank(x,ties="average")
## [1] 1.5 1.5 3.0 4.0 5.0 6.5 6.5
rank(x,ties="max")
## [1] 2 2 3 4 5 7 7
rank(x,ties="min")
## [1] 1 1 3 4 5 6 6
The last example coincides with the familiar ranks given in sports (joint gold medal followed by the bronze). Notice that there is no decreasing argument to ranks. The lowest numbers take the lowest ranks. If you really want to rank performance scores you must reverse them first by, say, subtracting all the scores from the maximum possible.
x<-c(1,1,2,3,4,5,5)
rank(max(x)-x,ties="min")
## [1] 6 6 5 4 3 1 1
When designing a format to hold the results from a planned experiment or field survey it can be very useful to generate replicated sequences of text for the factor levels or grouping variables. This is also very easy.
x<-rep(c("Control","Treatment1","Treatment2"),each=10)
x
## [1] "Control" "Control" "Control" "Control" "Control"
## [6] "Control" "Control" "Control" "Control" "Control"
## [11] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [16] "Treatment1" "Treatment1" "Treatment1" "Treatment1" "Treatment1"
## [21] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"
## [26] "Treatment2" "Treatment2" "Treatment2" "Treatment2" "Treatment2"
Or of course
x<-rep(c("Control","Treatment1","Treatment2"),times=10)
x
## [1] "Control" "Treatment1" "Treatment2" "Control" "Treatment1"
## [6] "Treatment2" "Control" "Treatment1" "Treatment2" "Control"
## [11] "Treatment1" "Treatment2" "Control" "Treatment1" "Treatment2"
## [16] "Control" "Treatment1" "Treatment2" "Control" "Treatment1"
## [21] "Treatment2" "Control" "Treatment1" "Treatment2" "Control"
## [26] "Treatment1" "Treatment2" "Control" "Treatment1" "Treatment2"
R can simulate data from a very wide range of statistical distributions. For example, to obtain data from a normal distribution with known mean and standard deviation
#set.seed(3)
x<-rnorm(100,mean=100,sd=5)
x
## [1] 100.30344 96.29100 97.10176 107.93947 107.19202 103.04138 100.41204
## [8] 107.43920 102.07982 94.06930 96.69453 97.05326 107.17393 99.66833
## [15] 102.98536 101.07433 94.68170 99.28451 96.10755 94.58763 94.43264
## [22] 104.91310 85.48325 95.15281 91.83711 104.69468 95.62195 98.77496
## [29] 95.63406 99.97491 99.93658 105.18039 100.80050 102.91213 92.70532
## [36] 95.69861 107.81804 95.50119 97.48623 100.94342 96.65028 96.29317
## [43] 97.26447 97.53198 103.91835 96.62180 100.70242 97.54585 93.84461
## [50] 101.38822 97.40773 98.27474 105.49472 104.93931 103.86474 100.97724
## [57] 100.39074 100.09587 91.46960 105.98903 106.14672 96.78145 88.78748
## [64] 104.68346 91.93805 97.05598 111.58469 103.36470 93.90254 99.11641
## [71] 103.06722 100.29620 109.06518 99.41406 114.69081 104.81202 101.05142
## [78] 104.21799 100.86226 97.47517 98.59091 99.94579 107.82639 95.53632
## [85] 92.80124 108.25171 105.81577 96.69260 100.08763 98.38736 91.70523
## [92] 102.79688 99.06551 100.10597 95.90511 97.51619 104.87773 103.97132
## [99] 112.24093 112.01080
The base R histogram is a quick way of visualising this.
hist(x)
Count data can be simulated using a poisson distribution
x<-rpois(100,lambda=2)
hist(x)
table(x)
## x
## 0 1 2 3 4 5 6
## 11 27 27 22 10 2 1
Right skewed data can be simulated using a log normal distribution.
x<-rlnorm(10000,2,0.5)
hist(x)
Bounded data can be simulated using a beta distribution
hist(rbeta(100,5, 5))
Let’s simulate a simple measurement based survey. Say we have decided to look at the difference between heights of male and female students at BU. Our simple survey will be based on a random sample of ten male and ten female students. Let’s make up some data that will have properties that are similar to the empirical data that we might obtain.
We will first set up a categorical variable.
gender<-rep(c("Male","Female"),each=10)
gender
## [1] "Male" "Male" "Male" "Male" "Male" "Male" "Male"
## [8] "Male" "Male" "Male" "Female" "Female" "Female" "Female"
## [15] "Female" "Female" "Female" "Female" "Female" "Female"
If we ask R about the data structure it will tell us that we have a character vector.
str(gender)
## chr [1:20] "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" ...
In statistics categorical variables are referred to as factors. Factors are special character vectors with numbered levels. R automatically assumes that any column in a data file that contains non numeric values is a factor and converts the data to that form. In this case we need to tell R to turn the character vector into a factor.
gender<-as.factor(gender)
str(gender)
## Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
Now R tells us that gender is a factor with two levels. There is no intrinsic order to the levels. R places them in alphabetical order by default. The important element to be aware of is that gender is a variable. It varies in a known and specific way (it has two levels), but it is a variable all the same. You cannot calculate means, medians, standard deviations or any similar summary statistics using a factor. However you can, and will, use them together with numerical variables in many types of analysis.
Let’s produce a numerical variable to go alongside this. In the UK the mean height of men is around 176 cm and women 164 cm. So we could produce a vector with these values using this code. What we need to do is to replicate the “expected value” for each gender to be placed alongside the factor level.
height<-rep(c(176,164),each=10)
height
## [1] 176 176 176 176 176 176 176 176 176 176 164 164 164 164 164 164 164
## [18] 164 164 164
str(height)
## num [1:20] 176 176 176 176 176 176 176 176 176 176 ...
So now we have another numerical variable. However if we really carried out a survey of people’s heights we would be absolutely amazed if we got data like this. Although the numbers represent an estimate of the expected value for each of the ten men and women we know from experience that people’s heights vary around this value. In fact they vary quite a lot. The standard deviation is around 6cm. So we need to add some variability. This demonstrates clearly how statistical procedures work. Data consists of two components. It has an underlying structure that we typically want to find out about. In this case there is a difference in heights which is related to gender. There is also random variability, somtimes called the stochastic component. We are usually less interested in this directly, but we need to be very careful to make justifiable assumptions about this variability in order to correctly analyse the data.
Knowing this we can now make our variable more realistic by adding in some simulated values taken from a normal distribution with this standard deviation.
set.seed(1)
height<-height+rnorm(20,sd=6)
height
## [1] 172.2413 177.1019 170.9862 185.5717 177.9770 171.0772 178.9246
## [8] 180.4299 179.4547 174.1677 173.0707 166.3391 160.2726 150.7118
## [15] 170.7496 163.7304 163.9029 169.6630 168.9273 167.5634
We may want to round the values to one decimal place to make them equivalent to the sort of measurements we might make in practice.
height<-round(height,1)
Now we have two variables that are held in R’s memory. We are assuming that they both form part of a simulated data set that we could have obtained if we had measured a stratified random sample of twenty students consisting of ten men and ten women. Let’s call the survey “hsurvey”" and make what is known as a data frame to hold the results.
hsurvey<-data.frame(gender,height)
str(hsurvey)
## 'data.frame': 20 obs. of 2 variables:
## $ gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ height: num 172 177 171 186 178 ...
So we have now made up data that has the same structure and similar properties to the results we would get from carrying out a simple survey of people’s heights. The basic concepts used to do this can be extended to more complex situations and used to help the design of experiments.
A data frame is the basis of almost all statistical analysis. It consists of two or more columns that line up in such a way that the measurements or observations recorded have been taken from the same individual member of a sample. This is often equivalent to a single sheet in a spreadsheet. R tells us that we have one factor and one numerical variable in this case. We might have measured many other variables at the same time and produced a much wider set of data. These may have been either categorical or numerical. Providing that they all “line up”" and correspond to the same individual we have our data frame. Many standard statistical techniques use two variables together. Later on in the course you will see how we can analyse data consisting of more than two variables using multivariate analysis.
The complete data frame is shown below.
library(dplyr)
library(aqm)
dt(hsurvey)
If you want to remove all the other variables you set up from memory and leave just the data frame you could type.
remove(x,logx,height,gender)
ls()
## [1] "hsurvey" "ind" "n" "se" "y"
Now if we ask R to print height or gender it will not find them. They have been placed within the data frame.
height
## Error in eval(expr, envir, enclos): object 'height' not found
gender
## Error in eval(expr, envir, enclos): object 'gender' not found
We can refer to the variables by writing out the full name of the data frame and the variable we want, separated by “$”
hsurvey$height
## [1] 172.2 177.1 171.0 185.6 178.0 171.1 178.9 180.4 179.5 174.2 173.1
## [12] 166.3 160.3 150.7 170.7 163.7 163.9 169.7 168.9 167.6
hsurvey$gender
## [1] Male Male Male Male Male Male Male Male Male Male
## [11] Female Female Female Female Female Female Female Female Female Female
## Levels: Female Male
Now that we have made up some data we might want to save it in the format that we will eventually use to capture our real data. The simplest, most portable data format is a CSV (Comma Separated Variable) file. Such a file can be easily read by all software.
First we must find a place to store the data. The standard practice is to make a separate data directory (folder) for each analysis that you carry out. You then set this as the working directory using the menu options in the R console. Once you have set your working directory (which could be on a USB stick for portability) you can save the data that is held in memory using an R command.
write.csv(hsurvey,file="hsurvey.csv",row.names=FALSE)
The command has three elements. The first is the name of the data frame that you wish to save. This is not quoted. The second is the name of the file into which you are going to save the data. The third tells R not to add row names to the file (these are not usually necessary).
Data frames are the standard input to all statistical analysis and are of a consistent form. Many different sorts of data summaries and tables can be generated from a data frame quickly by statistical software.
You can check the working directory using this command.
getwd()
## [1] "/home/rstudio/webpages/phd_course"
To see a list of files in the directory type dir()
dir()
##[1] hsurvey.csv
If we remove the dataframe we formed we will be left with nothing in R’s memory
remove(hsurvey)
ls()
The easy way to remove all files from memory is to use the environment tab in RStudio.
To load the data back from the file type
hsurvey<-read.csv("hsurvey.csv")
str(hsurvey)
library(ggplot2)
g0<-ggplot(hsurvey,aes(x=height))
g0+geom_histogram(fill="grey",colour="black",binwidth=3)+facet_wrap(~gender,nrow=2)
So we have simulated some data, saved it, and reloaded the data into R. We have then looked at it as a histogram.
bplot<-ggplot(hsurvey,aes(x=gender,y=height))
bplot +geom_boxplot()
bplot +
stat_summary(fun.y=mean,geom="point") +
stat_summary(fun.data=mean_cl_normal,geom="errorbar")
hsurvey %>%
group_by(gender) %>%
summarise(n=n(),mean=round(mean(height),2),sd=round(sd(height),2)) %>%
dt()
The simplest possible statistical test we could use on these data would be an unpaired t-test.
t.test(hsurvey$height~hsurvey$gender)
##
## Welch Two Sample t-test
##
## data: hsurvey$height by hsurvey$gender
## t = -4.4996, df = 16.471, p-value = 0.00034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.62611 -5.99389
## sample estimates:
## mean in group Female mean in group Male
## 165.49 176.80
It is very easy to simulate data with the structure of a regression equation.
\(y=a+bx+\epsilon\) where \(\epsilon=N(o,\sigma^{2})\)
x<-1:10
a<-5
b<-2
sig<-4
y<-a+b*x+rnorm(length(x),sd=sig)
d<-data.frame(x,y)
ggplot(d,aes(x=x,y=y))+ geom_point() +geom_smooth(method="lm")
I have added a simple function to the aqm package that will simulate some Likert scale data.
q1<-aqm::rand_likert(n=100,p=c(1,1,1,2,4))
q2<-aqm::rand_likert(n=100,p=c(4,2,1,1,1))
d<-data.frame(id=1:100,q1,q2) ## Note the wide format
## Make a data frame
df<-tidyr::pivot_longer(d, cols=-id,names_to="question",values_to = "response")
## Long format
str(d$q1)
## Ord.factor w/ 5 levels "Strongly disagree"<..: 1 5 4 5 4 5 4 3 5 2 ...
df %>% group_by(question, response) %>% summarise (n=n())
## # A tibble: 10 x 3
## # Groups: question [2]
## question response n
## <chr> <ord> <int>
## 1 q1 Strongly disagree 9
## 2 q1 Disagree 8
## 3 q1 Neutral 11
## 4 q1 Agree 31
## 5 q1 Strongly agree 41
## 6 q2 Strongly disagree 44
## 7 q2 Disagree 22
## 8 q2 Neutral 10
## 9 q2 Agree 14
## 10 q2 Strongly agree 10
library(giscourse)
conn<-connect()
arne<-sssi(dist=100)
points<-st_sample(arne,100)
mapview(arne) + mapview(points)
In order to practice putting together these commands in order to simulate some data try these exercises
Set up a dataframe with three columns. One column represents the site. The second represents the type of tree (i.e. conifer or broadleaf). The third represents the diameters. So there will be 150 observations (rows) in all. Try to produce data in which there is a difference in mean diameter that is affected by the site from which the measurements are taken and the type of tree being measured. You can assume that the random variation in diameters is normally distributed and that measurements are taken to the nearest cm.