Getting started with R

This is an R Markdown document. You will see text in the white areas of the document and code in the grey “chunks”. You can amend any part of the text as you like, so you can use a markdown document as a template for your own report simply by removing all my comments and replacing them with your own comments on the R code and output. However, if you change anything at all within the grey areas you need to make sure that the R code still makes sense and still runs.

In order to build up your own markdown document, or adapt this one, you must run all the chunks in order to test that the code works. You will be shown how to do this in the class. Notice the green arrow at the top right of each chunk: it runs the code within that chunk. It is important to run the chunks in order, and to run all the code above each chunk first. If you don’t do this you will be trying to run code before you’ve loaded any data, or running code on data that should have been manipulated in some manner beforehand.

Code chunk options

The first point to notice about an R Markdown document is that it usually begins with a code chunk showing some options for “knitting” the document. Don’t worry about these for the moment. This code will not be shown when the document is knitted. Knitting refers to the process of turning the document into a finished report, in HTML, PDF or Word format. In order to include web maps you must knit the document as HTML.

Packages used

The next chunk will usually list the packages that are being used in the analysis. The library() function loads each package into memory ready for use. If you don’t load a package you can’t use its functions.

library(mapview)
library(leaflet.extras)
library(ggplot2)
library(sf)
library(rgdal)

Loading data

Loading data into R is very different from the way you may be used to loading data. Don’t try to load data using the file menu at the top of RStudio! That menu is not there to load data; it is for loading scripts and markdown documents. In order to load data you need to know where the data are stored on your computer (or, in this case, the server) and what the files are called. There is different R syntax for loading different types of data. Although this may seem confusing and may cause difficulties and errors at the start (any typo in either the path or the filename will cause an error), it is actually very powerful, as you only need to change the name of the file at this single point in order to run the whole analysis again with a different input and a different output. I’ll show the idea of this with the GPS data that we collected first. More commonly you will be loading data in CSV format.

Loading the group B gpx file

One group provided a GPS labelled “Great Horseshoe Bat” for some reason. I placed the file from this GPS in the msc_data/gis_data folder under the name Waypoints_13-OCT-17-B.gpx.

So this line reads the file into R:

group_b <- st_read(dsn = "/home/msc_data/gis_data/Waypoints_13-OCT-17-B.gpx", layer="waypoints")
## Reading layer `waypoints' from data source `/home/msc_data/gis_data/Waypoints_13-OCT-17-B.gpx' using driver `GPX'
## Simple feature collection with 27 features and 23 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -2.038495 ymin: 50.68562 xmax: -2.033997 ymax: 50.68792
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

This has placed the data into a spreadsheet-like data frame called group_b. If you look in the “Environment” window in the top right corner of RStudio you’ll see group_b in the Data category. If you click on it you’ll open a spreadsheet-like table in R in order to eyeball the data.

Notice that there are a lot of columns with no data (NA) collected by the GPS. We can remove these by subsetting the columns. Don’t worry about the detail of the R code here yet; I’ll explain how it works as the course goes on. The idea is that a data frame consists of rows and columns, and the square brackets refer to [rows, columns]. You can form a vector of numbers in R using the c(1,2,5,24) syntax, which “concatenates” numbers together. So group_b[,c(1,2,5,24)] gives the first, second, fifth and twenty-fourth columns. The <- operator assigns this new data frame back to the original name.

group_b<-group_b[,c(1,2,5,24)]

So now we have got rid of the useless columns
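To make the [rows, columns] idea concrete, here is a tiny self-contained sketch using a made-up data frame (not the GPS data):

```r
# A small made-up data frame to practise subsetting on
df <- data.frame(a = 1:3, b = c("x", "y", "z"), c = 10:12, d = TRUE)

# c() concatenates numbers into a vector of column positions
keep <- c(1, 3)

# Blank before the comma means "all rows"; keep only columns 1 and 3
df <- df[, keep]
names(df)  # "a" "c"
```

The same pattern with a vector of numbers before the comma would keep only the chosen rows.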

Mapping the GPX data

We can now use the magic of a high-level wrapper function to build a simple web map with one line.

mapview(group_b)

Reading the other groups

There are two other GPX files in the data folder. We can read them in using the same approach.

group_f <- st_read(dsn = "/home/msc_data/gis_data/Waypoints_13-OCT-17-F.gpx", layer="waypoints")
## Reading layer `waypoints' from data source `/home/msc_data/gis_data/Waypoints_13-OCT-17-F.gpx' using driver `GPX'
## Simple feature collection with 26 features and 23 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -2.038524 ymin: 50.68661 xmax: -2.03726 ymax: 50.68791
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
group_f<-group_f[,c(1,2,5,24)]

group_h <- st_read(dsn = "/home/msc_data/gis_data/Waypoints_13-OCT-17-H.gpx", layer="waypoints")
## Reading layer `waypoints' from data source `/home/msc_data/gis_data/Waypoints_13-OCT-17-H.gpx' using driver `GPX'
## Simple feature collection with 23 features and 23 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -2.03367 ymin: 50.68785 xmax: -2.031533 ymax: 50.68877
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
group_h<-group_h[,c(1,2,5,24)]

Putting them together

We’ll now bind the three files together to make a single data frame.

group_b$group<-"b"
group_f$group<-"f"
group_h$group<-"h"

gps<-rbind(group_b,group_f,group_h)

Look at the new data frame called gps. Because the three data frames had exactly the same columns it was possible to bind them together. This can only work if the data frames have the same number of columns, with the same names. Consistency is key to working with any data.
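The same requirement applies to ordinary (non-spatial) data frames; a minimal sketch with made-up data:

```r
# rbind() only succeeds when the pieces have identical column names
part1 <- data.frame(site = c("A", "B"), count = c(4, 7))
part2 <- data.frame(site = "C", count = 2)

combined <- rbind(part1, part2)
nrow(combined)  # 3
```

If part2 had a column called Count instead of count, rbind() would stop with an error rather than guess.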

Note that R handles variables such as times very nicely, providing they are read in properly. If we ask for a description of the data using the str function in R we can check the nature of the data.

str(gps)
## Classes 'sf' and 'data.frame':   76 obs. of  5 variables:
##  $ ele     : num  10.35 9.76 6.71 6.96 6.28 ...
##  $ time    : POSIXct, format: "2017-10-13 11:11:36" "2017-10-13 11:11:56" ...
##  $ name    : Factor w/ 54 levels "002","003","004",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ group   : chr  "b" "b" "b" "b" ...
##  $ geometry:sfc_POINT of length 76; first list element: Classes 'XY', 'POINT', 'sfg'  num [1:2] -2.04 50.69
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
##   ..- attr(*, "names")= chr  "ele" "time" "name" "group"

Ele(vation) is a numerical variable. Time is POSIXct, which means R can handle times cleanly. The name of each waypoint is a factor (note that it is not numeric). The names of the groups are characters. We can coerce between factors and characters very easily. You’ll understand what a factor is later on when we look at statistical analysis.
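As a quick illustration of that coercion (plain base R, not the GPS data):

```r
# A factor stores its values as a set of levels
grp <- factor(c("b", "b", "f", "h"))
levels(grp)  # "b" "f" "h"

# Coerce to character and back again
grp_chr <- as.character(grp)
grp_fac <- as.factor(grp_chr)
```

Both conversions are lossless here, which is why swapping between the two representations is so painless.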

If we ask R for the CRS it will tell us the EPSG code used.

st_crs(gps)
## $epsg
## [1] 4326
## 
## $proj4string
## [1] "+proj=longlat +datum=WGS84 +no_defs"
## 
## attr(,"class")
## [1] "crs"

In this case it is EPSG 4326, which is simple unprojected Latitudes and Longitudes using the WGS 84 datum. This is suitable for data collection anywhere in the world, but data would need to be projected in order to measure areas and distances.
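As a rough illustration of why you should not treat degrees as distances, the great-circle (haversine) distance between two corners of the bounding box shown above can be computed by hand in base R. In real work you would project the data instead (for example with st_transform); this is just a sketch of what is going on underneath.

```r
# Approximate great-circle distance in metres between two lon/lat points
haversine <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Opposite corners of the group B bounding box: roughly 400 m apart
d <- haversine(-2.038495, 50.68562, -2.033997, 50.68792)
```

Notice that the raw differences in degrees are tiny and meaningless as distances until converted like this, which is exactly the work a projection does for you.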

Just to show how neat time handling is in R, let’s see how long the fieldwork took in total.

max(gps$time)-min(gps$time)
## Time difference of 1.800278 hours
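Under the hood this is ordinary POSIXct arithmetic; a small sketch with made-up times chosen to match the difference above:

```r
# POSIXct times can be subtracted directly
start <- as.POSIXct("2017-10-13 11:11:36", tz = "UTC")
end   <- as.POSIXct("2017-10-13 12:59:37", tz = "UTC")

end - start                           # Time difference of 1.800278 hours
difftime(end, start, units = "mins")  # the same difference in minutes
```

difftime() is useful when you want to control the units rather than accept whatever R chooses.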

Now we can produce a map with all the data. I’ve added a full screen control so that the map can be looked at in more detail in the knitted document.

mp<-mapview(gps)
mp@map <-mp@map %>% addFullscreenControl()
mp

Figures in R

There are not many variables in these data. All we really have are the coordinates, the times and estimated elevation. We don’t really need the elevation as estimated by the GPS as we will read more accurate values from the DTM later. However just to illustrate plotting in R we can plot the elevation against time for these three groups (this will give a quick visualisation of the climbs and descents).

To do this we will use the ggplot2 package in R. It is slightly more difficult to write the lines of code that produce a ggplot than to use R’s base graphics. However the approach is much more powerful and consistent.

The first step is to set up what are known as the aesthetics (aes). The term aesthetics seems a little odd, as it does not refer to the appearance of the plot. It refers to the elements of the data frame that are going to be plotted.

g0<-ggplot(gps,aes(x=time,y=ele,color=group))

So we have an object that has taken data from the gps data frame. Our x values are taken from time, the y values are ele (elevation) and we’ll use different colours for each group.

Now to plot lines we add a geometry.

g0+geom_line() 

If a figure is to be used in a publication there are many tweaks that can be applied to improve its appearance. However this quick default is fine for simply checking on the groups’ progress. Group f finished first. Group b descended the most. Group h spent the most time collecting the data.

Writing the data to a file

The GPS data were read into R as an sf spatial object. The rows could in fact have had geometry in the form of lines, polygons or multiple points. In this case the geometry is a single point, so we can write out a CSV (comma-separated values) file.

st_write(gps, "gps.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn=TRUE)
## Deleting source `gps.csv' failed
## Writing layer `gps' to data source `gps.csv' using driver `CSV'
## options:        GEOMETRY=AS_XY 
## features:       76
## fields:         4
## geometry type:  Point

This file can be opened and edited in Excel.
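If you only need a plain, non-spatial table, base R can do the same round trip; a sketch with made-up values:

```r
# Write a plain data frame to CSV and read it straight back
df <- data.frame(X = c(-2.0385, -2.0340),
                 Y = c(50.6856, 50.6879),
                 name = c("wp1", "wp2"))

path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

df2 <- read.csv(path)
```

st_write with GEOMETRY=AS_XY does essentially this for an sf object, turning the point geometry into X and Y columns on the way out.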

Adding to a PostGIS database

Another, more advanced, method of saving the data would be to add it to an online database for direct analysis in a program such as QGIS or ArcGIS, without the need to load the data from a file.

library(rpostgis)
conn <- dbConnect("PostgreSQL", host = "postgis", dbname = "bu" ,
                  user = "msc_student", password = "msc123")
st_write_db(conn,gps,table=c("gps","arne_field_trip_gps"),drop=TRUE)
dbDisconnect(conn)
## [1] TRUE

Use the PostGIS connector in QGIS with host 172.16.49.31, port 25432, database bu, username msc_student and password msc123 to log on.

Reading point data from Excel

If we save a data frame from Excel with points as longitudes (X) and latitudes (Y) we can read it back into R. We need to tell R which columns hold the coordinates and which CRS is being used.

library(sp)  # provides the coordinates() function used below
gps<-read.csv("gps.csv") ## Read back from the csv and overwrite old data
coordinates(gps) = ~X+Y  # Set the coordinate columns (an sp object for now)
gps<-st_as_sf(gps)  # Convert to an sf object
st_crs(gps) <-4326  # Set the CRS to EPSG 4326

The only slight problem is that the POSIXct date-time format is not preserved by CSV files, so we would have to manipulate the data slightly to get this back, but it’s not necessary here.
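For completeness, here is a sketch of restoring the times after a CSV read; the exact text format in your file may differ, so check it before choosing the format string.

```r
# Times come back from a CSV as plain character strings;
# as.POSIXct() converts them, given a matching format string
time_text <- c("2017-10-13 11:11:36", "2017-10-13 11:11:56")
time_posix <- as.POSIXct(time_text, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(time_posix)  # "POSIXct" "POSIXt"
```

Once converted, the subtraction and difftime() tricks shown earlier work again.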