Introduction

In recent years the R language for statistical computing has expanded greatly in scope. One technical feature of R that is largely invisible to users, but well understood by R developers, is that R packages can use code from a wide range of other software, by compiling against libraries written in many other computer languages. Many R functions are really just “wrappers” that use familiar R syntax to pass data on to other functions. As a result, R can be used as a fully functional platform for running operations that are also available through the point-and-click interface of a Geographical Information System. At BU we use two popular GIS programs: Esri ArcMap and QGIS. Esri’s programs are closed source, so the actual functions used in ArcMap are not available to R; however, the data formats used by the program can be read and written by R. QGIS is open source and relies heavily on a library of functions provided by the Geospatial Data Abstraction Library (GDAL). GDAL can read, write and process geographical data in a huge range of formats, as a quick web search for GDAL will confirm.

Another powerful open source tool for conducting GIS is PostGIS. Again, a web search for the name will return a huge number of pages of information on the tool. PostGIS is a spatial database, or rather a set of tools that extend the widely used relational database PostgreSQL to handle spatial data and spatial queries. PostgreSQL, like other relational databases such as MySQL, resides on a server and as such is invisible even when running locally. Client applications connect to this server to load, manipulate and query the information stored on it. This is how a large institutional database works, such as the database used by a firm like Amazon for online retail. Amazon will have one central database that stores information on all the items in stock, the customers who purchase them, and the position of items that have been dispatched. This is truly the realm of “big data”, but a database such as PostgreSQL with a spatial extension can play a similar role for an environmental agency, storing and updating spatial information. Some of this information, such as the position of organisms being tracked by GPS, may be updated in real time. As R can “talk to” PostGIS, the opportunity exists to design advanced statistical analyses of animal behaviour that update as new data come in. We will not be looking at anything that advanced on this particular course, but you should be aware from the outset that this is a powerful, industry standard piece of software, and knowledge of the concepts behind processing “big” spatial data is highly sought after in the modern workplace.

Finding and using spatial data

You have probably used a program such as ArcMap to digitise map layers before, either on this course or on others. In the early days of GIS a great deal of data had to be digitised and captured “by hand” in this way. Today much of that work has been done, although there is still a need for digitising new data as things change. Think, for example, of drawing around drone imagery of landslips and flood damage. Remote sensing is, however, typically used as an automated tool for producing new spatial layers. There is a vast array of data sources available online that contain spatial layers derived from remotely sensed data and more direct manual input. GPS tracking also provides data as direct input to a Geographical Information System. However, if you have searched for GDAL, QGIS, ArcMap or PostGIS you will very quickly find that in addition to a vast array of data there is also a vast array of formats for those data. So one of the first challenges when using any GIS is pulling together these data sources into a common framework so the information they contain can be analysed together.

If an institution uses a mixture of derived products from data providers and “in house” data produced for a specific purpose, it needs a framework to combine these multiple sources. PostGIS is designed with this in mind, as it has the capacity to read in data in almost any format. This is a rather technical subject that is not of great direct interest to environmental scientists, but you should be aware that a significant portion of the time GIS professionals spend on their projects involves manipulating and reformatting data sources into this common framework. For this introductory exercise three data sources have been used.

  1. Lidar derived digital surface and digital terrain model rasters at 2m resolution, downloaded from http://environment.data.gov.uk/ds/survey/index.jsp#/survey?grid=SZ39
  2. Open Street Map vector layers from http://download.geofabrik.de/europe/great-britain/england/dorset-latest.osm.pbf
  3. Natural England’s polygon layers showing protected areas and other areas of interest, from http://naturalengland-defra.opendata.arcgis.com

These layers have already been uploaded into a PostGIS database for use in the class. Notice that if you do go to the sites linked above you will find that incorporating the data into an Arc project may be rather time consuming, as data have to be downloaded, extracted from zip files and loaded into Arc. Many of the layers are national in extent, so they include information that is not relevant when mapping a specific site, and further work is needed to extract the parts of the data of specific interest. A spatial database uses spatial indices to speed up the process of querying large data sets in order to pull out specific features, as illustrated below.
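For reference only (the class account is read-only, so you will not run this yourself), a database maintainer builds a spatial index on a geometry column with a single standard PostGIS statement along these lines (the index name here is made up):

create index sssi_units_england_geom_idx on sssi_units_england using gist (geom)

The GiST index lets the database skip most of the table when answering “what is near this point?” style queries, rather than checking every geometry in turn.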

Vector vs raster data

You may already be familiar with the concept of raster and vector data, but a reminder of the distinction may still be useful. Vector data consists of spatial features with a corresponding attribute table. The features can be

  1. Points
  2. Lines
  3. Polygons

We can also have multipoint, multiline and multipolygon layers. Think, for example, of the United Kingdom. One way of storing polygons representing Great Britain would be as a series of individual polygons representing each island separately. This would be needed if, for example, we were interested in the area of the Isle of Skye or Brownsea Island, so we could use single polygons for each island. Another possible approach would be to hold all the attributes of the country together, such as GDP, population and so on. In this case we would use a multipolygon to represent the spatial entity that the attributes correspond to, as in the sketch below.
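To make the distinction concrete, here is a minimal sketch using the sp package (with made-up coordinates, not the class data). Two small square “islands” are stored first as two separate features, then grouped into a single multipolygon feature.

library(sp)

## Two small square "islands" as closed coordinate rings
ring1 <- Polygon(cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0)))
ring2 <- Polygon(cbind(c(2, 3, 3, 2, 2), c(0, 0, 1, 1, 0)))

## One feature per island: each row of an attribute table
## would describe a single island
single <- SpatialPolygons(list(Polygons(list(ring1), "island1"),
                               Polygons(list(ring2), "island2")))

## Both islands grouped into one multipolygon feature: one
## attribute row describes the whole "country"
multi <- SpatialPolygons(list(Polygons(list(ring1, ring2), "country")))

length(single)  # 2 features
length(multi)   # 1 feature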

Raster data is a form of image. Common non-spatial image formats such as jpeg and png are essentially rasters and can be handled as such. A raster layer consists of a set of cells with attributes, and a layer can have multiple bands. A normal photograph has three bands (red, green and blue) that when combined form a coloured image; remotely sensed data can have more than three. Single band rasters such as a digital elevation model contain a numerical value for each cell that represents, in this case, the elevation within the grid cell. A raster’s scale consists of two elements: the extent and the grain (cell) size. So if we are looking at elevation over the whole of the UK we may use a rather coarse grid at 90m resolution (e.g. SRTM data), while if we are looking at a site with a small extent we would want a much finer resolution. Imagery with RGB bands might not have any directly relevant interpretation for the numerical values contained in the bands, so it may be useful as a backdrop for visualising an area but less useful for direct data processing. These concepts will all be illustrated using the example in this class.
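As a quick illustration of extent, grain and bands, the following sketch uses the raster package (loaded later in this document) with made-up values rather than real data.

library(raster)

## A coarse single-band grid: a 10 x 10 raster over a 900 x 900 unit
## extent gives a grain (cell size) of 90 x 90 units
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 900, ymn = 0, ymx = 900)
values(r) <- runif(ncell(r))  # fill the cells with random values

res(r)     # the grain: 90 by 90
extent(r)  # the area covered

## A multi-band raster (e.g. RGB) is just a stack of such layers
s <- stack(r, r, r)
nlayers(s)  # 3 bands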

Using R

In case you are reading this before it has all been explained in class, you should be aware that on this introductory course you will never be expected to type a line of R code yourself. All the code to run the operations is shown in full in the documentation. This can look daunting at first sight, particularly when some of the output from the code lines is shown. However, R code lines are simply instructions to get things done, and with a little experience it is quite easy to carefully modify these instructions to produce different results. You should step through each code chunk in RStudio carefully, as will be explained in the class, in order to get the results shown in the document, thinking about what the code has done as you go rather than worrying about the details of the syntax. At the end of the class you can knit the document to produce a copy of the HTML web page yourself. This serves as a template for modification later when you need to write the code yourself. You do need to think carefully about issues such as the size of rasters and the CRS.

Arne RSPB reserve

The class will focus on the RSPB reserve at Arne. Arne actually consists of several SSSI site units. Let’s get the vector data that represents these units from the spatial database.

This involves two steps. First we connect to the database in order to make the data available. A database connection can be made secure through a user name and password. Some users may be allowed access to some, but not all, of the data if sensitive information is stored on the database, and users may be allowed to query, but not change, data. Administrators or “super users”, on the other hand, may have permission to do whatever they want, including deleting the whole database! With great power comes great responsibility.

## R packages needed
library(rpostgis)     # helper functions for working with PostGIS from R
library(RPostgreSQL)  # driver and connection functions for PostgreSQL
library(sp)           # classes for vector spatial data
library(raster)       # classes and functions for raster data
library(rgeos)        # geometry operations on vector data

We will connect as a generic “bu_user”. This allows you to query the database but not make major changes.

conn <- dbConnect("PostgreSQL", host = "postgis", dbname = "bu",
    user = "bu_user", password = "bu_user")

OK. We now have a connection to the shared database. The same details can be used to connect to the database in QGIS, which we will use in the practical class later.

We can list the tables in the database directly from R.

dbListTables(conn)
##  [1] "planet_osm_nodes"                            
##  [2] "planet_osm_polygon"                          
##  [3] "ag_landc_p1988"                              
##  [4] "agri_land_class"                             
##  [5] "country_parks_england"                       
##  [6] "biosphere_reserves_england"                  
##  [7] "dorset_line"                                 
##  [8] "dsm_arne_hh"                                 
##  [9] "hampshire_line"                              
## [10] "hampshire_nodes"                             
## [11] "dorset_point"                                
## [12] "dorset_polygon"                              
## [13] "hampshire_polygon"                           
## [14] "hampshire_rels"                              
## [15] "dorset_ways"                                 
## [16] "dorset_nodes"                                
## [17] "dorset_rels"                                 
## [18] "dorset_roads"                                
## [19] "ancient_woodlands_england"                   
## [20] "aonb"                                        
## [21] "arne_dsm"                                    
## [22] "arne_dtm"                                    
## [23] "planet_osm_roads"                            
## [24] "planet_osm_ways"                             
## [25] "potential_special_protection_areas_england"  
## [26] "proposed_ramsar_england"                     
## [27] "ramsar_england"                              
## [28] "sites_of_special_scientific_interest_england"
## [29] "spatial_ref_sys"                             
## [30] "special_areas_of_conservation_england"       
## [31] "special_protection_areas_england"            
## [32] "planet_osm_rels"                             
## [33] "inspire"                                     
## [34] "sssi_units_england"                          
## [35] "hampshire_point"                             
## [36] "hampshire_roads"                             
## [37] "hampshire_ways"                              
## [38] "national_nature_reserves_england"            
## [39] "o_100_dsm_arne_hh"                           
## [40] "gadm2"                                       
## [41] "tnc_terrecor"                                
## [42] "local_authorities"                           
## [43] "o_100_prec"                                  
## [44] "o_100_srtm"                                  
## [45] "srtm"                                        
## [46] "UK"                                          
## [47] "o_100_dtm_arne_hh"                           
## [48] "dtm_arne_hh"                                 
## [49] "dtm_arne_hh_test"

So there is quite a lot of information, although this is a fairly small database. You would not be expected to know what is in each of these tables without reference to some supporting information (metadata) provided by the database maintainer.
Larger databases can be organised into schemas, each of which contains specific data.
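Schemas will not matter on this course, but for reference a standard SQL query against PostgreSQL’s built-in information_schema lists them; the class tables most likely all live in the default public schema.

dbGetQuery(conn, "select schema_name from information_schema.schemata")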

The sssi_units_england table looks as if it should be what we need. This is the data provided by Natural England. We can list the fields to find out their names.

dbListFields(conn,"sssi_units_england")
##  [1] "gid"        "objectid"   "sssi_name"  "ensis_id"   "number"    
##  [6] "id"         "unit_area"  "unit_perim" "easting"    "northing"  
## [11] "grid_ref"   "poly_area"  "poly_perim" "easting0"   "northing0" 
## [16] "reference"  "condition"  "cond_date"  "__gid"      "gis_file"  
## [21] "modified"   "dataversio" "gis_date"   "mapversion" "status"    
## [26] "shape_leng" "shape_area" "geom"

In the practical class you will see this information directly in the QGIS interface.

Finding Arne

A typical operation in a spatial database is finding all the records within a certain distance of some point. These sorts of operations can be run iteratively with multiple points over very large areas in order to extract statistics. Let’s find Arne. You can look the site up on Google Maps to find its latitude and longitude.

If we were building some sort of online application using the database, finding the coordinates might be done by clicking on a map to capture them directly. In this case I’ve typed the information into the code chunk below. I’ve also added the name of the table and the name of the column within the table.

The rather nasty looking line of code that follows then builds up a query that is passed to the database.

lat <- 50.69
lon <- -2.05
table_name <- "sssi_units_england"
column_name <- "sssi_name"

query <- sprintf("select %s from %s where st_dwithin(geom,st_transform(st_setsrid(st_makepoint(%s, %s),4326),27700),1000)", column_name, table_name, lon, lat)

dbGetQuery(conn,query)
##        sssi_name
## 1           Arne
## 2           Arne
## 3           Arne
## 4           Arne
## 5           Arne
## 6           Arne
## 7           Arne
## 8           Arne
## 9  Poole Harbour
## 10          Arne
## 11 Poole Harbour

I’m not expecting you to learn spatial SQL in this class! However, there are a few elements here that do need some explanation. Here’s the query that was built and sent to the database.

select sssi_name from sssi_units_england where st_dwithin(geom,st_transform(st_setsrid(st_makepoint(-2.05, 50.69),4326),27700),1000)

Although it looks complicated, it is actually fairly simple to understand. The query asks PostGIS to return the data held in just the sssi_name column of the sssi_units_england table; that is what the select sssi_name from sssi_units_england part does. If we left it at that, the query would return all the names in the table, and we don’t want to see all of them. What the query actually asks for is the names within 1km (1000m) of the longitude and latitude values that we entered. The st_makepoint(-2.05, 50.69) part turns the point into a geometry, which is the format in which data are stored in PostGIS. The reason a few more nested functions are included brings us on to a very important aspect of GIS: the Coordinate Reference System, or CRS. This needs explaining.
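Once you see the pattern, it is easy to experiment. For example, the same query skeleton with a wider 5km search radius and a distinct clause to remove the duplicate names would look like this (a sketch; only the radius and the distinct keyword differ from the query above).

query5k <- sprintf("select distinct %s from %s where st_dwithin(geom,st_transform(st_setsrid(st_makepoint(%s, %s),4326),27700),5000)", column_name, table_name, lon, lat)
dbGetQuery(conn, query5k)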

Coordinate reference systems

I can’t stress enough the absolute necessity of understanding coordinate reference systems when carrying out any GIS work. Almost all the errors and frustrations of things not working as expected come down to problems with the CRS. The reason for this is probably that when GIS is taught in a typical introductory course the tutor makes sure from the outset that all the demo data are provided in the same CRS. This prevents any errors from occurring. However, in real life, data come in different coordinate reference systems from different sources.

The reason that there are many CRSs comes down to basic cartography. The world is not flat, so the maps that we usually use are projected in some way. The only unprojected CRS involves latitudes and longitudes, and even in this case the actual values depend on the datum used. In the UK the typical projected map uses the British National Grid. The clue to why this could be a problem is in the name: other countries may have their own grid systems, and there is also a huge range of alternative ways of projecting coordinates into two dimensions. There is a lot of information online about this, which I won’t repeat.

The technical definition of a CRS is quite complex and involves quite a few separate parameters. To make things simpler and more consistent the European Petroleum Survey Group (EPSG) came up with a table of CRS definitions, each with a code number. The British National Grid is EPSG code 27700. Unprojected latitudes and longitudes using the standard global datum (WGS 84) are EPSG code 4326. So this bit of the query may now make sense: st_transform(st_setsrid(st_makepoint(-2.05, 50.69),4326),27700)

What it is saying to PostGIS is “make a point, set the point’s SRID (CRS) to EPSG 4326, then transform it to EPSG 27700”. We need to do that so the point matches the CRS of the Natural England layer, which is held in British National Grid coordinates. Nasty to look at, but absolutely necessary unless we had made the point using national grid numbers in the first place. Both PostGIS and R use a software library called Proj4 to carry out CRS transforms. So, as long as you know the EPSG code that data are held in (or can guess it from context), you can get all your data into the same CRS. This is essential: if the layers are in different CRSs they simply won’t line up.

Some software, such as QGIS, carries out so called “on the fly” reprojection when visualising data in different CRSs in order to make the maps line up. However, this can be dangerous in some cases, as if the layers are still in different CRSs spatial operations won’t work. Things can get particularly confusing if a country changes the datum it uses for its national data, as historical layers won’t match modern ones. So get to know your EPSG codes if you work in different countries, and always find out what CRS has been used for any data you obtain. Providing you know, there is no problem transforming.

Two dimensional Google maps use a rather esoteric Web Mercator projection (EPSG 3857) that is convenient because it works for the whole planet (unlike the British National Grid, of course) but actually leads to errors in measurements of distance and area. If you need highly accurate measurements, PostGIS has a sophisticated mechanism for coercing geometries into “geography” format, which allows calculations to be made on a sphere. Imagine placing a piece of string on a globe to measure distances. These are technical elements of a GIS that you need to be aware of, even if you don’t use them directly yourself.
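The same transform can be done in R. Here is a minimal sketch using sp (assuming the rgdal package is installed, since it provides the Proj4 transform behind spTransform): make the same point in WGS 84 and project it to British National Grid.

library(sp)
library(rgdal)  # provides the Proj4 transform used by spTransform

pt <- SpatialPoints(cbind(-2.05, 50.69), proj4string = CRS("+init=epsg:4326"))
pt_bng <- spTransform(pt, CRS("+init=epsg:27700"))
coordinates(pt_bng)  # eastings and northings in metres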

Bringing spatial data from PostGIS into R

OK. To recap. At this point we’ve asked for the names of the tables in our database, we’ve asked for the names of the columns, and we’ve queried a column to find out the names of the SSSI units within a 1 km radius of our latitude and longitude. There are in fact many spatial queries that can be run directly on the server as spatial SQL (Structured Query Language). If we wanted to know something like “how many SSSIs fall within Areas of Outstanding Natural Beauty” we could, if we knew some spatial SQL, find out (see the sketch below). The advantage of a spatial database is that, as fields are indexed, it handles large amounts of data very effectively. However, you can’t actually see anything when querying the database. You don’t have a map to look at (yet). All the data is out there in some rather abstract “server space”. So let’s get it out of there and into R so we can plot it.
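As a hedged sketch of the kind of spatial SQL meant here, the following counts the distinct SSSI names intersecting the aonb table from the listing above, assuming that table also stores its geometry in a column called geom.

query <- "select count(distinct s.sssi_name) from sssi_units_england s, aonb a where st_intersects(s.geom, a.geom)"
dbGetQuery(conn, query)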

This is easy now we know the name of our SSSI, which unsurprisingly is Arne.

query<-"select  * from sssi_units_england where sssi_name = 'Arne'"
arne <- pgGetGeom(conn, query=query)

Nothing seems to have happened, but we’ve now got our data object into R. This consists of two components: the geometries that are used to draw it, and the data table with the attributes.
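A quick check confirms what arrived; pgGetGeom returns sp class objects, so for this table the result should be a SpatialPolygonsDataFrame.

class(arne)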

We can extract the attribute table from the object using the @ operator.

arne_data<-arne@data
str(arne_data)
## 'data.frame':    18 obs. of  27 variables:
##  $ gid       : int  1763 3646 3647 4759 5829 6424 8646 6941 7395 7794 ...
##  $ objectid  : int  1763 3646 3647 4759 5829 6424 8646 6941 7395 7794 ...
##  $ sssi_name : chr  "Arne" "Arne" "Arne" "Arne" ...
##  $ ensis_id  : num  1e+06 1e+06 1e+06 1e+06 1e+06 ...
##  $ number    : num  1 2 2 3 4 5 11 6 7 8 ...
##  $ id        : num  1005800 1005786 1005786 1005819 1005805 ...
##  $ unit_area : num  16.18 11.37 11.37 1.13 72.94 ...
##  $ unit_perim: num  2049 2792 2792 1755 4286 ...
##  $ easting   : num  398193 397810 397810 396910 397217 ...
##  $ northing  : num  89496 89529 89529 89203 89048 ...
##  $ grid_ref  : chr  "SY981894" "SY978895" "SY978895" "SY969892" ...
##  $ poly_area : num  16.18 9.43 1.95 1.13 72.94 ...
##  $ poly_perim: num  2049 1493 1300 1755 4286 ...
##  $ easting0  : num  398193 397875 397495 396910 397217 ...
##  $ northing0 : num  89496 89523 89557 89203 89048 ...
##  $ reference : chr  "SY981894" "SY978895" "SY974895" "SY969892" ...
##  $ condition : chr  "UNFAVOURABLE RECOVERING" "UNFAVOURABLE RECOVERING" "UNFAVOURABLE RECOVERING" "FAVOURABLE" ...
##  $ cond_date : Date, format: "2009-08-05" "2009-05-15" ...
##  $ __gid     : num  1072089 1072099 1072099 1072100 1072101 ...
##  $ gis_file  : chr  NA NA NA NA ...
##  $ modified  : Date, format: "2004-01-20" "2004-01-21" ...
##  $ dataversio: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ gis_date  : Date, format: NA NA ...
##  $ mapversion: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ status    : chr  "Current" "Current" "Current" "Current" ...
##  $ shape_leng: num  2049 1493 1300 1755 4286 ...
##  $ shape_area: num  161831 94262 19460 11300 729375 ...

There are a lot of fields here that are not particularly useful. If we wanted to simplify the data we could have selected just the fields we wanted when building the query.

query<-"select  gid,unit_area,condition,geom from sssi_units_england where sssi_name = 'Arne'"
arne <- pgGetGeom(conn, query=query)
arne_data<-arne@data
str(arne_data)
## 'data.frame':    18 obs. of  3 variables:
##  $ gid      : int  1763 3646 3647 4759 5829 6424 8646 6941 7395 7794 ...
##  $ unit_area: num  16.18 11.37 11.37 1.13 72.94 ...
##  $ condition: chr  "UNFAVOURABLE RECOVERING" "UNFAVOURABLE RECOVERING" "UNFAVOURABLE RECOVERING" "FAVOURABLE" ...

Notice that the query selected the geometry column (geom). This is the equivalent of the .shp part of an ESRI shapefile. The attributes are the equivalent of the .dbf part.
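If you did want a shapefile copy of this layer for use in Arc, the rgdal package can write one. A minimal sketch, with a made-up layer name (the files land in the current working directory):

library(rgdal)
writeOGR(arne, dsn = ".", layer = "arne_sssi", driver = "ESRI Shapefile")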

Plot the spatial object

We will eventually add these polygons to a zoomable web map in order to explore the area visually. However, we can first make a simple plot of the data in R to check that everything is working as expected.

plot(arne, axes = TRUE)

OK, so far so good. We have our first building block of data in memory.

Let’s start to build the zoomable map using leaflet.

Leaflet maps

Leaflet web maps consist of one or more basemap layers plus overlay layers that display spatial data. Leaflet itself is actually written in JavaScript, but R provides a simple way of building up maps using the %>% operator.

Before adding polygons to a leaflet map they must be in a suitable global CRS. We’ll transform from British National Grid to latitude/longitude (EPSG 4326).

arne4326 <- spTransform(arne, CRS("+init=epsg:4326"))

Now we can form a very simple default map with a few lines of code.

library(leaflet)

arne_map <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = arne4326)
arne_map

This is very simple. The addTiles line sets up an Open Street Map basemap, and addPolygons adds in the Arne layer. However, although this is a slippy web map, there is no way of switching the overlay off.

We can make a slightly more useful map by adding a layers control and a satellite basemap. To do this we give each layer a group name for use by the control.

arne_map <- leaflet() %>%
  addTiles(group = "OSM") %>%
  addPolygons(data = arne4326, group = "Arne SSSIs") %>%
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
  addLayersControl(baseGroups = c("OSM", "Satellite"), overlayGroups = c("Arne SSSIs"))

arne_map

Now you can zoom in and out and switch between layers. The leaflet map is an R data object that can be saved and loaded into new projects.
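For example (a sketch, with made-up file names), the map object can be saved for a later R session, or exported as a standalone web page using the htmlwidgets package that leaflet is built on.

saveRDS(arne_map, "arne_map.rds")                   # reload later with readRDS()
htmlwidgets::saveWidget(arne_map, "arne_map.html")  # standalone web page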

Cleaning up

Disconnect from the database.

dbDisconnect(conn)
## [1] TRUE

Save the workspace for the next class.

save(list=ls(),file="spatial_class1.rob")
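In the next class the saved objects can be restored with base R’s load function.

load("spatial_class1.rob")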