set.seed(1) ## To ensure consistency of random points
library(rpostgis)
library(RPostgreSQL)
library(sp)
library(raster)
library(rgeos)
library(rgdal)
library(RColorBrewer)
library(leaflet.extras)
library(mapview)
library(sf)
library(ggplot2)

### Set some mapview options for base maps
mapviewOptions(basemaps =c("Esri.WorldImagery","OpenStreetMap","OpenTopoMap", "OpenStreetMap.BlackAndWhite", "OpenStreetMap.HOT"))
####################

Introduction

In the last class we produced a set of derived raster layers from the digital terrain model and the digital surface model. These layers should all be saved in a directory named gis_data in your home directory. If they are not, the following analysis won't work, so you must first check that all the files are in the right place.

Raster layers can be loaded into R from geotiff format quite simply, as long as you take care that the path to the file and the file name are typed correctly.

aspect<-raster("gis_data/aspect.tif")
slope<-raster("gis_data/slope.tif")
dtm<-raster("gis_data/dtm.tif")
dsm<-raster("gis_data/dsm.tif")
chm<-raster("gis_data/chm.tif")
chm10<-raster("gis_data/chm10.tif")
sol<-raster("gis_data/sol.tif")
twi<-raster("gis_data/twi.tif")
chm[is.na(chm)]<-0 ## Need to do this to avoid errors later as very low vegetation height values had been set to missing. Change them to zeros instead.
chm10[is.na(chm10)]<-0

So now we have all these layers in R memory, ready to be used in the analysis. We don't need to plot them out or look at anything else at this stage. If you check the Environment pane in RStudio you will see that they are all present and ready for use.
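
If you want to reassure yourself that the layers really did load and line up before going any further, a couple of quick optional checks (just a sketch; none of this is needed for the rest of the analysis) can be run from the console.

compareRaster(dtm, dsm, stopiffalse=FALSE) ## TRUE if the extent, resolution and CRS match
res(chm)     ## Pixel size of the canopy height model in map units
extent(dtm)  ## Bounding box of the terrain model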

Generating some “fake” data

You should have your own CSV file with the GIS points, your measurements of the number of small trees and other covariates ready for analysis.

To design an analysis I will generate 100 random points and make up some random numbers for the number of pines. There should be no particular relationship between the numbers of pines in my random data and any of the covariates. You might find something different when you analyse the real data you collected.

As the study area (Grip heath) was much smaller than the RSPB reserve, I quickly digitised a polygon to represent the rough area and placed this in the PostGIS database. This code will load in my polygon.

conn <- dbConnect("PostgreSQL", host = "postgis", dbname = "bu" ,
    user = "msc_student", password = "msc123")
query<-"select * from arne.study_area"
study_area<-st_read_db(conn, query=query)

### Also bring in the OSM paths vector layer for the study area
query<-"select osm_id, st_intersection(geom,way) geom from 
arne.study_area a,
dorset_line d
where st_intersects(geom,way)"

paths<-st_read_db(conn, query=query)


study_area <- st_transform(study_area, 27700)
paths <- st_transform(paths, 27700)
dbDisconnect(conn)

This is useful as I can place random points within the polygon and crop rasters to its bounds if necessary to reduce their extent.

mp<-mapview(study_area) +mapview(paths)
mp@map <-mp@map  %>% addFullscreenControl() %>% addMiniMap(position = "bottomleft") 
mp

Random points

There are some technical aspects I will mention here that are only of any interest if you really want to know more about R (e.g. PhD students using R for their own work may take note). I have placed these comments in italics and you can safely ignore them if you are not interested.

Spatial objects in R were traditionally held as objects of the sp class. Several books have been written on the technical aspects of working with these data. More recently, however, the sf package has become available. sf stands for "simple features". Simple feature objects are easier to understand, as they are just data tables with a geometry column holding the vertices of polygons, the nodes of lines or the coordinates of points. In a few years' time these will almost certainly replace sp objects completely. At the moment, however, both data classes exist simultaneously and many functions only accept the old form, so it is often necessary to convert between the two classes in order to use those functions. This simply involves adding a line of code here and there. The simple features vignette in the help files is rather technical but has some useful information on how spatial data is organised using this package.

http://r.bournemouth.ac.uk:8788/help/library/sf/doc/sf1.html

sp_study<-as(study_area, "Spatial") ## Coerce to sp class

Now I can generate some random points within the study area polygon. The function uses the old sp class.

### Generate our points by sampling randomly within the polygon. Note that there are other options for the sampling scheme.
rand_points<-spsample(sp_study, n=100, type='random')

We might quickly plot these out to get an idea what they look like.

plot(sp_study,axes=T)
plot(rand_points,add=T,pch=23,bg="red")

Generating the number of pines

A quick note about stats at this point, which will be returned to in the Advanced Quantitative Methods unit when we look at generalised linear models. R can generate random numbers from a huge range of statistical distributions. These can be useful for testing model ideas. They are also used in some algorithms for direct analysis. Count data (and our measure is a count) can only be approximated by a normal distribution when count values are high. If zero counts are possible we expect the values to form a non-normal distribution a priori. So I will simulate values from a Poisson distribution.

npines<-rpois(100,3) ## Generate 100 random integers from a Poisson distribution with mean 3
hist(npines)

This is a fairly realistic distribution for the count data.
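
A quick numerical summary (optional) confirms what the histogram shows.

table(npines)  ## Frequency of each simulated count
mean(npines)   ## Should be close to the Poisson mean of 3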

Now we can add that variable to the generated points to form a spatial data frame, which can also be turned into a simple features (sf) object.

rand_points<-SpatialPointsDataFrame(rand_points,data.frame(npines))
sf_rand_points<-st_as_sf(rand_points) ## Coerce to sf class

Try clicking on the sf_rand_points object in the top window in RStudio.

head(sf_rand_points)
## Simple feature collection with 6 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 397270.4 ymin: 87504.81 xmax: 397810.2 ymax: 87693.36
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
##   npines                       geometry
## 1      3 POINT(397515.203104844 8755...
## 2      2 POINT(397401.367027277 8761...
## 3      4 POINT(397810.225532101 8750...
## 4      2 POINT(397270.433222307 8757...
## 5      3 POINT(397764.286001004 8769...
## 6      1 POINT(397537.929277421 8764...

Saving and loading

The made up data can now be saved and treated as if they were the real data in order to design an analysis. If you just change the read in line to become the real study data you will be using that instead of this fake, random data.

Note that before saving I have changed the CRS to EPSG 4326 so the data will be in lat long. It should then be identical to the GPS file, with X and Y columns holding the coordinates in latitude and longitude. You should have some additional columns for other covariates that you measured in the field. I have not added these here.

sf_rand_points<-st_transform(sf_rand_points,4326)
st_write(sf_rand_points, "gis_data/rand.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn=TRUE)
## Deleting source `/home/rstudio/rstudio/qands/MsC_R_Spatial/Arne/gis_data/rand.csv' using driver `CSV'
## Writing layer `rand' to data source `/home/rstudio/rstudio/qands/MsC_R_Spatial/Arne/gis_data/rand.csv' using driver `CSV'
## options:        GEOMETRY=AS_XY 
## features:       100
## fields:         1
## geometry type:  Point

Look in the gis_data folder and you will find a csv file that you can open using Excel. This is a template for your own data file.
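
If you prefer to check from within R rather than opening Excel, a quick optional peek at the written file might look like this.

list.files("gis_data", pattern="rand")  ## Confirm the file has been written
head(read.csv("gis_data/rand.csv"))     ## First few rows: X, Y and npines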

Analysing the study points

Now, if you substitute your own file with your measurements for my random points (rand.csv) you should be able to use the rest of the R code in this document to run through and analyse your own data. I will help you with this process in the class. So, after loading in the rasters and the csv points, this is your real starting point.

The first step is to read in the CSV file. Make sure that you change the name to the one you have produced yourselves!

study_points<-read.csv("gis_data/rand.csv")
## Read back from the csv and overwrite old data

Now, as an optional little trick I have added in some code that produces a link from each point to Google maps. This will work when we make the leaflet map. Don’t worry about the code here.

######## Optional bonus code: Add a clickable link to the point in Google maps
 google<-sprintf("http://maps.google.com/maps?&z=15&mrt=yp&t=k&q=%s+%s",study_points$Y,study_points$X)
study_points$google<-sprintf("<a href='%s'>Show on Google maps</a>",google)
#####################################################################

To turn the dataframe with the X and Y coordinates into a spatial points object in R we need to tell R what the columns are called with the coordinates. That forms an object of class sp. We will also make an object of the more modern class sf, which is generally easier to work with (but which still doesn’t work with some of the older functions).

coordinates(study_points) = ~X+Y  # Set coordinates
sf_study_points<-st_as_sf(study_points)  # Make sf object
st_crs(sf_study_points) <-4326 ## Set CRS

As the coordinates were in latitude and longitude (EPSG 4326) we will transform them to national grid (EPSG 27700). Again, this is very simple in R.

sf_study_points<-st_transform(sf_study_points,27700) ## Transform to national grid
study_points<-as(sf_study_points, "Spatial") ## Remake the sp object to ensure it has the right CRS too

Extract all the values from raster layers

Extracting values from raster layers can take various forms. The simplest is to just find the pixel at the same position of the point, take its value and add it to the data table. This code will do that for all the raster layers we have loaded.

## Note that it is best to type raster::extract rather than just extract as there are other functions in R called extract and we want to make sure that the function from the raster package is used.
study_points$aspect<-raster::extract(aspect,study_points)
study_points$sol<-raster::extract(sol,study_points)
study_points$slope<-raster::extract(slope,study_points)
study_points$twi<-raster::extract(twi,study_points)
study_points$dtm<-raster::extract(dtm,study_points)
study_points$chm<-raster::extract(chm,study_points)
study_points$chm10<-raster::extract(chm10,study_points)

All the data has been added to the sp object, so we’ll reform the sf object too.

sf_study_points<-st_as_sf(study_points)  # remake the sf object

Click on this in the Environment pane in RStudio. You will see that all the values have now been added.
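
If you would rather check from the console, head and summary show the new columns (this is just an optional check).

head(sf_study_points)         ## First few rows with the extracted covariates
summary(sf_study_points$chm)  ## Quick summary of one of the new columns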

Extracting values around a point

Another way of looking at the issue of finding values for a point is to take an area around the point and calculate some value based on all the pixels that fall into this buffer region. If there are a lot of points and very large raster layers this can take much longer. However it can make more sense to look at a neighbourhood around a point rather than the point itself.

If you recall, the chm10 layer is the canopy height layer averaged over around 10m rather than 2m (i.e. thinned by a factor of 5). So if we take a 5m buffer around the points and calculate the mean canopy height we will get a similar (but not identical) value.

We can do this in R using a rather longer code line (don’t worry about the syntax, just notice the inputs and the definition of the buffer radius in meters)

study_points$chm5b<-unlist(lapply(raster::extract(chm,study_points,buffer=5),mean))
sf_study_points<-st_as_sf(study_points)  # remake the sf object

Look at the comparison between the extracted values for chm10. If you want to calculate elements such as the slope around a point rather than the slope precisely at a point this is the way to do it. You might want to carefully copy the lines above and change the input variable and maybe the value of the buffer.
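
As an illustration of how that comparison could be made, and of changing the input layer and buffer, here is a short sketch. The slope10b column name and the 10 m buffer are just examples and not part of the original analysis.

plot(sf_study_points$chm10, sf_study_points$chm5b, xlab="chm10 (aggregated raster)", ylab="Mean chm in 5m buffer")
abline(0,1) ## Points close to this line show the two measures broadly agree
study_points$slope10b<-unlist(lapply(raster::extract(slope,study_points,buffer=10),mean)) ## Example: mean slope in a 10m buffer
sf_study_points<-st_as_sf(study_points)  # remake the sf object again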

Clipping the canopy height model raster

We are now going to use the canopy height model (chm) to automatically derive a layer holding the position of trees. However this layer extends far beyond the study site so some of the calculations would take time to run on the whole raster. So we will first crop the raster to the bounds of the study area.

chm_study<-crop(chm, sp_study) ## Use the sp class object

Vectorising the trees

Distance calculations are much easier to run when both layers involved are vectors. We will define "trees" as areas with a canopy height over 2 meters and vectorise (draw around) all the pixels that are selected by this definition.

chm_study[chm_study<2]<-NA ## Set below 2m as blanks (NA)
chm_study[chm_study>=2]<-1 ## Set above 2m to a single value
chm_vect<-rasterToPolygons(chm_study, dissolve=TRUE) ## Vectorise

The chm_vect layer is a multipolygon of the boundaries around the clusters of pixels. Let’s take a quick look at it using a default plot.

plot(chm_vect,axes=T)

As this layer has a multipolygon geometry, if we calculate distances between the study points and the layer we will only get one value per point (the distance to the nearest part of the multipolygon). Think in terms of distances to the coastline of the British Isles. So if we want the distances between the points and all the polygons we need to split the layer up into single polygons, with one row per polygon.

sf_chm<-st_as_sf(chm_vect) ## Change to an sf object
sf_chm<-st_transform(sf_chm,27700) ## Just to ensure the exact same CRS string --- it was in 27700 but needed the epsg code
sf_chm<-st_cast(sf_chm,"POLYGON")
## Warning in st_cast.sf(sf_chm, "POLYGON"): repeating attributes for all sub-
## geometries for which they may not be constant

Now look at the sf_chm object. It is now in the form of single polygons. You will see a warning about the attributes being repeated. This is OK in this case, but think about what the warning implies. If you split up a multipolygon of the British Isles, the area attribute of each resulting polygon would still be the total area of the British Isles, which would not necessarily make sense. It is worth being aware of these sorts of issues, as they commonly arise in GIS work. They are easily fixed, as you would just recalculate the areas, provided that you realised what was happening. There could be errors if you don't, hence the warning. A GIS can't guess new values when a multipolygon is split without being told. The same applies to Arc and QGIS.
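
For example, if polygon area mattered for this analysis you would simply recalculate it after the split. The area_m2 column name below is just an illustration.

sf_chm$area_m2<-st_area(sf_chm) ## Area of each individual polygon, in square metres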

Forming a distance matrix

We can now ask R to calculate all the distances that are possible between every single point and every single polygon. In some cases this could lead to a huge matrix, so this approach won’t always be practical for very large amounts of data. However it will work well here.

distances<-st_distance(sf_study_points,sf_chm)

OK, so that's done, but we don't need to look at this object as it consists of 166500 distance values. It is just held in memory and can now be used for other calculations.
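
If you do want to confirm its size, dim shows one row per study point and one column per tree polygon.

dim(distances) ## rows = study points, columns = tree polygons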

Now, if we want to find the distance between each study point and the nearest tree we need to find the minimum distance in each row of our matrix. We can do that in R using the following code.

sf_study_points$min_dist<-apply(distances,1,min)

If we want some other type of calculation we can write any sort of function and apply it to the matrix. So, this function counts the number of polygons (trees) within a distance of 50 meters.

f50<-function(x)length(x[x<50])
sf_study_points$n50m<-apply(distances,1,f50)

Now look again at the sf_study_points object. We have the two extra columns added to it. We can write this all out again and we end up with a csv data file with all these derived variables added in. You might like to carefully alter the code line to derive counts of trees within different distances, for example 100 m:

f100<-function(x)length(x[x<100])
sf_study_points$n100m<-apply(distances,1,f100)

Distance to paths

The same analysis can add in the distance to the nearest path.

paths<-st_transform(paths,27700)
distances_p<-st_distance(sf_study_points,paths)
sf_study_points$paths_dist<-apply(distances_p,1,min)

Conclusions

Although all this R code may look rather complicated, think about how much time this has all saved. If you wanted to measure these properties in the field it would have involved a lot of additional field work. Calculating the results in a point and click GIS would also take some time. With a little care and attention to the input data all the R code can now be reused to carry out exactly the same calculations at other sites for which Lidar data is available.
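
As a small illustration of that reusability, the covariate extraction step could be wrapped up as a function and applied to any set of points and layers. This is only a sketch; the function name and arguments are my own invention, not part of the analysis above.

## A reusable covariate extraction helper (illustrative sketch)
extract_covariates<-function(points, layers){
  for(nm in names(layers)){
    points[[nm]]<-raster::extract(layers[[nm]], points)
  }
  points
}
## Example use at another site with the same layer names:
# other_points<-extract_covariates(other_points, list(aspect=aspect, slope=slope, dtm=dtm, chm=chm))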

Write out the data file

We can now form a CSV file and a shapefile with all the variable columns added as attributes.

st_write(sf_study_points, "gis_data/study_points.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn=TRUE)
## ignoring columns with unsupported type:
## [1] "X.1"
## Deleting source `/home/rstudio/rstudio/qands/MsC_R_Spatial/Arne/gis_data/study_points.csv' using driver `CSV'
## Writing layer `study_points' to data source `/home/rstudio/rstudio/qands/MsC_R_Spatial/Arne/gis_data/study_points.csv' using driver `CSV'
## options:        GEOMETRY=AS_XY 
## features:       100
## fields:         14
## geometry type:  Point
## Write out study points as shapefile for direct use in another GIS

study_points<-as(sf_study_points, "Spatial")
writeOGR(study_points,"gis_data","study_point_shapefile","ESRI Shapefile",overwrite_layer =TRUE)

## Write chm_vect as shapefile 
chm_vect_sp<-as(chm_vect, "Spatial") ## Coerce to sp class (rasterToPolygons already returns an sp object, so this is just to be explicit)
writeOGR(chm_vect_sp,"gis_data","chm_vect","ESRI Shapefile",overwrite_layer=TRUE)

Mapview of the results

Because the derived vector layer for the trees is imperfect there will be some errors in the results. This is not necessarily important for a study of this type. All studies have some measurement error, and statistical analysis is designed in part to take this into account. So, for example, the count of the number of trees within 50 m might not represent the precise number of trees, but it would be closely correlated with this quantity. It would represent an index rather than a precise measurement. This can be acceptable in many situations, provided that you fully document your methodology to avoid any confusion.

It is however a good idea to always check your results for general “sanity” by looking at the data in context. Making a mapview web map helps with this process.

If the vectorised layer for the trees is very different from the visible pattern shown in the satellite image an alternative approach would be to digitise each visible tree as a point and calculate the distance matrix using this layer instead.
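
For example (a sketch only, using a hypothetical PostGIS table called arne.digitised_trees), the distance matrix could be rebuilt against digitised tree points in exactly the same way.

## Sketch: distances to manually digitised trees (hypothetical table name)
# conn <- dbConnect("PostgreSQL", host = "postgis", dbname = "bu", user = "msc_student", password = "msc123")
# trees <- st_read_db(conn, query="select * from arne.digitised_trees")
# dbDisconnect(conn)
# trees <- st_transform(trees, 27700)
# distances_t <- st_distance(sf_study_points, trees)
# sf_study_points$tree_dist <- apply(distances_t, 1, min)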

mp<-mapview(sf_study_points,color="red") + mapview(chm_vect)
mp@map <-mp@map %>% addFullscreenControl() %>% addMeasure(
position = "bottomleft", primaryLengthUnit = "meters", primaryAreaUnit = "sqmeters")
mp

Further work

So your task now is to adjust the input line, read in your data and explore the results in the leaflet map and GIS.

Try downloading all the layers in the gis_data folder and forming a desktop project in either Arc or QGIS. Because a desktop GIS has a point and click interface it is easier to explore further processing options on your own using the options in the menus. However you will find that step by step data processing in a desktop GIS tends to lead to many intermediate files and becomes rather tedious. Using a pre-built R script is much quicker.

Statistical analysis

The result of this data processing is a data frame with columns representing variables of interest. This is already loaded into memory in R, so the statistical analysis and the GIS data processing are linked together seamlessly when using R. We will start looking at statistics next week, but just to illustrate the idea, the following code will form a scatterplot of the number of pines against the distance to the nearest tree and fit a regression line through the points. As I made up the numbers at random I would not expect to find any relationship.

g0<-ggplot(sf_study_points,aes(y=npines, x=min_dist))
g0<-g0+geom_point()+geom_smooth(method="lm")
g0

Just as expected, there is no relationship between the two variables. You will see how to formally test for relationships such as this through fitting statistical models in the next few weeks.
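
As a small optional preview, a Poisson GLM is the kind of model that suits count data like this; the sketch below just shows the shape of the code rather than anything you need to run now.

m1<-glm(npines~min_dist, family=poisson, data=sf_study_points) ## Poisson GLM for the count data (preview only)
summary(m1) ## With these random data the min_dist coefficient should not be significant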