Introduction

Today we’ll start on the more quantitative part of the unit. The unit aims to equip you with some of the key skills needed to analyse the sort of data you are likely to collect for your dissertation and skills that are highly valued in the workplace and research. Linking GIS to data analysis more generally is an important element in this.

Anyone carrying out research is keenly aware of the importance of quantitative analysis. However, many students, and even some researchers, approach quantitative analysis with some trepidation. It is a vast and potentially complex field. The most common worry revolves around statistics. Frequently asked questions include “Which test should I choose for my data?” “How do I know that I have chosen the right test?” “What does the test tell me?” “How do I interpret the results?”

These are all important questions to ask. However, some of the worries about statistics may be the result of over-emphasising the word “test”. The most important question is in fact “how can I use all the data I have available to address a meaningful research question?” This may or may not involve applying statistical tests. It almost certainly will involve visualising your data, either in the form of maps, figures or both. It will always involve getting to know the nature of your data in depth and communicating your knowledge to other people in a direct and meaningful manner. Sometimes statistical tests and models are absolutely essential. In other cases they may not be useful at all. Sometimes you may be able to select an appropriate analysis quite easily from a set of standard, commonly applied methods. In other cases you may need to seek expert advice from an experienced researcher, or statistician, in order to extract the most from your data, avoid pitfalls and fully justify your conclusions. Sometimes a simple shaded map, or a few well chosen figures, may answer the research question directly. Sometimes that approach would just be dismissed as being “merely descriptive”, in which case you would be expected to have applied much more sophisticated inferential statistics. It all depends on the nature of the research and the nature of the data at hand.

In a single semester it is impossible to teach all the methods that may be needed in order to analyse a data set of the sort collected for a dissertation. Observational data sets are typically very esoteric in nature, and so tend to be challenging to analyse and may require quite advanced techniques. Careful thought and some creativity are usually needed to conduct any meaningful analysis. The more statistical methods that you know about, the more likely it will be that you will find an appropriate analytical tool to address your research question. Often you will need some help to find an appropriate method, but if you are aware of the concepts behind inferential statistics you can frame better questions.

What does quantitative data consist of?

This question appears at first sight rather trivial. Surely the key is in the name? Quantitative data consists of quantities, i.e. numbers, so we need to do some sort of maths to analyse them. However this simple definition hides many subtleties and potential pitfalls, as we will see. It is often far from clear what sort of variables we are dealing with and how they are defined.

Variables are the column headers in a correctly formatted data frame. There are fundamentally two sorts of entries that can be typed into the rows below the column headers: text and numbers. Let’s keep it all very simple for now.

If a column contains rows of text then the variable is categorical, so it will be considered to be a “factor”. If the column contains rows of numbers, then the variable is a numerical variable of some form. There are many variations on this basic theme to consider, but once you have gained some clarity about this basic concept, many elements of data analysis will fall into place.

So:

Factors have as many levels as there are unique entries in the data column. So if the header were “gender” we would have a categorical variable (factor) with two levels. If the header were “site” we would have as many levels as there are sites.
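
As a minimal sketch (the column names and values here are invented purely for illustration), this is how R treats a text column as a factor:

d_example<-data.frame(site=c("S1","S2","S1","S3"),height=c(1.2,3.4,2.1,0.8)) ## invented example data
d_example$site<-as.factor(d_example$site) ## the text column becomes a factor
levels(d_example$site)  ## "S1" "S2" "S3" - one level per unique entry
nlevels(d_example$site) ## 3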

We could go on to a more detailed breakdown of the differences between various forms of numerical variables, but we’ll leave it at that until later.

Data visualisation vs statistical analysis

Traditional statistical textbooks probably place too much emphasis on the statistics, and too little emphasis on producing good clear data visualisations. If a picture is worth a thousand words it is also worth at least a thousand confusing and incomprehensible statistics. Good figures are often the most informative and useful elements in any modern data analysis.

This is good news, as most people find figures much easier to understand than raw numbers. Once you become aware that a figure is not only easier to understand than a statistical test but that it is also often the best way to answer the key scientific question, some of the worry surrounding the issue of statistics can be removed.

A simple example

Let’s frame a very straightforward scientific question and try to answer it with simple data.

“How have global atmospheric CO2 concentrations changed since the start of the industrial revolution?”

Data available from here.

https://www.co2.earth/historical-co2-datasets

Graphical data analysis

Let’s read the data into R. It’s on the server in a folder beneath your working directory. This will read it.

d<-read.csv("/home/msc_scripts/data/co2.csv")

If you want to save the file in your home directory you could make a folder called “data” and run this code to save it.

write.csv(d,"data/co2.csv")
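
If the “data” folder does not already exist you can create it from R first. A minimal sketch:

if(!dir.exists("data")) dir.create("data") ## make the folder only if it is not already there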

Forming a plot in R

Throughout this course we will use the modern graphics facility in R provided by the ggplot2 library as much as possible. This provides a unified approach to forming figures that is slightly more difficult to learn than base graphics, but much more powerful.

These lines load the library and set up a theme for the plots. There are various themes that change the overall look of the plots. The default theme has a grey background, and it is often better to have a cleaner looking black and white theme.

library(ggplot2)
theme_set(theme_bw()) ## Set the theme to black and white

Now, to set up a plot we first define the aesthetics. The aesthetics refer to the way the variables are mapped onto the plotting scheme. We often have an x and y axis. This line sets these up.

g0<-ggplot(d,aes(x=year,y=co2))  ## Set up the base plot

Nothing seems to have happened. However the object is being built up ready for plotting. This can then be used in various ways.

Now we can use a “geometry” to form a plot with a line. geom_line() does this.

g0<-g0+geom_line()

When we type the name of the object it will now be plotted.

g0

However we have just the variable names along the axes. It might be better to be more explicit about what is being plotted.

g0<-g0+ylab("Atmospheric CO2 concentrations (ppm)")
g0<-g0+xlab("Year")
g0

There are a wide range of different geometries that can be used to add to the plot. I will provide example script lines for many of these that can be reused to make other plots. Say we need a reference vertical line in red to mark the start of the industrial revolution. We can add that using another line.

g0<- g0 + geom_vline(xintercept =1790,colour="red")
g0

This is not a very sophisticated figure, and a very similar figure could have been produced using Excel or similar software. However the nice thing about R is that the code lines set out everything needed to make the graph and they are reusable. If the initial setup line used a different set of data, the figure would be redrawn in the same way.
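
For example, here is a sketch of how the same data frame d could be reused with a different geometry (the object name g1 is just for illustration):

g1<-ggplot(d,aes(x=year,y=co2)) ## same data and aesthetics as before
g1<-g1+geom_point(size=0.5)     ## points instead of a line
g1+xlab("Year")+ylab("Atmospheric CO2 concentrations (ppm)")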

The point is that the figure alone does tell most of the story in this case. It is quite apparent that atmospheric CO2 concentrations have increased over time. The red line added to the figure represents the start of the industrial revolution, and we can see that a consistent increase in concentrations dates from around this time. We do not need to run any additional statistical tests in order to address the question using these particular data. All that is required is one clear figure, which conveys the message.

This is not to imply that atmospheric scientists do not use inferential statistics, nor that a complete, formal analysis of the evidence for increasing levels of CO2 in the atmosphere would not be based on some form of statistical modelling. However the raw numbers included in this particular data set are aggregated measures of global CO2 concentrations. They are not appropriate input to any form of inferential statistical test.

We use formal statistical inference when we do need to formally quantify uncertainty regarding the evidence we are using to address a scientific question. In this particular case there are many areas of genuine uncertainty, but we cannot address these through analysing the raw numbers as provided in this simple data set alone. Uncertainty arises due to the use of proxy measurements for historical CO2 levels rather than direct instrumented measurement. This is totally invisible in the data set. The data as given provides no measure of potential variability in CO2 levels from place to place. The data form a simple time series. Although the frequently used statistical techniques that you will learn more about on this course, such as linear regression, can sometimes be applied to time series, time series data points are not independent of each other. If CO2 concentrations are increasing year on year then the levels one year will depend on the level the previous year. So, the series of points lacks both replication and independence. Replication and independence are basic requirements for many statistical methods, although there are tools available to deal with some lack of independence. Data such as these are not suitable as input to standard statistical tests nor to conventional statistical models.

It is worth pointing this out right from the outset. Don’t jump to the use of inferential statistics until you need to use them. Similar situations often arise when data is collected for a dissertation but they may not be spotted as easily. “Which statistical test do I need to use to tell if there is a correlation between industrialisation and CO2 concentrations?” is totally the wrong question to ask of these data as they stand. Furthermore, even if it were possible to use a statistical test, setting up a null hypothesis of no difference between pre and post industrialisation atmospheric CO2 concentrations would make no sense, as there is a clearly understood and independently measurable process (CO2 emissions) that is responsible for the change.

A more complex time series

To illustrate the point further, look at this more detailed time series. It also consists of measurements of atmospheric CO2 concentrations. They were taken at the Mauna Loa observatory in Hawaii.

data(co2)
plot(co2)

You can easily see the general increasing trend. However there is also some consistent seasonal variability imposed on top of the trend. This arises because vegetation in the Northern Hemisphere takes up CO2 when it is growing in the spring and summer, and releases CO2 through decomposition of organic material during the winter. So we have a more complete data set and we may be able to answer more complex questions, such as how much interannual fluctuation exists in CO2 concentrations and whether these seasonal fluctuations have changed over time. These are not questions that involve purely random noise, so once again they are not suitable for simple, text book, introductory statistical tests.

Some high level code to break down the data

If this course were in fact aimed at teaching methods for analysing seasonal dependence in time series we would now go into great depth regarding a set of appropriate and specific analyses that could be applied to this sort of data. It is not, so we won’t do that at this point. However, just to illustrate the power that comes from being able to get the data into R, look at the ease with which we can run quite a sophisticated visual breakdown of the data. It simply needs a single line of R code.

plot(stl(co2,s.window=12))

We now have a complex figure that shows the raw data in the top panel, the fluctuating seasonal pattern in the second, the underlying trend in the third, and residual, random, “white noise” variability in the bottom panel. The only strictly stochastic (random) element in this time series is shown in the lower panel.

The analysis can be run with a range of different parameters and the resulting breakdown of the data can be fed into more detailed further analyses. We won’t go into any of that at this stage. However, imagine how difficult it would be to produce a similar figure in Excel. Knowledge of R allows you to apply sophisticated methods that have been developed by domain experts in a specific field very easily. All that is needed to get started is a reasonably good understanding of data formats and a basic understanding of some simple R syntax.
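
For instance, if you wanted to work with the components of the decomposition directly, a sketch along these lines would pull them out (the object name fit is just illustrative):

fit<-stl(co2,s.window=12) ## the same decomposition as above, stored as an object
head(fit$time.series)     ## columns: seasonal, trend and remainder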

Inferential statistics

So, what is the difference between inferential statistical analysis and non-inferential data analysis?

Fundamentally, all inferential statistics are about estimating some key parameters from data, rather than precisely measuring them. If we do have precise measures then we don’t need any inferential statistics at all. In the ideal case we could do away with all inferential statistics completely. There is nothing intrinsically scientific about using statistical analysis. Many physicists tend to avoid stats completely. However most ecological studies do need inferential statistics of some description simply because we don’t have precise measures and the systems we are looking at tend to be very variable. So they are a necessary evil.

To illustrate the concept let’s return to the Arne data set. You should have all the layers in your gis_data directory already, so we’ll load them up again, along with the study site polygon taken from the server.

library(raster)
library(sf)
library(mapview)
library(rpostgis)
mapviewOptions(basemaps =c("Esri.WorldImagery","OpenStreetMap","OpenTopoMap", "OpenStreetMap.BlackAndWhite", "OpenStreetMap.HOT"))
####################

aspect<-raster("gis_data/aspect.tif")
slope<-raster("gis_data/slope.tif")
dtm<-raster("gis_data/dtm.tif")
dsm<-raster("gis_data/dsm.tif")
chm<-raster("gis_data/chm.tif")
chm10<-raster("gis_data/chm10.tif")
sol<-raster("gis_data/sol.tif")
twi<-raster("gis_data/twi.tif")
chm[is.na(chm)]<-0 ## Need to do this to avoid errors later as very low vegetation height values had been set to missing. Change them to zeros instead.
chm10[is.na(chm10)]<-0


conn <- dbConnect("PostgreSQL", host = "postgis", dbname = "bu" ,
    user = "msc_student", password = "msc123")
query<-"select * from arne.study_area"
study_area<-st_read_db(conn, query=query)

### Also bring in the OSM paths vector layer for the study area
query<-"select osm_id, st_intersection(geom,way) geom from 
arne.study_area a,
dorset_line d
where st_intersects(geom,way)"
paths<-st_read_db(conn, query=query)
study_area <- st_transform(study_area, 27700)
paths <- st_transform(paths, 27700)
sp_study<-as(study_area, "Spatial") ## Coerce to sp class
dbDisconnect(conn)
## [1] TRUE

Mean elevation in the study area

OK. So what if we want to find the mean elevation, as measured by the digital terrain model, of our study site?

We could do it this way in R. Crop the dtm to the study site and mask out all the values outside the study area.

dtm<-crop(dtm,sp_study)
dtm<-mask(dtm,sp_study)

Let’s look at this visually.

mapview(dtm,col.regions=terrain.colors(1000))

This gives a summary of the raster layer.

summary(dtm)
##               dtm
## Min.        0.102
## 1st Qu.     4.715
## Median      9.122
## 3rd Qu.    13.675
## Max.       27.605
## NA's    32944.000

Now if we want the mean elevation of all the pixels we ask R to calculate it after removing the NAs.

pmean<-mean(values(dtm),na.rm=TRUE)
pmean
## [1] 9.741

Is this a fair and accurate measure of the mean elevation? Are there any reasons to doubt the reliability of the number? Well, there may be intrinsic errors in the actual DTM that arise through the way the Lidar data has been processed. There may be systematic bias in the value, if for example the vertical datum was not correct. In this case every value would be shifted up or down. However these are not statistical issues. We can’t produce a confidence interval for the mean based on these considerations. There is only one value we could possibly calculate from all the pixels within the study area. So there is no reason to use statistical inference.

We’ll call the mean we have extracted the population mean, as it is the mean for the total population of pixels in the study area.

Taking a sample

In the case of elevation we are lucky to have lidar data that covers the whole area. However, if we were measuring some quantity from quadrats placed within the study area we would be taking a sample of points. We wouldn’t have all the values in that case. So the following analysis is just for didactic purposes in the case of the DTM, but it could be applied to measurements taken only on a sample of quadrats on the ground, where we don’t really know the population mean.

So, let’s pretend we don’t know the true mean elevation and see what we get if we take the values from a small sample of points.

rand_points<-spsample(sp_study, n=10, type='random')
plot(sp_study,axes=TRUE)
plot(rand_points,pch=21,bg=2,add=TRUE)

These are our sample values.

sample<-raster::extract(dtm,rand_points)
sample
##  [1] 11.372 16.285 15.068  8.160  4.080 15.160  1.842  7.757 11.360 12.540

This is the mean calculated from them.

smean<-mean(sample,na.rm=TRUE)
smean
## [1] 10.3624

It’s not the same as the “true” mean for the population of all possible two meter raster cells within the study area. What happens if we take a different set of points? I’ll wrap all the steps up as a single function.

f<-function(n=10){
  rand_points<-spsample(sp_study, n=n, type='random')
  sample<-raster::extract(dtm,rand_points)
  mean(sample,na.rm=TRUE)
}

Now if I call the function I’ll be running the same analysis, but with a different set of random points, and get a new mean.

f()
## [1] 12.3026

We could try doing this one hundred times. It’s very easy in R. In real life it would take a long time!

sample_means<-replicate(100,f())

mean(sample_means)
## [1] 9.787765

So the mean of the means is very close to the population mean. As we would expect, as it’s based on a whole lot of measurements. The interesting element of this is that we can look at the distribution of the means that we obtain from this sampling experiment.

g0<-ggplot(,aes(x=sample_means))
g0<-g0+geom_histogram(binwidth = 1,fill="grey",col=1)
g0<-g0+geom_vline(xintercept =pmean,colour="red") +xlim(4,15)
g0

Notice that the means taken from the rather small samples of ten points do tend to stack up around the “true” population mean and form an approximately normal (Gaussian) distribution around it.

What would happen if we took a larger sample of random points?

sample_means2<-replicate(100,f(n=100))
g1<-ggplot(,aes(x=sample_means2))
g1<-g1+geom_histogram(binwidth = 0.5,fill="grey",col=1)
g1<-g1+geom_vline(xintercept =pmean,colour="red") +xlim(4,15)

g1

We get a narrower distribution around the true mean.

This is the underlying basis of sampling theory. The larger the sample we take, the better the estimate of the true mean should be.

We could look at where 95% of the sample means lie.

q1<-quantile(sample_means,c(0.025,0.975))
q1
##      2.5%     97.5% 
##  5.918093 13.493750
q2<-quantile(sample_means2,c(0.025,0.975))
q2
##      2.5%     97.5% 
##  8.787588 10.696391
g0+geom_vline(xintercept =q1,colour="green")

g1+geom_vline(xintercept =q2,colour="green")

So each sample provides an estimate of the true mean, and the range of estimates narrows as the sample size increases. However in real life we have a problem. We don’t actually know what the true mean is. That’s why we’re sampling in the first place! In addition we don’t repeat the random sampling exercise hundreds of times. We just have one random sample. How do we estimate the true mean from that?

The traditional way of finding a confidence interval

The conventional way of extracting a confidence interval from a sample involves classical statistical inference. The first logical step is to assume that the variability between the measurements in the random sample provides an estimate of the variability in the population of measurements from which the sample was taken.

Let’s get another sample, of 20 points this time.

rand_points<-spsample(sp_study, n=20, type='random')
sample<-raster::extract(dtm,rand_points)

Now to get a handle on the variability we calculate the mean and then look at the deviations from the mean by subtracting the mean from each observation.

n<-length(sample)
mn<-mean(sample)
deviations<-sample-mn

However there is a problem with working with the raw deviations. Some will be positive and others negative. If we add them all up we’ll just get zero. So we square them to make them all positive.

squared_deviations<-deviations^2
d<-data.frame(n,obs=sample,mean=mn,deviations,squared_deviations)
DT::datatable(d)

Now the sum of the squared deviations is a measure of variability.

ssq<-sum(squared_deviations)
ssq
## [1] 749.4306

However this will increase with the number of observations. So we’d better take that into account and calculate the mean of the squared deviations. But to account for the fact that we had to calculate the mean from the sample itself, we lose a degree of freedom. So the divisor is n-1 rather than n.

Variance

The mean square is also known as the variance.

msq<- ssq/(n-1)
msq
## [1] 39.44372
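
As a quick check, R’s built-in var() function carries out exactly this calculation, so (assuming the same sample vector is still in the workspace) it should return the same value:

var(sample) ## should match msq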

Standard deviation

One problem with the mean squared deviation (variance) is that we squared all the deviations in the first place, so it is on the wrong scale. Let’s take the square root then.

sqrt(msq)
## [1] 6.280423

So we’ve finally got there. It’s easier just to ask R for the standard deviation, as this is the quantity that we have calculated step by step.

sd(sample)
## [1] 6.280423

OK so we’ve got the standard deviation of our sample. This is a handle on how much variability we have around the mean of our sample. We would expect a rather similar value if we threw this sample away and pulled out another one, as it is an estimate of the variability in the population of pixels itself. If the standard deviation taken from our twenty measurements is low, then we would expect the next set of measurements to also vary less around the mean than if it was high. What we really need to do is to get a measure of how much we expect the means to vary.

So we want something we could perhaps call the standard deviation of the sampled means? Or maybe call it the standard error of the mean?

It turns out that the standard error is just the standard deviation divided by the square root of the sample size.

se<- sd(sample)/sqrt(n)
se
## [1] 1.404345

There is another back story going on here, that I won’t go into in any depth, as R calculations will handle it all for you. However think carefully. The standard deviation derived from the sample is itself an estimate of the true population standard deviation and so won’t exactly match it. As the sample size falls this estimate in itself becomes more unreliable. The way to compensate for this involves calculating the t statistic (hence t-tests). However this will always be taken into account for you by the software. The key take home is that the 95% confidence interval for the mean is approximately 2 standard errors around the sample mean (1.96 for large samples, rising to around 2.5 or more for very small samples).
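
If you want to see the exact multiplier for a given sample size, qt() returns it. A sketch, using the sample size n defined above:

qt(0.975,df=n-1) ## about 2.09 for n = 20, approaching 1.96 as the sample size grows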

In practical terms what we’ve got from all this is a measure of the precision of our estimate. I’ll express it in plain informal English.

The true population mean will fall within 2 standard errors of the sample mean in 95 out of 100 samples

Let’s see if it works here.

Upper limit

mean(sample) + 2* se
## [1] 11.92714

Lower limit

mean(sample) - 2 * se
## [1] 6.309759

True mean

pmean
## [1] 9.741

Yippee! The mean is within the confidence interval. And it should be 95% of the time.

The R code to find a formal confidence interval in one go is like this. The syntax will make sense later on when we look at statistical models.

confint(lm(sample~1))
##                2.5 %   97.5 %
## (Intercept) 6.179122 12.05778

Notice that this is slightly wider than two standard errors. This is because the small sample correction that was mentioned has been applied. We should use this correction for formal inference, but remember the rule of thumb regarding the interpretation of the standard error of the mean when you see the value used in articles.

The assumption of normality

The simplest form of classical inference used to find confidence intervals and test null hypotheses assumes that the population of observations forms a normal distribution. Under a normal distribution roughly 95% of the observations fall within 2 standard deviations of the mean.

Let’s look at the true population distribution by making a histogram of all the pixel values.

g0<-ggplot(,aes(x= values(dtm)))
g0<-g0+geom_histogram(binwidth = 1,fill="grey",col=1)
g0

That does not look like a smooth Gaussian distribution. We would not expect it to either, given that the distribution of elevation values from a terrain model depends on the type of terrain, and there is no reason to think it will follow a normal distribution. However the histograms of the sample means looked much closer to being normally distributed. This is an example of the central limit theorem at work. If you take samples from almost any distribution of values and calculate the mean of each sample, then the sample means will tend to follow a normal distribution, even when the values in the underlying population do not.
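
To convince yourself of this, here is a small sketch using simulated values from a deliberately skewed exponential distribution (nothing to do with the elevation data):

skewed<-rexp(10000,rate=1)                         ## a clearly non-normal population of values
clt_means<-replicate(1000,mean(sample(skewed,30))) ## means of repeated samples of 30 values
hist(clt_means)                                    ## roughly bell shaped despite the skewed source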

Does the lack of normality in the underlying population invalidate the confidence interval? That’s a rather difficult question to answer. Technically it does. The reason for this is that the sample standard deviation that is used in the calculations will be more variable than it should be. However some authors argue that when the sample size is large enough we can ignore this effect as we still have normality of the sampled means through the central limit theorem. We will return to the question of critical evaluation of statistical assumptions many times in the course. There are always solutions to problems, which is why there are so many different types of statistical analysis to choose from. This is one of the challenges faced by all data analysts.

An alternative to the traditional calculations

The theory that allows confidence intervals for the population mean to be calculated from sample values was worked out long before the advent of computers. We still use it in many statistical models by default. However there is another way of thinking about the problem that is useful and comes into play in many modern methods.

If we have a vector of measurements that form a sample from a population of potential measurements we can look at the problem of making inference this way.

  • We can’t go out and get another set of measurements because it’s time consuming.
  • However we can use the measurements we have got to form a new sample.
  • We can do that by randomly sampling from them, instead of randomly sampling from the whole population.

This is the basis of a technique called bootstrapping. There are many variations on the theme, but bootstrap techniques are based on some sort of resampling from a sample (pulling yourself up by your own bootstraps rather than looking beyond for help).

One of the advantages of bootstrapping is that it can be used to find confidence intervals for almost any statistic that can be calculated on a sample without making assumptions regarding the distribution.

bootstrap_ci<-function(x)
{
f<-function(x)mean(sample(x,length(x),replace=TRUE)) # resample with replacement
boot_strap<-replicate(1000,f(x)) # Repeat 1000 times
q1<-quantile(boot_strap,c(0.025,0.975)) ## Find the quantiles
q1     ## Return them
}

bootstrap_ci(sample)
##      2.5%     97.5% 
##  6.446829 11.870680

Compare this with the model confidence interval that assumes normality.

confint(lm(sample~1))
##                2.5 %   97.5 %
## (Intercept) 6.179122 12.05778

The two results are very similar, which suggests that there was nothing that much wrong with using the classic technique. Simulation experiments have shown that classical inference is actually rather robust to minor violations of the assumption of normality. There are more serious problems than that.

So, statistical inference essentially is all about confidence. We don’t want to be wrong, so instead of stating that the mean elevation of the study site is 9.741 precisely we give an interval in which we think the true value could lie 95% of the time. We don’t really need to be overly precise about this interval, as different methods will give slightly different results. Based on a single sample of 20 measurements we should be comfortable with an estimate of mean elevation of the study area as being between 6 and 12 meters above sea level, as there is only a 1 in 20 chance that we would be wrong about this. If we take more measurements our confidence interval narrows.

Now, in the context of your particular study, you should be able to estimate the mean density of pine trees within the study area using the same technique. Not only that, but if you know the total area of the study site you can convert that into an estimate of the upper and lower number of pines that would need to be pulled to clean the area.
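
A minimal sketch of that calculation, using invented quadrat counts and an invented total area (none of these numbers come from the real survey):

pine_counts<-c(3,0,5,2,4,1,0,6,2,3)  ## pines counted per quadrat (invented values)
quadrat_area<-4                      ## quadrat area in square meters (invented)
total_area<-200000                   ## total study area in square meters (invented)
ci_density<-confint(lm(pine_counts/quadrat_area~1)) ## confidence interval for density per square meter
ci_density*total_area                ## rough lower and upper limits for the total number of pines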

The logic behind random sampling

The mathematics behind these forms of inference will work provided every pixel on the map has an equal chance of falling into the sample. If you only took measurements at the lowest point of the study site you clearly would not have an unbiased sample. Purely random sampling is desirable for formal inference, but not always possible in real life. The real necessity is for a representative sample that takes into account the variability in the population. Other sampling schemes can have this property. There are in fact many additional issues, involving spatial dependence between observations, that we will look at in the more advanced sections of the course.
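
For example, spsample() supports several schemes besides purely random points. A sketch (not run above):

reg_points<-spsample(sp_study, n=50, type='regular')      ## points on a regular grid
strat_points<-spsample(sp_study, n=50, type='stratified') ## one random point within each grid cell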

Finding significant differences

The next step along the road to more sophisticated statistical modelling is to recognise that if we can draw inferences on single quantities we can also use the same mechanisms to draw inferences about differences between quantities measured on groups of sampled observations.

Let’s test a very simple hypothesis (which could be easily confirmed using all the data). We’ll take 50 observations on elevation from a 20 meter buffer around the paths in the study site. Let’s set that up using the GIS capabilities of R.

p20<-st_union(st_buffer(paths,20))
p20<-st_intersection(p20, study_area)
plot(p20,axes=TRUE)
plot(sp_study, add=TRUE)

sp_p20<-as(p20, "Spatial") ## Coerce to sp class

OK, so we’ll compare the elevations of 50 points within the buffer with 50 random points in the study area as a whole, to see if the area around the paths differs from the area as a whole. Again, in the case of the elevation we could just calculate this from all the pixels, but this is for didactic purposes.

rand_points<-spsample(sp_study, n=50, type='random')
sample1<-raster::extract(dtm,rand_points)

rand_points<-spsample(sp_p20, n=50, type='random')
sample2<-raster::extract(dtm,rand_points)

Sample from entire area

Summary

summary(sample1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.675   5.989  10.053  10.212  13.056  26.610

Confidence interval

confint(lm(sample1~1))
##                2.5 %   97.5 %
## (Intercept) 8.448172 11.97679

Sample from buffer around paths

Summary

summary(sample2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.092   8.176  11.668  11.850  15.086  26.640

Confidence interval

confint(lm(sample2~1))
##                2.5 %   97.5 %
## (Intercept) 10.22614 13.47478

Confidence interval plot

We can plot the confidence intervals for each sample side by side. To do this we need to stack the results of our sampling into one data frame.

d1<-data.frame(area="site",dtm=sample1)
d2<-data.frame(area="path_buffer20",dtm=sample2)
d<-rbind(d1,d2)

Now plot the results.

g0<-ggplot(d,aes(x=area,y=dtm))
g0<-g0+stat_summary(fun.y=mean,geom="point")
g0 + stat_summary(fun.data=mean_cl_normal,geom="errorbar")

The interpretation is that if there is clear space between the confidence intervals then we will always find a significant difference. If there is a slight overlap we may need to test for a difference. If there is a large overlap, as here, the test will not be significant at the 0.05 level.

Running a simple t-test

To test the null hypothesis of no differences between the means we can use a simple t-test.

t.test(d$dtm~d$area)
## 
##  Welch Two Sample t-test
## 
## data:  d$dtm by d$area
## t = -1.3726, df = 97.338, p-value = 0.173
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.0063851  0.7304251
## sample estimates:
##          mean in group site mean in group path_buffer20 
##                    10.21248                    11.85046

Or we can use one-way ANOVA, which you will see in more detail next week. The result should be the same.

anova(lm(d$dtm~d$area))
## Analysis of Variance Table
## 
## Response: d$dtm
##           Df Sum Sq Mean Sq F value Pr(>F)
## d$area     1   67.1  67.074  1.8839  0.173
## Residuals 98 3489.1  35.603

Interpretation of a p-value

You may have been told in the past that a null hypothesis test is a fundamental tool for scientific inference. You may even have set up null hypotheses in the form of a scientific hypothesis for testing. In certain situations this can make sense, but these situations tend to occur in the context of planned experiments. In an observational setting you need to be very careful when interpreting a null hypothesis test and avoid the pitfall of equating it with a genuine scientific hypothesis. A p-value obtained from a t-test is the result of looking at the distance between the two means after dividing by the pooled standard errors in order to calculate the t-statistic. If the absolute value of t is larger than about 2 (again there is a correction for degrees of freedom, i.e. sample size) then the p-value will be below 0.05. Here it is -1.37 so the test is not significant. However this does not imply that there really is no difference between the two population means. Almost all things we measure do differ. It just implies that there is insufficient evidence derived from the data we obtained to determine that there is a difference. Confidence intervals tend to be much more useful generally than p-values as they make a statement about the size of the difference.
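
Just to show where the number comes from, this sketch recalculates the Welch t statistic by hand (assuming sample1 and sample2 are still in the workspace):

t_manual<-(mean(sample1)-mean(sample2))/sqrt(var(sample1)/length(sample1)+var(sample2)/length(sample2))
t_manual ## should match the t value reported by t.test() above, possibly up to sign depending on group order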

Let’s get the confidence interval from the ANOVA (again, more detail on this next week).

confint(lm(d$dtm~d$area))
##                          2.5 %    97.5 %
## (Intercept)          8.5379071 11.887053
## d$areapath_buffer20 -0.7302237  4.006184

We can see that the confidence interval includes zero, so we can’t determine a difference from these samples.

Conclusion

In this class you have seen some of the fundamental ideas behind statistical analysis. They can be summed up as follows:

  1. Don’t use inferential methods unless they are necessary. Not all data analysis needs them.
  2. Don’t infer quantities if you actually do have a reliable measurement of them.
  3. Don’t become too obsessed about the assumption of normality. Even when it is in fact important, other methods exist to draw parametric inference from data that don’t comply with the assumption. Non-parametric methods are rarely useful.
  4. Completely random sampling is often the best way to ensure a representative sample, but may not always be possible. Think carefully about the consequences of your sampling scheme, particularly in an observational setting.
  5. Don’t test meaningless null hypotheses just because it sounds scientific. It often isn’t.
  6. Do be aware that scientific inference concerns the strength of evidence. Large samples allow more reliable inferences to be drawn than small samples.
  7. Do use confidence intervals whenever possible, rather than raw p-values.

A reminder on data formatting

Remember the rules for data formatting.

  1. Make sure that the column headers consist of only one word. No spaces or unusual characters. Make the column headers short and informative. Don’t include units or other descriptors. You can annotate plots later. If you don’t follow this rule the file may not load. Be especially careful to avoid having “loose” observations or values typed into cells to the right of the headers.
  2. Make sure that columns intended to hold numerical values ONLY hold numerical values. No letters or annotations. If you don’t follow this rule you will find that the columns have been interpreted as factors (i.e. grouping variables) rather than measurements. A quick way to check how R has interpreted each column is sketched after this list.
  3. Make sure that there are no missing values in the file. If you can’t avoid this, use the two letters NA to represent missing values. There is no guarantee that an analysis will run correctly with missing values and they should be avoided. A well conducted study does not have missing values. You can sometimes avoid problems by “imputing” a likely value for a missing value. Make sure you document this.
  4. Use one column per variable. Do not spread measurements of the same type over several columns.
  5. If you do want a column to be interpreted as a factor (categorical variable) make sure that you use some letters in the values. If you only use numbers it will be assumed to be numeric. So use something like Site1 or S1, rather than 1, 2, 3, if you want to group data by site.
  6. You must not include anything other than raw data in the sheet to be converted into CSV format. No calculated subtotals. No figures. No spaces between rows.
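
After loading a file it is always worth checking how R has interpreted each column. A quick sketch (the file name here is just an example):

d_check<-read.csv("data/my_survey.csv") ## example file name
str(d_check)     ## check which columns are factors and which are numeric
summary(d_check) ## spot unexpected NA values or text that has crept into numeric columns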