#LyX 2.0 created this file. For more info see http://www.lyx.org/ \lyxformat 413 \begin_document \begin_header \textclass literate-article \begin_preamble \date{} \usepackage{ragged2e} \RaggedRight \usepackage{lmodern} \end_preamble \use_default_options false \maintain_unincluded_children false \language british \language_package default \inputencoding utf8 \fontencoding global \font_roman times \font_sans lmss \font_typewriter cmtt \font_default_family rmdefault \use_non_tex_fonts false \font_sc false \font_osf false \font_sf_scale 100 \font_tt_scale 100 \graphics default \default_output_format default \output_sync 0 \bibtex_command default \index_command default \float_placement H \paperfontsize default \spacing single \use_hyperref true \pdf_bookmarks true \pdf_bookmarksnumbered false \pdf_bookmarksopen false \pdf_bookmarksopenlevel 1 \pdf_breaklinks true \pdf_pdfborder true \pdf_colorlinks true \pdf_backref false \pdf_pdfusetitle true \papersize default \use_geometry true \use_amsmath 1 \use_esint 1 \use_mhchem 1 \use_mathdots 1 \cite_engine natbib_authoryear \use_bibtopic false \use_indices false \paperorientation portrait \suppress_date false \use_refstyle 0 \index Index \shortcut idx \color #008000 \end_index \leftmargin 2cm \topmargin 3cm \rightmargin 2cm \bottommargin 2cm \headheight 1cm \headsep 1cm \footskip 1cm \secnumdepth 0 \tocdepth 3 \paragraph_separation skip \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes false \html_math_output 0 \html_css_as_file 0 \html_be_strict false \end_header \begin_body \begin_layout Title Quantitative Ecology \end_layout \begin_layout Standard \begin_inset CommandInset toc LatexCommand tableofcontents \end_inset \end_layout \begin_layout Section Capturing, storing and loading data \end_layout \begin_layout Standard One of the commonest questions regarding data analysis does not concern the analytical methods at all. It is simply ... 
\begin_inset Quotes eld \end_inset How do I get my data into the statistical software that I want to use? \begin_inset Quotes erd \end_inset The answer to the question involves taking a step back from the software and thinking carefully about the structure of data. If the data are entered and stored in a standard, conventional format then importing them into R or SPSS will always be straightforward. If data have been captured in an \begin_inset Quotes eld \end_inset \emph on ad hoc \emph default \begin_inset Quotes erd \end_inset manner then the transfer could be time-consuming and messy. Keeping your data in a format that can be easily explained to an experienced data analyst allows you to seek help from them if it is needed. Of all the skills that you learn on this course, this may prove to be the most useful. \end_layout \begin_layout Subsection Data management: Databases and spreadsheets \end_layout \begin_layout Standard The best tool for capturing and storing data is a relational database. However learning to use a database can take some time. Setting up a full relational database would be overkill for most student research projects. Spreadsheets of some form are thus the most commonly used software for data capture and storage. \end_layout \begin_layout Standard Although spreadsheets have some features of databases, they lack many of the more powerful features. Before entering data into a database the user has to decide on the structure it will take and define the variables that are going to be included. Databases thus enforce correct data storage. In contrast, when using a spreadsheet the user is given great flexibility to arrange the data in any manner. Although this seems to be an advantage, in fact there are many pitfalls. Most of the problems that arise when transferring data from spreadsheets to statistical packages are the result of not having planned the data structure before the numbers are captured.
Statistical software \series bold can \series default read data directly from spreadsheets. However statistical software \series bold cannot \series default read data from spreadsheets if the raw data have not been stored in the correct manner. \end_layout \begin_layout Standard As a result, some data analysts advise that spreadsheets should never be used for scientific research. However spreadsheets are potentially useful tools. Many researchers use them on a daily basis. They allow the data to be quickly \begin_inset Quotes eld \end_inset eye-balled \begin_inset Quotes erd \end_inset and edited by hand. If you learn how to use pivot tables and other advanced features of spreadsheets, they are also quite powerful tools for processing data. However all the common data manipulation techniques carried out in spreadsheets can be achieved easily in R. Data manipulation in R can appear difficult at first, but is much more powerful. Furthermore, provided the data that you collect and store have a conventional structure, you can use solutions to common problems that have been found by others. \end_layout \begin_layout Subsubsection Raw data vs processed data \end_layout \begin_layout Standard Data is like food. When it is provided in a raw state it is possible to process it and combine it in many forms. However once either food or data have been prepared it can be impossible to undo the process. Raw data is therefore always more valuable than data that have been cooked up for a particular form of analysis. You should always keep a copy of your data in the simplest \begin_inset Quotes eld \end_inset raw \begin_inset Quotes erd \end_inset form, even if you also have many versions that you can pull out that have been \begin_inset Quotes eld \end_inset prepared earlier \begin_inset Quotes erd \end_inset and are more easily digested. Databases prevent data manipulation being confused with data storage.
Unfortunately spreadsheets encourage loose data structures and allow the user to mix analyses directly with the raw data. If these manipulations actually break up the raw data it can be very difficult even for a trained analyst to start again from scratch. \end_layout \begin_layout Standard The most common errors involve breaking three very simple rules for maintaining data in a raw state. \end_layout \begin_layout Enumerate All the values in any column must be the result of exactly the same measurement process. \end_layout \begin_layout Enumerate Aggregated data must not be mixed with raw data. \end_layout \begin_layout Enumerate There should be one (and only one) column for each measured variable. \end_layout \begin_layout Standard If you break these rules, this may occur. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status collapsed \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/xlproblem.jpg} \end_layout \end_inset \end_layout \begin_layout Standard Let's assume that very few students would do this! However there are subtle errors that can be almost as bad. \end_layout \begin_layout Standard Imagine that we weighed twenty animals. Ten were female and ten were male. Based on the common practice of calculating marginal (row or column) totals in spreadsheets we might decide that the table below is a sensible way to provide the data. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status collapsed \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/xl1.png} \end_layout \end_inset \end_layout \begin_layout Standard This does not appear particularly wrong. However there are problems. This is the sort of data storage that arises as a result of the spreadsheet approach to data. Spreadsheets were developed originally for accountants.
Accountants always think in terms of \begin_inset Quotes eld \end_inset the bottom line \begin_inset Quotes erd \end_inset . The researcher has taken advantage of the bottom line (margin) in order to begin analysing the data. The mean and standard deviation are calculated for each group. This initially seems useful. However adding a bottom line breaks rule two. Aggregated data has now become mixed with raw data. By inserting calculations into the lower margin, rule one has also been broken. Not all the values in the column are based on the same form of measurement, as the mean and standard deviation are derived from ten animals rather than one. \end_layout \begin_layout Standard Rule three is also broken in this example. Here we have two columns for one variable. The single variable in question is weight. Note that the row name labels are arbitrary. They do not refer to a single animal. An animal cannot be both male and female at once. These data can be imported into R or SPSS, but the bottom two lines would have to be removed first and the data further manipulated before use. A much better way to capture and store the data is shown below. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status collapsed \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/xl2.png} \end_layout \end_inset \end_layout \begin_layout Standard There are some important points to notice. The correct form for the raw data uses a categorical variable (known as a factor in R) to determine the groups. This may make manipulation in a spreadsheet slightly more difficult, but it fits the logic used in statistical analysis. The animal's gender is in fact a form of measurement made on the individual. So, it is a categorical variable that can take two states (male or female).
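Once the data are held in this raw, long format, the group summaries that the spreadsheet margins were providing can be computed on demand, keeping aggregation out of the raw table entirely. A minimal sketch using tapply(), with invented weights:

```r
# Sketch: compute group summaries from raw long-format data instead of
# storing them in the spreadsheet margin (the weights here are invented)
d <- data.frame(Gender = rep(c("Male", "Female"), each = 10),
                Weight = c(101, 99, 103, 98, 100, 102, 97, 104, 99, 101,
                           95, 96, 94, 97, 95, 93, 96, 98, 94, 95))
tapply(d$Weight, d$Gender, mean)  # group means, computed on demand
tapply(d$Weight, d$Gender, sd)    # group standard deviations
```

Because the summaries are recomputed each time, they can never contaminate the raw data.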
As we have seen, we may sometimes use categorical variables in statistical analyses such as chi-squared tests of association, so they should be held in a column rather than being used in the header. We have also included an ID column to identify the individual from which the measurements were taken. In this case the animals are numbered from 1 to 20. This potentially allows us to link the measurements to others that were taken on the same animal. \end_layout \begin_layout Standard If we were capturing data in a database we would have to define the sort of data we could enter before we started entering data. The gender column would hold characters, the weight column floating point numbers and the ID column integers. We would not be able to set up the \begin_inset Quotes eld \end_inset wrong \begin_inset Quotes erd \end_inset form for the data. Notice that once we have the right structure there is no limit to the amount of data that could be entered. We could hold measurements on an unlimited number of animals just by entering the gender and weight for each. We can also easily extend the structure to include other measurements such as length etc. simply by adding more columns. Each measurement is a variable and each row refers to a single animal. \end_layout \begin_layout Subsubsection Data frames \end_layout \begin_layout Standard The data format we have just seen is known as a data frame in R. You will already have imported and used data in this form. The R command str provides details of the structure of the data. If we have the data shown above in a data frame called \begin_inset Quotes eld \end_inset d \begin_inset Quotes erd \end_inset then the following information is provided.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<echo=FALSE>>= \end_layout \begin_layout Plain Layout options(prompt=" ",continue=" ") \end_layout \begin_layout Plain Layout set.seed(1) \end_layout \begin_layout Plain Layout ID<-1:20 \end_layout \begin_layout Plain Layout Weight<-round(rnorm(20,100,3),1) \end_layout \begin_layout Plain Layout Gender<-rep(c("Male","Female"),each=10) \end_layout \begin_layout Plain Layout d<-data.frame(ID,Gender,Weight) \end_layout \begin_layout Plain Layout rm(list=c("Weight","Gender","ID")) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout str(d) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard When data is imported into R any column that holds text is automatically interpreted as a factor. Columns that hold only whole numbers are considered to be integer variables. Columns with decimals are numeric variables. When you import data from a spreadsheet into SPSS then columns with text should be of \begin_inset Quotes eld \end_inset type \begin_inset Quotes erd \end_inset string and \begin_inset Quotes eld \end_inset measure \begin_inset Quotes erd \end_inset nominal. Numeric variables should be of \begin_inset Quotes eld \end_inset type \begin_inset Quotes erd \end_inset numeric and \begin_inset Quotes eld \end_inset measure \begin_inset Quotes erd \end_inset scale. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status collapsed \begin_layout Plain Layout \backslash setkeys{Gin}{width=6in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/spss.png} \end_layout \end_inset \end_layout \begin_layout Subsection Data documentation \end_layout \begin_layout Standard You might have noticed that in the example the units of measurement (g) have been dropped from the column header. This makes data import and manipulation easier.
However the information itself (measurement units) is extremely important and it must be recorded along with the data. If you capture your data directly in SPSS the program itself encourages you to provide documentation for each variable. You would also usually do this if you design a data base to hold your data. In a spreadsheet documentation is often placed on a second line, below the name of the variable, sometimes in the form of quite lengthy text. This makes it very difficult to import the data into R or SPSS cleanly. So if you wish to capture data in a spreadsheet and then move them easily and quickly into R you should ensure the following. \end_layout \begin_layout Enumerate Use one column per measurement type. \end_layout \begin_layout Enumerate Include columns for all the categorical (grouping) factors. \end_layout \begin_layout Enumerate Aim to use a single (short) word for column headers. Do not include spaces, accents or non standard \begin_inset Quotes eld \end_inset foreign \begin_inset Quotes erd \end_inset characters. Underscores may be used if necessary. \end_layout \begin_layout Enumerate Do not add any calculations at all to the raw data sheet. They can be done on a separate linked sheet if you wish. \end_layout \begin_layout Enumerate Ensure that all cells with missing data hold the letters \begin_inset Quotes eld \end_inset NA \begin_inset Quotes erd \end_inset , standing for Not Available in R. This may require a search and replace. Avoid using numeric codes such as -9999 for missing data or leaving cells blank. \end_layout \begin_layout Enumerate Store detailed meta data (units of measurements, procedure used to take the measurements etc) on a separate sheet or in a similarly named text document. \end_layout \begin_layout Enumerate Make sure that the link between the data and meta data is very clear. Provide the raw data files and the clearly linked meta data to anyone working on the data. 
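Point five can be sketched in R: read.csv() converts any listed missing-value codes to proper NA values at import time through its na.strings argument. The file contents below are invented for illustration:

```r
# Sketch: a small CSV in which missing weights were coded as -9999 or left blank
csv <- tempfile(fileext = ".csv")
writeLines(c("ID,Gender,Weight",
             "1,Male,101.2",
             "2,Female,-9999",
             "3,Male,"), csv)

# na.strings converts the listed codes (and empty cells) to NA on import
d <- read.csv(csv, na.strings = c("NA", "", "-9999"))
sum(is.na(d$Weight))
```

Cleaning the codes at import time like this is safer than editing the raw file by hand.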
\end_layout \begin_layout Subsubsection Moving data from a spreadsheet to R \end_layout \begin_layout Standard If you have set up your data table according to the guidance above, it is easy to export it into either R or SPSS. The raw data table should simply be saved as a plain ASCII text file first. Comma Separated Values (CSV) format is very commonly used, but you may use tabs to separate columns. This can be done from any spreadsheet program. You can only save one sheet at a time, but this is not a problem as each data frame should be held as a separate file anyway. \end_layout \begin_layout Subsubsection Importing data into R \end_layout \begin_layout Standard We will now look at how to get the data into R. The concept of the working directory is very important in R. Paths to files are relative to this working directory. You can find your working directory using this command. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout getwd() \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Using the Windows GUI you can change the working directory using the \begin_inset Quotes eld \end_inset change dir \begin_inset Quotes erd \end_inset command under the File menu. The command line version would use setwd(). \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout setwd("C:/Documents and settings/etc") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard When you are working in R you should usually first change the working directory to the folder where your data is stored. Try downloading the example weights data by clicking here \begin_inset CommandInset href LatexCommand href target "http://tinyurl.com/QEData/examples/weights.csv" \end_inset . Save it to a folder on the desktop or in your documents.
Then move to this directory in R. \end_layout \begin_layout Standard Now read in the data either using a command or the BiodiversityR interface. The latter can be opened by typing the following. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout library(BiodiversityR) \end_layout \begin_layout Plain Layout BiodiversityRGUI() \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Choose \begin_inset Quotes eld \end_inset import data \begin_inset Quotes erd \end_inset from the Data menu. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status collapsed \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/rcom.png} \end_layout \end_inset \end_layout \begin_layout Standard This presents you with a choice of options. In this case the file is delimited by commas. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=1.5in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/rcom1.png} \end_layout \end_inset \end_layout \begin_layout Standard The result is a line of R code that reads in the data and assigns it to the data frame named \begin_inset Quotes eld \end_inset d \begin_inset Quotes erd \end_inset . You can use this as a model for writing your own lines directly in the console, although the line suggested by BiodiversityR is often more \begin_inset Quotes eld \end_inset verbose \begin_inset Quotes erd \end_inset than is necessary.
\end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=3in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/rcom2.png} \end_layout \end_inset \end_layout \begin_layout Standard Notice that the menu on the native R console does not have options to open files, only to change the working directory. However if you have set the working directory to the place where the data is stored the file can then be loaded by simply typing either of the following lines. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout d<-read.csv("http://tinyurl.com/QEData/examples/weights.csv") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout d<-read.csv("weights.csv") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard This is quick enough. Try it, and check the properties of the data using str and fix. Then see what happens when you exit from R. You will be asked if you want to save the workspace image. If you answer yes then you will find two new files inside the working directory. One will be called .RData and the other .Rhistory. Notice that the dot represents a hidden file under Linux or Unix (where R first evolved). \end_layout \begin_layout Standard The next time you want to use R, instead of clicking on the R icon you can move to this directory and click on the .RData file. The R console will open and you will find that you have all the data you were working with at the moment of your last exit.
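The saving and restoring that happens automatically on exit can also be done by hand with save() and load(). A minimal sketch, using a temporary file rather than a .RData file in the working directory:

```r
# Create a small data frame, save it, remove it, then restore it from disk
d <- data.frame(ID = 1:3, Weight = c(101.2, 98.7, 100.4))
f <- tempfile(fileext = ".RData")
save(d, file = f)   # save.image() would do this for every object in the workspace
rm(d)
exists("d")         # FALSE: the object is gone
load(f)             # restores d into the workspace, as clicking a .RData file does
str(d)
```

Saving named objects explicitly like this is often safer than relying on the whole workspace image.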
\end_layout \begin_layout Standard You can get a list of the data objects available by typing ls(). \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout ls() \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The .Rhistory file is a text file with the list of commands you executed. This is very useful if you have forgotten what you were doing in R or have not built a script with the commands as you go along. \end_layout \begin_layout Subsection Data manipulation \end_layout \begin_layout Standard Although R code is shown throughout this course, in most cases a GUI alternative has also been provided that produces the same results using either SPSS or BiodiversityR. This is not really possible when looking at data manipulation. Although data can be manipulated and reformatted to some extent using SPSS, in fact very few people use SPSS for this. The menu is very difficult to follow, and the results are not always those that are expected. In reality most data manipulation is achieved through cutting and pasting in Excel. The problem with this is that while it can work effectively for small data sets, it can lead to a lot of wasted time when data sets become larger. It is well worth knowing a few of the ways in which R can help to automate data manipulation. \end_layout \begin_layout Subsubsection Unstacking \end_layout \begin_layout Standard We began this discussion of data formats by looking at the way the animal weights data might be held in a spreadsheet in a \begin_inset Quotes eld \end_inset wide \begin_inset Quotes erd \end_inset form consisting of one column for male animals and one for female. Although this is not a good way of storing data, it may sometimes be useful when actually working with the data. If we have our data in the conventional raw format it can easily be converted into this form by unstacking.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d2<-unstack(d,Weight~Gender) \end_layout \begin_layout Plain Layout d2 \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The formula Weight~Gender tells R that Weight is the measurement that we are interested in and Gender is the grouping factor. Notice that this is the same formula that we would use to form a grouped boxplot, and which we will later use for model building. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout <<fig=TRUE>>= \end_layout \begin_layout Plain Layout boxplot(Weight~Gender, data=d, col="grey") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Stacking \end_layout \begin_layout Standard To stack a data frame that is in a wide format, we write the following. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d3<-stack(d2) \end_layout \begin_layout Plain Layout d3 \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Notice that R cannot know what the measurements are, so R uses the headings \begin_inset Quotes eld \end_inset values \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset ind \begin_inset Quotes erd \end_inset (group indicator). We can rename the columns.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout names(d3)<-c("Weight","Gender") \end_layout \begin_layout Plain Layout d3 \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Using the clipboard \end_layout \begin_layout Standard You might wonder if data can be copied between a spreadsheet and R using the clipboard. This would be very convenient for quick analyses. The answer is yes, but you need to do some typing (or maybe use the BiodiversityR menu). Simply copying and pasting from Excel into the R console does not work. You would just paste the numbers into the console, which shows the output, rather than into the working memory. However if you have data held in the clipboard that has been copied from columns in Excel this line will read it in. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout d<-read.table(file="clipboard",header=T,sep=" \backslash t") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard And this line will write out a data frame that can be pasted into Excel. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout write.table(d,file="clipboard",row.names=F,sep=" \backslash t") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard To avoid the typing in the future you can define some simple functions that can be loaded when R starts.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Xin<-function()read.table(file="clipboard",header=T,sep=" \backslash t") \end_layout \begin_layout Plain Layout Xout<-function(d)write.table(d,file="clipboard",row.names=F,sep=" \backslash t") \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Then this reads data in. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout d<-Xin() \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard And this sends a data frame to the clipboard for pasting into Excel. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout Xout(d) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection More about data frames \end_layout \begin_layout Standard As mentioned previously, data frames are a central concept in R. It is essential that you become used to them. Think of a data frame as a single sheet in a spreadsheet and you won't go far wrong. You refer to each column using the name of the data frame with a $ followed by the name of the column (e.g. d$Weight). If you are too lazy to do this there is an alternative. You can \begin_inset Quotes eld \end_inset attach \begin_inset Quotes erd \end_inset the data frame. This simplifies the reference to the data frame.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout attach(d) \end_layout \begin_layout Plain Layout Weight \end_layout \begin_layout Plain Layout detach(d) \end_layout \begin_layout Plain Layout d$Weight \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard You can try a command such as the following. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=2in} \end_layout \begin_layout Plain Layout <<fig=TRUE>>= \end_layout \begin_layout Plain Layout boxplot(d$Weight~d$Gender,col="grey",boxwex=0.2) \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard Attaching the data frame simplifies matters a bit. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout attach(d) \end_layout \begin_layout Plain Layout boxplot(Weight~Gender,col="grey",boxwex=0.2) \end_layout \begin_layout Plain Layout detach(d) \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard If you forget to detach d you run the risk of confusion if you have another data frame with variables called Gender and Weight. It is often safer to use the d$Weight syntax. \end_layout \begin_layout Standard The ~ symbol is used in models to suggest \begin_inset Quotes eld \end_inset some variable is modelled as some other variable \begin_inset Quotes erd \end_inset . It can be used with data in this form to plot a boxplot of weight \begin_inset Quotes eld \end_inset modelled by \begin_inset Quotes erd \end_inset gender. \end_layout \begin_layout Standard R has a second set of graphics known as lattice. We will look at these in more detail in a later section. For the moment we can produce histograms using the following.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<eval=FALSE>>= \end_layout \begin_layout Plain Layout library(lattice) \end_layout \begin_layout Plain Layout histogram(~Weight|Gender,data=d,col="grey",layout=c(1,2)) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<echo=FALSE,fig=TRUE>>= \end_layout \begin_layout Plain Layout library(lattice) \end_layout \begin_layout Plain Layout print(histogram(~Weight|Gender,data=d,col="grey",layout=c(1,2))) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Notice that in this case, as in others, the data frame can be used as an argument in the function. \end_layout \begin_layout Subsubsection Subsetting \end_layout \begin_layout Standard A data frame has two dimensions. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout dim(d) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard This tells us that the data frame has 20 rows and 3 columns. The rows are always given first by R. A data frame can be subsetted. So the first row is given by the following. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d[1,] \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The first three rows of the third column are given by the following. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d[1:3,3] \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard We can use a condition to subset the data. For example, we may want only the animals that weigh over 100g.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d[d$Weight>100,] \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The data can be ordered by weight using the following syntax. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d[order(d$Weight),] \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsection Managing complex ecological data \end_layout \begin_layout Standard The examples so far have used very small data sets with few variables. However ecologists typically have to deal with many variables at once. The most familiar example is the species by sites matrix. Any project that uses quadrats, soil cores or catch records will produce some form of species by sites matrix. These can be very large, particularly if the study involves a diverse community such as a tropical forest. \end_layout \begin_layout Standard Here is an example of a small species by sites matrix that is included in the package vegan. Species names are shortened and only the top corner of the matrix is shown here. \end_layout \begin_layout Standard \begin_inset ERT status collapsed \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout library(vegan) \end_layout \begin_layout Plain Layout data(dune) \end_layout \begin_layout Plain Layout dune[1:5,1:10] \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard There are many challenges in handling data of this type. \end_layout \begin_layout Enumerate The number and names of the species that may be found are often not known at the beginning of the study. \end_layout \begin_layout Enumerate Some species may not be identified until they have been sent to a specialist.
Some species identities may be re-evaluated and changed during the course of the study. \end_layout \begin_layout Enumerate Quadrats or soil cores may have a hierarchical structure (i.e. several quadrats are placed at a site and the results pooled). \end_layout \begin_layout Enumerate Most studies aim to combine the results of the analysis of the species by sites matrix with environmental data taken from the sites. They usually also involve spatial analysis and mapping. \end_layout \begin_layout Standard Given all these issues, what is the best data format for capturing the results? It is tempting to think that the data can be captured directly as a species by site matrix. This seems reasonable, as each measurement on a separate species is a variable in itself. However, unless the data set is very small, with only a few species and sites, the approach will lead to problems as a result of the issues listed above. \end_layout \begin_layout Standard The way to look at the problem is to think about the observations that are made on each quadrat or sediment core. Usually only a small subset of the species are found in each one. The observation of a species is accompanied by a measurement such as abundance, biomass or percentage cover. So if we were calling out relative cover in a quadrat to a data recorder we might say \begin_inset Quotes eld \end_inset Agrostis tenuis 50%, Calluna vulgaris 30%, Erica cinerea 20% \begin_inset Quotes erd \end_inset and so on. \end_layout \begin_layout Standard What happens if we suddenly come across a species that we are not familiar with? How can we fill in the matrix? \end_layout \begin_layout Standard The solution is to give the species an identifying number and use this instead of a species name. We can then determine the name later and add the information back into the analysis. To do this we use tables that we join together through the use of a common index column. One table holds the information about the species.
Another table holds information taken from each quadrat or core. We usually also need a third table holding information on the environmental variables for each quadrat or site. \end_layout \begin_layout Standard The easiest way to learn how to use this system is to follow an example. Once you have understood the example it can be easily adapted to any study. \end_layout \begin_layout Standard The example involves an invertebrate survey of Poole Harbour that was carried out in October and November 2009 using a 500m x 500m sampling grid of 80 sites. The area surveyed included mudflats, sandflats and areas of mixed sediment between Mean High Water Springs (MHWS) and Mean Low Water Springs (MLWS). Five core samples (10cm diameter) were obtained to a depth of 15cm from each of the 80 sampling sites. An assessment of macro-algal cover and substrate was made at each site. A total of 88 invertebrate species were recorded in the core samples, comprising 38 annelid species, 21 Mollusca, 21 Crustacea and 8 other species, including sea anemones and insect larvae. \end_layout \begin_layout Standard So, for each site we have data on invertebrate abundances from five cores. We also wish to combine data on algal cover and substrate for statistical analysis and want to map the results. \end_layout \begin_layout Subsubsection The species table \end_layout \begin_layout Standard There are quite a large number of species present in the area. The researchers could not have predicted in advance which species would be found. It would have been difficult to set up a recording sheet as a species by sites matrix. \end_layout \begin_layout Standard The way to address this is to build a table with a column that holds a consistent identifying code for each species found, together with additional columns for species name, genus, family etc. \end_layout \begin_layout Standard Here is a table of this type for this study.
In this case the species are in alphabetical order as the table was in fact built after the event for demonstration purposes. When capturing the data in real time the species are much more likely to be given their identifying number in the order in which they are found. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Species<-read.csv("http://tinyurl.com/QEData/PH/SpeciesTable.csv") \end_layout \begin_layout Plain Layout str(Species) \end_layout \begin_layout Plain Layout head(Species) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The identifier in the table need not be a number. A code consisting of letters could be used, but it must be unique to the species. It is useful to have a shortened version of the species name (in this case an eight-letter code) which can be used as a column header for a species by sites matrix. \begin_inset Foot status collapsed \begin_layout Plain Layout The make.cepnames function in the R package vegan can be used for this \end_layout \end_inset \end_layout \begin_layout Subsubsection The abundance table \end_layout \begin_layout Standard Once an identifier has been given to the species this is used to link the species name to the data on its abundance in each core or quadrat. This saves typing out the full species name many times when capturing data. Typing names repeatedly would not only be time consuming but could easily lead to errors if a name were mistyped. This way we only type the species name once. When recording the abundances, or any other data on the species found in the cores, the unique identifier must always be used. The data are held in a table like this.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Abun<-read.csv("http://tinyurl.com/QEData/PH/AbunTable.csv") \end_layout \begin_layout Plain Layout str(Abun) \end_layout \begin_layout Plain Layout head(Abun) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Merging the data \end_layout \begin_layout Standard The two tables that comprise the raw data are then merged together so the species identities are placed against the abundances. This is very easy once they have been loaded into R. Providing that one, \series bold and only one \series default , of the columns in each table has \series bold exactly \series default the same name (remember that R is case sensitive) this simple command does the trick. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout SpAbun<-merge(Species,Abun) \end_layout \begin_layout Plain Layout str(SpAbun) \end_layout \begin_layout Plain Layout head(SpAbun) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Aggregating the data \end_layout \begin_layout Standard Recall that five cores were taken from each site. Although it can be useful to look at the variability between these cores, for simplicity the cores can be aggregated to form a sum (or mean) of abundances for each site. The easiest way to do this in R is to use the very powerful reshape package. The following line defines a \begin_inset Quotes eld \end_inset molten \begin_inset Quotes erd \end_inset form for the data. Once the data are molten, most forms of tables can be \begin_inset Quotes eld \end_inset cast \begin_inset Quotes erd \end_inset from them.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout library(reshape) \end_layout \begin_layout Plain Layout d1<-melt(SpAbun,id=c("Abbrev","Site","Core"),meas="Abun") \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard The abbreviated names and the site and core numbers act as identifiers. Identifiers are grouping variables. They are usually categorical variables or indices that identify what the measurements were taken on. The measurement that was taken here is abundance. The same logic can be used to melt data that have several measurements, such as length, weight and biomass. \end_layout \begin_layout Standard Now the data can be aggregated by casting them using \emph on cast(d1,Site+Abbrev~variable,sum) \emph default . Because one of the identifiers (Core) has been left out of this formula there will potentially be multiple values for each combination of site and abbreviated name. The casting syntax includes the name of a function to apply to these multiple values that returns a single value. If you have used pivot tables in a spreadsheet you will have carried out a similar operation. Notice that when casting, the identifiers placed to the left of the ~ symbol will be used to define row values and those to the right are used for column headings. The only real way to understand this is to look very carefully at the example.
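As a cross-check on the casting logic, the same kind of site-level sum can be sketched in base R with aggregate. The sketch below uses a small invented data frame; the column names Site, Core, Abbrev and Abun mirror those in the survey data, but the values and the species codes sp1 and sp2 are made up purely for illustration:

```r
# Toy long-format data: two sites, two cores each, two invented species codes
d1_toy <- data.frame(
  Site   = c(1, 1, 1, 1, 2, 2),
  Core   = c(1, 2, 1, 2, 1, 2),
  Abbrev = c("sp1", "sp1", "sp2", "sp2", "sp1", "sp1"),
  Abun   = c(4, 3, 2, 1, 5, 6)
)

# Sum abundance over cores within each Site x species combination,
# mirroring the effect of cast(d1, Site + Abbrev ~ variable, sum)
d2_toy <- aggregate(Abun ~ Site + Abbrev, data = d1_toy, FUN = sum)
d2_toy
```

Because Core is left out of the formula, the two core values for each site and species are collapsed into a single sum, exactly as when Core is dropped from the casting formula.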
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout d2<-cast(d1,Site+Abbrev~variable,sum) \end_layout \begin_layout Plain Layout head(d2) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Species by site matrix \end_layout \begin_layout Standard The species by site matrix using aggregated abundances for each site can be easily produced using the same approach. There are therefore just three lines of R code needed to convert the imported tables into a format you can work with (merge, melt and cast). The result can be written to a file and opened in a spreadsheet if required. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout SpMatrix<-cast(d1,Site~Abbrev,sum) \end_layout \begin_layout Plain Layout head(SpMatrix[,1:8]) \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Subsubsection Total species abundances \end_layout \begin_layout Standard The same logic regarding reducing the number of combinations of identifiers can be used again in order to calculate total (or mean) abundance for each species. Notice that the second line sorts the result. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout SpTotals<-cast(d1,Abbrev~variable,sum) \end_layout \begin_layout Plain Layout SpTotals<-SpTotals[order(SpTotals$Abun,decreasing=T),] \end_layout \begin_layout Plain Layout SpTotals \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Using only the most abundant species \end_layout \begin_layout Standard So now, say that we wanted to reduce the species by sites matrix to the twenty most abundant species. 
We have already ordered the totals so we can take the first twenty. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Top<-SpTotals$Abbrev[1:20] \end_layout \begin_layout Plain Layout Top \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard We can use an additional R trick. The %in% operator returns a logical vector of TRUE and FALSE values. If we use this vector to subset the columns of the matrix so that only these species are included, we get the result we want. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout SpMatrix2<-SpMatrix[,names(SpMatrix)%in%Top] \end_layout \begin_layout Plain Layout head(SpMatrix2) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Converting to density \end_layout \begin_layout Standard The data can easily be converted to individuals per square metre by calculating a conversion factor. Each core had a radius of 5cm, so its area is 5²*pi cm². Because the five cores at each site were summed, the total area sampled per site is 5*5²*pi cm², and multiplying by 10000/(5*5²*pi) converts the summed counts into numbers per square metre. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout CF<-10000/(5*5^2*pi) \end_layout \begin_layout Plain Layout CF \end_layout \begin_layout Plain Layout SpMatrix2<-round(SpMatrix2*CF,0) \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Subsubsection Joining the sites data \end_layout \begin_layout Standard The sites data can be loaded into R from here.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Sites<-read.csv("http://tinyurl.com/QEData/PH/Sites.csv") \end_layout \begin_layout Plain Layout str(Sites) \end_layout \begin_layout Plain Layout head(Sites) \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard We will finally join the smaller matrix of abundant species to the site data so that we can look at the relationship between invertebrate density and the algal cover and substrate at each site. The process is the same as we have already used. Providing the two data frames are of the same length (which they must be, because there is now only one row per site) and there is a single column with the same name which contains matched index data, we can merge the two together. However, we lost the \begin_inset Quotes eld \end_inset Site \begin_inset Quotes erd \end_inset column when we subsetted the matrix. The matrix is still in the same order, so we can add this back in. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout SpMatrix2<-data.frame(Site=1:80,SpMatrix2) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Now all we need to do is merge the two together.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Sites<-merge(Sites,SpMatrix2) \end_layout \begin_layout Plain Layout head(Sites) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Exporting to SPSS \end_layout \begin_layout Standard The data, with all the columns aligned, can now be exported in CSV format, which can be read by any statistical package, including SPSS. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout write.csv(Sites,file="PHinvertebrates.csv",row.names=F) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection Combining with sector data \end_layout \begin_layout Standard The distribution of birds within the harbour is based on the low tide WeBS counts. The number of birds quoted for each sector represents the average of counts conducted throughout the winter months of 2004/5. If we wish to look at relationships between invertebrate densities and bird distribution we therefore have to aggregate the densities to sector level. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Birds<-read.csv("http://tinyurl.com/QEData/PH/Birds.csv") \end_layout \begin_layout Plain Layout str(Birds) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard We can first remove the columns from the data frame that are not suitable for aggregation. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Densities<-Sites[,-c(1,2,3,5,6)] \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Now we can merge the data together.
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Sectors<-merge(Birds,Densities) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard The aggregation uses the same logic as previously. This time we can just define the sector as the only identifier. All other variables are measurements that we wish to aggregate using the mean. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout Sectors<-melt(Sectors,id="sector") \end_layout \begin_layout Plain Layout Sectors<-cast(Sectors,sector~variable,mean) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Standard Again, we can save the results, which now form a compact data frame for further analysis. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout write.csv(Sectors,file="SectorMeans.csv",row.names=F) \end_layout \begin_layout Plain Layout @ \end_layout \end_inset \end_layout \begin_layout Subsubsection The relational data structure \end_layout \begin_layout Standard If you have used a relational database at any time you might realise that the merge operations effectively allow R to implement what are known as \begin_inset Quotes eld \end_inset inner joins \begin_inset Quotes erd \end_inset in database terminology. The linkages between the four tables we have used could be implemented in a database using the structure shown below. Notice that there are several \begin_inset Quotes eld \end_inset one-to-many relationships \begin_inset Quotes erd \end_inset .
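The correspondence between merge and database joins can be illustrated with two small invented tables that share an index column. All SpeciesID values, names and abundances below are made up for the sketch; species 1 also shows a one to many match, since it appears twice in the abundance table:

```r
# Two toy tables sharing the index column SpeciesID (all values invented)
species_toy <- data.frame(SpeciesID = c(1, 2, 3),
                          Name = c("species A", "species B", "species C"))
abun_toy    <- data.frame(SpeciesID = c(1, 1, 2, 4),
                          Abun = c(5, 3, 2, 9))

# Default merge = inner join: SpeciesID 3 (no abundance records) and
# SpeciesID 4 (no species record) are both dropped
inner <- merge(species_toy, abun_toy)

# all = TRUE gives an outer join: unmatched rows are kept, gaps become NA
outer <- merge(species_toy, abun_toy, all = TRUE)
```

The inner join keeps only rows whose SpeciesID appears in both tables, which is the behaviour relied on throughout this lesson; the outer join is useful when unmatched rows (such as species recorded but never found) need to be retained.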
\end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=5in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/DataBase1.png} \end_layout \end_inset \end_layout \begin_layout Standard It is worth drawing up a figure such as this before collecting the data. The table names, variable names and links between them should also be given as metadata. Notice that the links represent the hierarchical structure of the data. Tables that hold repeated measures (such as the abundance table) will be at the head of the links. \end_layout \begin_layout Standard The merge function in R can also implement so-called \begin_inset Quotes eld \end_inset outer joins \begin_inset Quotes erd \end_inset , but this goes beyond the level needed here. \end_layout \begin_layout Subsubsection Exporting to a GIS \end_layout \begin_layout Standard The sites data include coordinates, so they are explicitly spatial. R can work with spatial data using a range of tools, including advanced spatial statistics. The following lines first turn the data frame into a spatial object by defining the coordinates. The coordinate reference system is then set (EPSG:27700 = British National Grid). The data are then exported to an Esri shapefile called sites in the folder \begin_inset Quotes eld \end_inset data \begin_inset Quotes erd \end_inset . \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout <<>>= \end_layout \begin_layout Plain Layout library(rgdal) \end_layout \begin_layout Plain Layout coordinates(Sites)<-~X+Y \end_layout \begin_layout Plain Layout proj4string(Sites)<-CRS("+init=epsg:27700") \end_layout \begin_layout Plain Layout writeOGR(Sites,"data","sites",driver="ESRI Shapefile") \end_layout \begin_layout Plain Layout @ \end_layout \begin_layout Plain Layout \end_layout \end_inset \end_layout \begin_layout Standard This file can then be opened in any GIS.
For example, the screenshots below show the file reprojected on the fly in QGIS and displayed over a Google Maps layer. Pie charts of relative abundances can be produced using the diagram plugin. \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=5in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/qgis.png} \end_layout \end_inset \end_layout \begin_layout Standard \noindent \align center \begin_inset ERT status open \begin_layout Plain Layout \backslash setkeys{Gin}{width=5in} \end_layout \begin_layout Plain Layout \backslash includegraphics{figs/qgis1.png} \end_layout \end_inset \end_layout \begin_layout Subsection Summary \end_layout \begin_layout Standard This lesson has covered a lot of ground. It will take time and experience to assimilate. The key points to remember are: \end_layout \begin_layout Enumerate Keep your own raw data in a simple table with defined column headings. \end_layout \begin_layout Enumerate Do not mix calculations with raw data in a spreadsheet. \end_layout \begin_layout Enumerate Ecological survey data have a hierarchical structure, which should be reflected in the way you store the data. \end_layout \begin_layout Enumerate Correctly structured data tables are indefinitely extensible. Once a table has been defined it can hold any amount of data. \end_layout \begin_layout Enumerate If you find yourself capturing the same sort of information in separate parts of a spreadsheet rather than in one single table then you are doing something wrong. For example, starting a new sheet for each site will lead to data that cannot be easily processed. \end_layout \begin_layout Enumerate Unique identifiers should be used for each hierarchical level. Data can then be merged together in order to line up columns. Data tables must be provided with an identifying column with the same heading and containing the same ID codes.
\end_layout \begin_layout Enumerate Data aggregation and reformatting can be achieved for the whole data set with just one or two well-designed lines of R code. This can save many hours of work in Excel. \end_layout \begin_layout Enumerate Use the example given here as a model when planning your own data collection. Most ecological data will have a very similar sort of structure. The R code provided can be easily adapted to other data sets with a little thought. \end_layout \begin_layout Subsection Exercises \end_layout \begin_layout Enumerate Work through the steps shown in this document carefully, looking at the data in different ways (use fix or edit to pop up spreadsheets in R). \end_layout \begin_layout Enumerate Try saving the data frames at various points (use \begin_inset Quotes eld \end_inset write.csv \begin_inset Quotes erd \end_inset ) and open the results in a spreadsheet. \end_layout \begin_layout Enumerate Import some of the results of the steps into SPSS and run some simple analyses there. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Plain Layout \backslash pagebreak \end_layout \end_inset \end_layout \end_body \end_document