Data formats

The nature of data

You have seen two types of spatial data so far. Vector data and raster data. Vector data has an attribute table holding values of other variables. The same general structure is used for any data table (data frame) in R. However if you are capturing data “by hand” you need to be careful to format your data table correctly.

Spreadsheets

Spreadsheets are commonly used to both collect and analyse data from experiments and field studies. This is OK as you can save a spreadsheet in the generic comma seperated variable (CSV) format. Such files can be read by R and any other statistical software. The problem with spreadsheets is that they do not impose any structure on your data. It is your responsibility to ensure consistency.

Over complicating matters with spreadsheets

It is never necessary to store a raw data table on more than one sheet. If you find that you have done this you must think clearly about how to place all the data on the same sheet. This sheet should have column headers on the first row followed by measurements which are all in line with the column headers.

Some simple rules to follow

If you follow the following simple rules you should be able to produce data files from a spreadsheet that read cleanly and consistently into R. If you break the rules you'll find that analyses don't work the way you intended.

Rule 1

Make sure that the column headers consist of only one word. No spaces or unusual characters. Make the column headers short and informative. Don't include units or other descriptors. You can annotate plots later. If you don't follow this rule the file may not load. Be especially careful to avoid having “loose” observations or values typed into cells to the right of the headers. Keep good notes on units and metadata elsewhere.

Rule 2

Make sure that columns intended to hold numerical values ONLY hold numerical values. No letters or annotations. If you don't follow this rule you will find that the columns have been interpreted as factors (i.e. grouping variables) rather than measurements.

Rule 3

Make sure that there are no missing values in the file. If you can't avoid this, use the two letters NA to represent missing values. There is no guarantee that an analysis will run correctly with missing values and they should be avoided. A well conducted study does not have missing values. You can sometimes avoid problems by “imputing” a likely value for a missing value. Make sure you document this.

Rule 4

Use one column per variable. Do not spread measurements of the same type over several columns. Think of the example of a data table that is generated by a data logger or GPS. There is one column header for each variable and new data just keeps being added to the end of the same table

Rule 5

If you do want a column to be interpreted as a factor (categorical variable) make sure that you use some letters in the values. If you only use numbers it will be assumed to be numeric. So use something like Site1 or S1 rather than 1, 2 3. if you want to group data by site.

Rule 6

You must not include anything other than raw data in the sheet to be converted into CSV format. No calculated subtotals. No figures. No spaces between rows.

The one column per variable rule

One of the commonest data formatting problems arises when mutiple columns are used to contain measurements on the same variable. This “wide” format often comes about in field work when data is collected with one column per site.

The wrong way

	site_1	site_2	site_3	site_4
1	7	7	8	6
2	2	9	7	11
3	6	13	11	2
4	9	11	9	2
5	2	10	7	8
6	1	13	13	2

What's the problem?

This seems natural, but in fact it breaks the one colum per variable rule.Site is in fact a categorical variable (factor) with four levels. If you use site as a column header you will end up with multiple sheets, one for each measurement. This quickly gets very messy.

The right way

	site	measure
1	site_1	7
2	site_1	2
3	site_1	6
4	site_1	9
5	site_1	2
6	site_1	1

What about the calculations per site?

The reason you probably started to set up a column per site was because you were thinking of carrying out calculations on the bottom row or side column of the spreadsheet (margins). This is natural in Excel, but not the way calculations work in database or statistical package.

What about the calculations per site?

It is very easy to work with data in the long format in R, as instead of adding calculations at the bottom or sides we use the factor as a grouping variable.

Example

library(dplyr)
d %>% group_by(site) %>% summarise(mean=mean(measure),sd= sd(measure)) ->dd

	site	mean	sd
1	site_1	5.10	2.69
2	site_2	10.40	2.67
3	site_3	9.80	2.66
4	site_4	6.30	3.30

Advantages of this method

The key advantage of this is that the output of the analysis is also a data table. So it is also in a suitable format for further analysis.

Advantages of this method

It is also easy to make plots.

library(ggplot2)
g0<-ggplot(d,aes(x=site,y=measure))
g0+geom_boxplot()

plot of chunk unnamed-chunk-5