Keeping data tidy:Guidelines for research projects
2019-11-22
Chapter 1 Introduction
1.1 Why write these guidelines?
The guidelines have been written as a result of years of experience helping students to analyse data for research projects and dissertations. Many courses on data analysis and statistics assume that the data is already available in a well designed, conventional format, all ready for analysis. However this is rarely the case in the real world. Supervisors and tutors often assume that students intuitively know how to collect and maintain data. However relying on common sense and intuition alone is just not enough.
The use of spreadsheets for data analysis causes the problem. There is nothing intrinsically wrong with keeping data in a spreadsheet. However there are many potential hazards involved. Assuming spreadsheets can be used without any training is dangerous.
1.2 Tidy vs untidy data
Large quantities of useful data are stored in a naturally tidy format. All electronic devices designed to generate and store data will produce tidy data. A GPS produces tidy data in the form of a .gpx file. A data logger or automated weather station produces tidy data. However, as more and more quantitative data is generated and stored on the internet the ratio of the number of files consisting of untidy data to the number of files containing tidy data increases. There is a reason for this. When data values are generated in a tidy format they can simply be added to the end of a pre-existing table in a data base. However untidy data that does not follow any conventional format and so cannot be combined easily with pre-existing data files.
Untidy data leads to a feeling of data overload. It is bad for your mental health!
1.3 The bottom line: Problems with spreadsheets
Spreadsheets were designed originally as a tool for accounting, not scientific data management. When spreadsheets were first developed no one imagined that they would be used for managing large amounts of data. Excel was sold bundled with the MSAccess data base program under the assumption that data would be held in Access and only analysed and presented in Excel. However times have changed. Spreadsheets can now hold many more rows of data than in the past. They have become the tools of choice for most data capture and management. This does not have to lead to problems, but in reality it often does. The key difference between a spreadsheet and a traditional data base lies lies in the fact that spreadsheets do not impose any structure on the data format. This is why they became popular in the first place. It is very easy to type data into a spreadsheet, or cut and paste data from a range of sources straight into a table. That is not possible in a traditional data base. A traditional data base imposes a rigid structure on the data. The type of data held in each field is pre-defined. The first data bases even insisted on the user predefining the maximum number of characters or bytes used in order to ensure disk space was not wasted.
Figure 1.1 is a humorous demonstration of the potential problems with a spreadsheet.
Figure 1.1 is clearly a joke, but it does make a serious point. When data is entered into a spreadheet there is no built in sanity check to ensure that the units of measurement are compatible. There is not even any check to ensure that numerical values are not mixed with text. Calculations add a bottom line to a column of data, but these derived values represent a
hist(rnorm(100))