Chapter 2 Types of variables

The key to keeping data in a tidy format is to define the variables you will be using properly before collecting any data. This is a natural process when a traditional database is used for data storage. However, when using a spreadsheet it is not necessarily as straightforward.

There are two broad classes of variable.

  1. Categorical variables
  2. Numerical variables

Within these classes some further distinctions can be made. Categorical variables can be binary (e.g. presence or absence, true or false), ordinal (e.g. strength of feeling) or non-ordinal (e.g. vegetation types).

Numerical variables can be subdivided into integers, numbers measured on an interval scale (e.g. temperature in degrees Celsius) and numbers measured on a ratio scale (e.g. length or mass). For data management purposes these distinctions are not particularly important. The key distinction is the one between categorical variables and numerical variables.
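As a rough illustration of how these classes might be declared in R, the sketch below builds a small data frame using base R only. The data frame `survey`, its column names and all of its values are hypothetical and chosen purely for illustration.

```r
# Hypothetical survey records; every name and value here is made up.
survey <- data.frame(
  present    = c(TRUE, FALSE, TRUE),                      # binary
  feeling    = factor(c("weak", "strong", "moderate"),
                      levels = c("weak", "moderate", "strong"),
                      ordered = TRUE),                    # ordinal
  vegetation = factor(c("heath", "woodland", "heath")),   # non-ordinal
  count      = c(3L, 0L, 12L),                            # integer
  temp_c     = c(11.5, 9.8, 14.2)                         # interval scale
)

# Show the class that R has recorded for each column.
str(survey)

# An ordered factor knows the order of its levels, so it can be compared.
survey$feeling > "weak"
```

Declaring the levels of the ordinal variable explicitly matters here, because R would otherwise sort them alphabetically.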

2.1 One variable, one column rules

If you follow three simple rules you can usually guarantee that your data will be tidy.

  1. All the values in any column must be the result of exactly the same measurement process.
  2. Aggregated data must not be mixed with raw data.
  3. There should be one (and only one) column for each measured variable.
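The second rule can be sketched in base R as follows. The data frame `raw` and its values are hypothetical; the point is only that the summary lives in its own table rather than being appended as extra rows beneath the raw measurements, as often happens in a spreadsheet.

```r
# Hypothetical raw measurements: one row per individual,
# one column per measured variable.
raw <- data.frame(
  id     = 1:6,
  gender = factor(c("f", "m", "f", "m", "f", "m")),
  wt     = c(61.2, 75.4, 58.9, 80.1, 63.0, 77.3)
)

# Rule 2: the aggregate is held in a separate data frame, so the raw
# table never contains "mean" or "total" rows.
mean_wt <- aggregate(wt ~ gender, data = raw, FUN = mean)

nrow(raw)   # the raw table still has one row per individual
mean_wt     # one row per gender, kept apart from the raw data
```

Because the aggregate is always derived from `raw`, it can be recalculated at any time and never needs to be stored alongside the measurements.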

These rules lead to the formation of a data frame. The best way to explain this is to look at what happens when the rules are violated.

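# Spread wt into one column per gender (a wide layout that breaks rule 3), drop the first column and display the result; assumes dplyr, tidyr and DT are loaded
d %>% pivot_wider(names_from = gender, values_from = wt) %>% select(-1) %>% datatable()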
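The contents of `d` are not shown in the text, so the following is only a sketch that uses a small hypothetical stand-in with the same two column names (`gender` and `wt`) plus an assumed `id` column. It reproduces the wide layout with `pivot_wider()` and then uses `pivot_longer()` to return to one column per variable. The `select(-1)` and `datatable()` steps are left out so that the sketch depends only on dplyr and tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for d; the real data are not shown in the text.
d <- tibble(
  id     = 1:4,
  gender = c("f", "m", "f", "m"),
  wt     = c(61.2, 75.4, 58.9, 80.1)
)

# Wide layout: wt is split across one column per gender, so every row
# holds an NA in the column belonging to the other gender.
wide <- d %>% pivot_wider(names_from = gender, values_from = wt)

# pivot_longer() gathers the gender columns back into a single gender
# column and a single wt column, dropping the padding NAs.
tidy <- wide %>%
  pivot_longer(cols = c(f, m), names_to = "gender", values_to = "wt",
               values_drop_na = TRUE)

wide
tidy
```

In the wide table the variable gender is no longer stored in a column of its own: it has become part of the column headings, which is what the third rule is intended to prevent.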