Trying to sort out the headers

The biggest challeng is to uniquely identify the size class measurements in terms of site, core name, core position and species.

If the data had been held consitently this would be easy. However the use of two colimn headers on the measurement sheets make this very difficult.

I first thought of reading the rows in separately as vectors. The plan was to join the two text fields together using separators which culd then be removed to produce unique columns for site, core and taxon.

However this is not going to work as planned thanks to the use of merged cells in Excel. There are missing values throughout. Worse still, the actual cell that the value has been taken from is sometimes the first in a series, but can be a middle cell. If it were always the first cell there would not be a problems as the NAs can be filled with the preceding value, but in this case its not going to work.

library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
header1<- as.vector(t(read_excel("LGI September 2018 Abundance and Biometric Data.xlsx", 
    sheet = "MF Measurements Sept 2018", 
    col_names = FALSE, n_max = 1)))

header2<- as.vector(t(read_excel("LGI September 2018 Abundance and Biometric Data.xlsx", 
    sheet = "MF Measurements Sept 2018", 
    col_names = FALSE, n_max = 1,skip=1)))

df<-data.frame(cbind(header1,header2))
## Warning in cbind(header1, header2): number of rows of result is not a
## multiple of vector length (arg 1)
library(aqm)
## 
## Attaching package: 'aqm'
## The following object is masked from 'package:stats':
## 
##     dt
dt(df)

Ignoring the top row

library(readxl)
MF <- read_excel("LGI September 2018 Abundance and Biometric Data.xlsx", 
    sheet = "MF Measurements Sept 2018", 
    skip = 1)

## Need to make the names more consistent

names(MF)<-tolower(names(MF)) ## Change all to lower case
names(MF)<-gsub("[^\\p{L}']+", " ", names(MF), perl = TRUE) ## Strip anything that's not a letter or a space.
names(MF)<- gsub(" $","", names(MF), perl=T) ## Strip trailing spaces
names(MF)<- gsub(" ","_", names(MF), perl=T) ## Underscore for space
table(names(MF))
## 
##            abra_juv        capitellidae       cappitellidae 
##                   1                   9                   1 
##        cirratulidae         corophiidae     crangon_crangon 
##                  17                   6                   1 
##         hydrobiidae     macoma_balthica            nematoda 
##                  19                  25                   8 
##          nephtyidae          nereididae         oligochaeta 
##                  11                   7                  29 
##       phyllodocidae scrobicularia_plana          size_class 
##                   1                  19                   1 
##           spionidae                   x 
##                  21                   1
MF %>% pivot_longer(cols=c(-1)) %>% group_by(size_class,name) %>% summarise(n=sum(value,na.rm=TRUE)) %>% filter(n>0) -> MF
## Warning: Duplicate column names detected, adding .copy variable
MF %>% pivot_wider(names_from = c(name), values_from = n) %>%
dt()