Combine the data with this years.
d2<-read.csv("data/LondonGatewayDataFINAL_2017.CSV")
d2$Date<-as.Date(d2$Date, "%d/%m/%Y")
## Give the sites consistent names
d2$Site<-as.character(d2$Site)
d2$Site[grepl("Mucking",d2$Site)]<-"NM" ##All the Muckings are called NM as they are North Mucking in the historical data
d2$Site[grepl("SiteA",d2$Site)]<-"SW"
d2$Site[grepl("SiteX",d2$Site)]<-"SF"
d2$Site<-as.factor(d2$Site)
d2$Year<-d2$Winter
d2$month<-d2$Month
d2$Month<-month.name[d2$Month]
This table contains the data for historic winter peak counts for the 19 target species listed in Clause 10.5.4.
I am pretty much convinced that the Dunlin counts were the maximums, not the low water counts, in past surveys. There is no evidence of this in the report, but as the dunline counts are generally high, and using this years methodology that divides the counts by tide leads to an odd reduction in the counts if this is not accounted for I’ll run with this assumption.
d3<-read.csv("data/df_2018.csv")
## Take maximum of paired tide readings (not just the low, as Dunlins)
d3 %>% group_by(Site,Species,month) %>% summarise(value=max(value)) -> d3
data1<-data.frame(Site=d3$Site,Species=d3$Species,Year=2017,Month=d3$month,value=d3$value)
data2<-data.frame(Site=d2$Site,Species=d2$Species,Year=d2$Year,Month=d2$Month,value=d2$Number)
data<-rbind(data1,data2)
The question that arises is how were the historic winter maximums actually calculated? Does this reprodice the data in the report.
data %>% group_by(Year,Month,Species) %>% summarise(msum=sum(value)) %>% group_by(Year,Species) %>% summarise(Peak=max(msum)) -> peaks
peaks<-as.data.frame(peaks)
## Excel friendly spread out table
peaks %>% filter(Year< 2004) %>% spread(Year,Peak) %>% dt()
OK. So this is exactly the same as table 3.1. So this must be how Leo calculated the peak abundances of the species. Now we’ll come on to the assemblage
Just filtering out the key species and comparing to the sum from table 1 DOES NOT equate to the data in table 3.2 of the report. The total assemblage number being reported is not the sum of all the peak abundances.
peaks %>% group_by(Year) %>% summarise(Sum=sum(Peak)) -> assemblage
assemblage<-data.frame(Year=assemblage$Year,Species="Assemblage",Peak=assemblage$Sum)
hp <-rbind(peaks,assemblage)
key_species<-c("Avocet","Black-tailed godwit","Dunlin","Assemblage")
hp %>% filter(Species %in% key_species) %>% spread(Year,Peak) %>% dt()
Let’s try the other approach. Find the month with the maximum total assemblage.
data %>% group_by(Year,Month) %>% summarise(sum=sum(value)) %>% group_by(Year) %>% summarise(Month=Month[which(sum==max(sum))],Assemblage=max(sum)) -> peak_month
dt(peak_month)
Now the peak numbers coincide again. So Leo in fact used a hybrid method. The peak counts for the individual species are the peak observations for any month over the winter. However the assemblage count is the total for the month with the peak observation of all the species. This was NOT what I was doing. I was taking the species counts themselves from the month with the peak count.
This now reproduces Leo’s table 3.2 exactly (apart from the lower section in which propoptions are calculated)
assemblage<-data.frame(Year=peak_month$Year,Species="Assemblage",Peak=peak_month$Assemblage)
hp <-rbind(peaks,assemblage)
key_species<-c("Avocet","Black-tailed godwit","Dunlin","Assemblage")
hp %>% filter(Species %in% key_species) %>% spread(Year,Peak) %>% dt()
The problem with this is that if the proportional contribution of all the species to the total assemblage were to be added together it would be over 100%. This was not obvious, because in table 3.2 the proportional contribution was limited to three species. I am not sure what the best solution to this might be as the original report mixed it up.
Notice that this explains why the stacked bar charts do not lead to the same pattern. In actual fact stacked bar charts cannot be used to show these data in this form, as the numbers don’t actually stack up to the peak assemblage.
The only clean solution to this would seem to be to produce two data sets. One using the recorded values for the species during the peak assemblage month as the values. This way they will stack to the assemblage value. The other sums the peak abundances that Leo used. In this case the assemblage value will differ from that included in last year’s report.
data$Year_Month<-paste(data$Year,data$Month)
peak_year_months<-paste(peak_month$Year,peak_month$Month)
data %>% filter(Year_Month %in% peak_year_months) %>%
group_by(Year,Species) %>% summarise(Sum=sum(value)) -> peak_species
peak_species %>% spread(Year,Sum,fill=0) %>% dt()
peak_species %>% group_by(Year) %>% summarise(Sum=sum(Sum))
## # A tibble: 14 x 2
## Year Sum
## <dbl> <dbl>
## 1 1999 8606
## 2 2000 8395
## 3 2001 4190
## 4 2002 10263
## 5 2007 8635
## 6 2008 6869
## 7 2010 2554
## 8 2011 4115
## 9 2012 8666
## 10 2013 8213
## 11 2014 7410
## 12 2015 3338
## 13 2016 10106
## 14 2017 14377
So, a stacked boxplot based on table one would work, but would be different to the results in the previous year’s report.
write.csv(peak_species,file="data/table1.csv",row.names = FALSE)
In the alternative case I will use the method thar Leo did to obtain the peak species counts, which leads to the sum of the counts becoming the assemblage total. The hybrid version between table one and table two was used in the original report.
data %>% group_by(Year,Month,Species) %>% summarise(msum=sum(value)) %>% group_by(Year,Species) %>% summarise(Sum=max(msum)) -> species_peaks
species_peaks %>% spread(Year,Sum,fill=0) %>% dt()
write.csv(species_peaks,file="data/table2.csv",row.names=FALSE)
Now the totals will not be the same as the previous report, but they are consistent with the methodology of taling the peak abundance in any one month as a measure.
species_peaks %>% group_by(Year) %>% summarise(Sum=sum(Sum))
## # A tibble: 14 x 2
## Year Sum
## <dbl> <dbl>
## 1 1999 9284
## 2 2000 8628
## 3 2001 5812
## 4 2002 10766
## 5 2007 10355
## 6 2008 8887
## 7 2010 2554
## 8 2011 4994
## 9 2012 13718
## 10 2013 11455
## 11 2014 10983
## 12 2015 4267
## 13 2016 15231
## 14 2017 21112
The “hybrid”" analysis presented in the previous report is potentially rather confusing. This is may be the result of the monitoring and data analysis being tied to protocols that were established before consideration of their relationship with the avaailable data. However it might be better to base analyses on either table one or table two as formed from the raw data.