The lack of attention to the general issue I aim to explain here has been puzzling me. Recording all potential causes of death and then explicitly stating both possible outcomes (death, or discharge from hospital) leads to a simple case control study design. Why are informative data sets impossible to find online?
I’ll try to lay out the problem simply.
The challenge posed by all synergistic processes is well known to ecologists. When we are studying complex systems we talk about the “multidimensional hyperspace”. If everything has an influence on everything else it is clearly impossible to ever understand or even properly define the role of any single variable in a process. All we can do is our best.
In order to keep such problems tractable higher order interactions are usually ignored. The simplest possible model is an additive one. This additive model is typically used for statistical analysis, even when interactions (conditional effects) are well known to be important. This is why ecologists distrust all our models. We know that they are wrong apriori!
As we don’t (yet) have any real data I have simulated some dummy data using a naive Bayesian network. The term “naive” refers to a structure which reduces conditional probabilities to a minimum. The arrows on the diagram point in a non causal direction. The probabilities defined are p(illness|outcome) rather than p(outcome|illness). If p(illness|death) > p(illness|discharge) the illness has an effect on overall mortality. If If p(illness|death) = p(illness|discharge) it does not.
The advantage of defining joint conditional probabilities using a naive bates format is that it reduces the dimensionality of the problem. It also allows the overall probability of the outcome to be defined. The proportion of long hospital stays that result in death is estimated to be around 2%.
Notice that without any evidence on outcome the network gives equal probabilities for all illnesses. On entering evidence these probabilities change, except in the case of Covid. In this simple thought experiment Covid has no effect on mortality. It is a neutral factor that combines additively with all other morbidities but which does not actually add to the outcome.
Data can be simulated from the Bayesian network in the form of cases with all available combinations of states as determined by the joint conditional probability table. Take into account the following
The Bayesian network was used to simulate 10,000 cases. The probability of being covid positive or covid negative was assumed to be 50:50. So the results can be summarised as a cross tabulation.
The case fatality ratio with covid is the same as the overall hospital stay case fatality, as covid has no effect.
table(d$Outcome,d$Covid)
##
## Covid No_covid
## Dies 1041 998
## Lives 48835 49126
This table of aggregated data can be analysed using a simple statistical test that every one has been taught in elementary maths classes.
chisq.test(table(d$Outcome,d$Covid))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(d$Outcome, d$Covid)
## X-squared = 1.1086, df = 1, p-value = 0.2924
The result is not significant. So the data needed to check whether covid has any effect is trivially simple. We just need to be able to form the cross tabulation!
Carl Henegan included an age structured table of hospital deaths in a recent tweet. This table included only those cases which tested positive for covid and died. The simulated data can be filtered to produce a similar table.
d %>% filter(Covid == "Covid") %>%
filter(Outcome == "Dies") %>%
group_by(Outcome,Diabetes,Heart_disease,Covid, Cancer) %>% summarise(n=n()) %>% arrange(-n) %>% dt()
## `summarise()` regrouping output by 'Outcome', 'Diabetes', 'Heart_disease', 'Covid' (override with `.groups` argument)
As the naive model produces additive effects the highest number of deaths are tabulated with all the co-morbidities. This is unrealistic. However notice that there are 32 covid deaths being reported under the naive model with no comorbidities!
Note that this is where the problem arises when only partial data are reported.
It is tempting to start speculating about numbers of deaths with covid and deaths from covid from these sort of partial data tables. However the data only tell part of the story. In this case it is misleading as the simulation assumed no deaths at all from covid. This does not imply that this is the case in the real world. It just means that we can’t really tell unless the rest of the case control study data is provided.
If the complete data set for hospital admissions include outcomes and co-morbidities were available it would be possible to build statistical models that take into account time dependency and stratification by hospital. This would need anonimised individual patient records in order to properly hold for age specific effects and to build a hierarchical model with the appropriate structure for random and fixed effects. This is quite a major undertaking (I would not be qualified to undertake it! )
In this trivially simple case a naive logistic regression matches the naive data set and throws up the same p-value as a chi squared test on the aggregate table!
This match between the over simplistic chi-squared test on a two way contingency table and a logistic regression on individual records surprised me at first. The reason behind it lies in the assumption of poisson distribution of counts that is implicit in the chi.square test. It applies in this naive case, but may be dangerous to generalise. However it does provide a rough and ready screen. The aggregated counts can also be modelled by glms assuming poisson distributions, so although there is a danger of omitting the ecological fallacy (Simpson’s paradox) with aggregate data it would at least be something to look at.
Where is the data?
d$Outcome<-ifelse(d$Outcome == "Dies",0,1)
mod<- glm(Outcome ~ .,data=d,family="binomial")
summary(mod)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3374 0.1067 0.1618 0.2198 0.3313
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.87517 0.04126 69.685 < 2e-16 ***
## DiabetesNo_diabetes 0.83599 0.04896 17.074 < 2e-16 ***
## Heart_diseaseNo_heart_diesease 0.35174 0.04569 7.699 1.37e-14 ***
## CancerNo_cancer 1.45474 0.05707 25.493 < 2e-16 ***
## CovidNo_covid 0.04775 0.04505 1.060 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19911 on 99999 degrees of freedom
## Residual deviance: 18698 on 99995 degrees of freedom
## AIC: 18708
##
## Number of Fisher Scoring iterations: 7
confint(mod)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.79493745 2.9566837
## DiabetesNo_diabetes 0.74059385 0.9325600
## Heart_diseaseNo_heart_diesease 0.26239473 0.4415167
## CancerNo_cancer 1.34412050 1.5678909
## CovidNo_covid -0.04053169 0.1360836