The problem

The lack of attention to the general issue I aim to explain here has puzzled me. Recording all potential causes of death, and then explicitly stating both possible outcomes (death, or discharge from hospital), leads to a simple case-control study design. So why are informative data sets impossible to find online?

I’ll try to lay out the problem simply.

  1. Admission to hospital for a stay of over one day is conditional on illness. You don’t go into hospital if you are well.
  2. As hospitals admit people who are ill, the risk of dying during any stay in hospital is higher than for other comparable periods in a person’s lifetime, although it is still lower than it would be if they were not admitted in order to be treated for the illness.
  3. Hospitals do keep very complete records on admissions, discharges and deaths. So the data must be available, if only in an aggregate form.
  4. There are four possible combinations of outcome (discharged vs dead) and covid status (positive vs negative).
  5. The same four combinations of outcomes apply to all other co-morbidities.
  6. Comorbidities act in synergy to increase the probability of a negative outcome.
  7. In order to estimate the lethality of SARS-CoV-2/Covid-19 (let’s call it covid for short here, and assume that illness is present) we need to control for mortality associated with all comorbidities.
  8. Not all comorbidities that lead to death are ever diagnosed. The proportion of deaths through undiagnosed additional illness cannot be zero.
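The 2x2 structure in point 4 is all a case-control comparison needs. A minimal sketch in R, using purely hypothetical counts (none of these numbers come from real hospital data):

```r
# Hypothetical counts only -- not real hospital data
tab <- matrix(c(50, 950,    # covid-positive: died, discharged
                45, 955),   # covid-negative: died, discharged
              nrow = 2, byrow = TRUE,
              dimnames = list(Covid   = c("Positive", "Negative"),
                              Outcome = c("Died", "Discharged")))
# The standard case-control effect measure is the odds ratio
odds_ratio <- (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])
round(odds_ratio, 2)  # ~ 1.12 for these made-up counts
```

An odds ratio near 1, as here, would indicate no detectable association between covid status and dying in hospital.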

The challenge posed by all synergistic processes is well known to ecologists. When we are studying complex systems we talk about the “multidimensional hyperspace”. If everything has an influence on everything else it is clearly impossible to ever understand or even properly define the role of any single variable in a process. All we can do is our best.

In order to keep such problems tractable, higher-order interactions are usually ignored. The simplest possible model is an additive one. This additive model is typically used for statistical analysis, even when interactions (conditional effects) are well known to be important. This is why ecologists distrust all our models. We know that they are wrong a priori!
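In R’s formula notation the distinction between the additive model and one with an interaction is just `+` versus `*`. A small sketch with simulated data (the variables and coefficients are illustrative only):

```r
# Two binary risk factors and a binary outcome (illustrative only)
set.seed(1)
a <- rbinom(1000, 1, 0.5)
b <- rbinom(1000, 1, 0.5)
# Outcome generated as genuinely additive on the log-odds scale
y <- rbinom(1000, 1, plogis(-2 + a + b))

fit_add <- glm(y ~ a + b, family = binomial)  # main effects only
fit_int <- glm(y ~ a * b, family = binomial)  # adds the a:b interaction term
# Likelihood ratio test: is the interaction term needed?
anova(fit_add, fit_int, test = "Chisq")
```

In real epidemiological data the interaction term may well matter; dropping it is a tractability choice, not a claim about nature.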

Simulating some data

As we don’t (yet) have any real data, I have simulated some dummy data using a naive Bayesian network. The term “naive” refers to a structure which reduces conditional probabilities to a minimum. The arrows on the diagram point in a non-causal direction. The probabilities defined are p(illness|outcome) rather than p(outcome|illness). If p(illness|death) > p(illness|discharge), the illness has an effect on overall mortality. If p(illness|death) = p(illness|discharge), it does not.

The advantage of defining joint conditional probabilities using a naive Bayes format is that it reduces the dimensionality of the problem. It also allows the overall probability of the outcome to be defined. The proportion of long hospital stays that result in death is estimated to be around 2%.
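The inversion from p(illness|outcome) back to p(outcome|illness) is a single application of Bayes’ rule. A sketch with assumed illustrative probabilities (the 2% baseline mortality is the only figure taken from the text; the conditional probabilities are made up):

```r
# Assumed illustrative probabilities, not fitted to any data
p_death               <- 0.02   # overall P(death) for a long hospital stay
p_ill_given_death     <- 0.30   # P(illness | death)      -- assumption
p_ill_given_discharge <- 0.10   # P(illness | discharge)  -- assumption

# Marginal P(illness), then Bayes' rule for P(death | illness)
p_ill <- p_ill_given_death * p_death +
         p_ill_given_discharge * (1 - p_death)
p_death_given_ill <- p_ill_given_death * p_death / p_ill
round(p_death_given_ill, 3)  # 0.058
```

Because p(illness|death) > p(illness|discharge) here, the illness raises mortality from the 2% baseline; setting the two conditionals equal (as the text does for covid) would leave it at exactly 2%.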

Notice that without any evidence on outcome the network gives equal probabilities for all illnesses. On entering evidence these probabilities change, except in the case of Covid. In this simple thought experiment Covid has no effect on mortality. It is a neutral factor that combines additively with all other morbidities but which does not actually add to the outcome.

Data can be simulated from the Bayesian network in the form of cases with all available combinations of states, as determined by the joint conditional probability table. Take the following into account:

  1. No attempt has been made to accurately parametrise the model using genuine data. This is a heuristic thought tool only. Giving names to the co-morbidities may have been a mistake in this context, as the numbers could easily be misinterpreted. It may have been better to have called them just diseases A, B, C, etc., with Covid as a mystery disease whose identity is inferred but not stated.
  2. The effects are additive. All four illnesses can be present, but this will be rare.
  3. If the model were to be accurately parametrised it could be initiated in a naive form then modified through link reversals and additions. This is time consuming and involves mental gymnastics. Bayesian networks are usually parametrised for machine learning simply by learning their structure and CPTs (conditional probability tables) from large data tables with multiple observations.
  4. As the model is additive it will inevitably, and unrealistically, lead to higher numbers of deaths with multiple co-morbidities than with single co-morbidities. This could be altered through explicit changes to the structure, but again this is for illustrative purposes only.
  5. Age itself could have been included in the model in the form of discrete classes. However the assumption is being made that the model applies to the explicitly vulnerable older age class with multiple co-morbidities.

The Bayesian network was used to simulate 100,000 cases. The probability of being covid positive or covid negative was assumed to be 50:50, so the results can be summarised as a cross tabulation.
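For readers without Bayesian network software, a rough equivalent can be sketched in plain R. The prevalences and coefficients below are my own assumptions, not the author’s network parameters, so the exact counts will differ from the table below; the key point is preserved by drawing covid independently of the outcome, so that by construction it has no effect:

```r
set.seed(42)
n <- 100000
# Comorbidities and covid status drawn independently (assumed prevalences)
d <- data.frame(
  Diabetes      = sample(c("Diabetes", "No_diabetes"), n, TRUE, c(0.1, 0.9)),
  Heart_disease = sample(c("Heart_disease", "No_heart_disease"), n, TRUE, c(0.1, 0.9)),
  Cancer        = sample(c("Cancer", "No_cancer"), n, TRUE, c(0.05, 0.95)),
  Covid         = sample(c("Covid", "No_covid"), n, TRUE, c(0.5, 0.5))
)
# Additive risk on the log-odds scale; covid deliberately contributes nothing.
# The -4 intercept gives roughly the 2% baseline mortality mentioned earlier.
lp <- -4 + (d$Diabetes == "Diabetes") * 1.0 +
           (d$Heart_disease == "Heart_disease") * 0.5 +
           (d$Cancer == "Cancer") * 1.5
d$Outcome <- ifelse(rbinom(n, 1, plogis(lp)) == 1, "Dies", "Lives")
table(d$Outcome, d$Covid)
```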

The case fatality ratio with covid is the same as the overall hospital stay case fatality, as covid has no effect.

table(d$Outcome,d$Covid)
##        
##         Covid No_covid
##   Dies   1041      998
##   Lives 48835    49126

This table of aggregated data can be analysed using a simple statistical test that everyone has been taught in introductory statistics classes.

chisq.test(table(d$Outcome,d$Covid))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(d$Outcome, d$Covid)
## X-squared = 1.1086, df = 1, p-value = 0.2924

The result is not significant. So the data needed to check whether covid has any effect are trivially simple: we just need to be able to form the cross tabulation!

Conditioning on death and covid

Carl Heneghan included an age-structured table of hospital deaths in a recent tweet. This table included only those cases which tested positive for covid and died. The simulated data can be filtered to produce a similar table.

d %>%
  filter(Covid == "Covid") %>%
  filter(Outcome == "Dies") %>%
  group_by(Outcome, Diabetes, Heart_disease, Covid, Cancer) %>%
  summarise(n = n()) %>%
  arrange(-n) %>%
  dt()
## `summarise()` regrouping output by 'Outcome', 'Diabetes', 'Heart_disease', 'Covid' (override with `.groups` argument)

As the naive model produces additive effects, the highest number of deaths is tabulated with all the co-morbidities present. This is unrealistic. However, notice that there are 32 covid deaths being reported under the naive model with no comorbidities!

Note that this is where the problem arises when only partial data are reported.

It is tempting to start speculating about numbers of deaths with covid and deaths from covid from these sorts of partial data tables. However the data only tell part of the story. In this case they are misleading, as the simulation assumed no deaths at all from covid. This does not imply that this is the case in the real world. It just means that we can’t really tell unless the rest of the case-control study data is provided.

Statistical modelling

If the complete data set for hospital admissions, including outcomes and co-morbidities, were available, it would be possible to build statistical models that take into account time dependency and stratification by hospital. This would need anonymised individual patient records in order to properly control for age-specific effects and to build a hierarchical model with the appropriate structure for random and fixed effects. This is quite a major undertaking (I would not be qualified to undertake it!).

In this trivially simple case a naive logistic regression matches the naive data set and throws up the same p-value as a chi-squared test on the aggregate table!

This match between the over-simplistic chi-squared test on a two-way contingency table and a logistic regression on individual records surprised me at first. The reason behind it lies in the assumption of a Poisson distribution of counts that is implicit in the chi-squared test. It applies in this naive case, but may be dangerous to generalise. However it does provide a rough and ready screen. The aggregated counts can also be modelled by GLMs assuming Poisson distributions, so although there is a danger of committing the ecological fallacy (Simpson’s paradox) with aggregate data, it would at least be something to look at.
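As a sketch of that rough-and-ready screen, here is a log-linear Poisson model fitted to the four aggregated counts from the cross-tabulation above. The residual deviance of the independence model plays the role of the chi-squared statistic:

```r
# The four cell counts from the cross-tabulation shown earlier
agg <- data.frame(
  Outcome = c("Dies", "Dies", "Lives", "Lives"),
  Covid   = c("Covid", "No_covid", "Covid", "No_covid"),
  n       = c(1041, 998, 48835, 49126)
)
# Fit the independence (main effects only) log-linear model.
# Its residual deviance is the likelihood-ratio G^2 statistic,
# asymptotically equivalent to the Pearson chi-squared statistic.
indep <- glm(n ~ Outcome + Covid, data = agg, family = poisson)
pchisq(deviance(indep), df = 1, lower.tail = FALSE)  # ~ uncorrected chi-squared p-value
```

The p-value lands close to the 0.29 reported above (it differs slightly because `chisq.test` applies the Yates continuity correction by default).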

Where is the data?

d$Outcome<-ifelse(d$Outcome == "Dies",0,1)
mod<- glm(Outcome ~ .,data=d,family="binomial")
summary(mod)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3374   0.1067   0.1618   0.2198   0.3313  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     2.87517    0.04126  69.685  < 2e-16 ***
## DiabetesNo_diabetes             0.83599    0.04896  17.074  < 2e-16 ***
## Heart_diseaseNo_heart_diesease  0.35174    0.04569   7.699 1.37e-14 ***
## CancerNo_cancer                 1.45474    0.05707  25.493  < 2e-16 ***
## CovidNo_covid                   0.04775    0.04505   1.060    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19911  on 99999  degrees of freedom
## Residual deviance: 18698  on 99995  degrees of freedom
## AIC: 18708
## 
## Number of Fisher Scoring iterations: 7
confint(mod)
## Waiting for profiling to be done...
##                                      2.5 %    97.5 %
## (Intercept)                     2.79493745 2.9566837
## DiabetesNo_diabetes             0.74059385 0.9325600
## Heart_diseaseNo_heart_diesease  0.26239473 0.4415167
## CancerNo_cancer                 1.34412050 1.5678909
## CovidNo_covid                  -0.04053169 0.1360836