The problem

There lack of attention the issue I aim to explain here has been baffling, given that it is explicitly part of any case control study which involves confoounding variables.

I’ll try to lay out the problem as clearly as I can.

  1. Admmission to hospital for a stay of over one day is conditional on illness.
  2. As hosptials admit people who are ill the risk of dying during a long stay in hospital is much higher than for any other comparable period, although lower than it would be if not admitted given the illness.
  3. Hosptials keep records on admissions, discharges and deaths. So the data is available.
  4. There are four combinations of outcomes (dischaged, dead) and covid status (positive negative).
  5. The same four combinations of outcomes apply to all comorbities.
  6. Comorbities act in synergy to increase the probability of a negative outcome.
  7. In order to estimate the lethaility of SARS-Cov-2/Covid-19 (letś call it covid for short here and assume that illness is present) we need to hold for mortality associated with comorbitities without covid.
  8. Not all comorbities that lead to death are ever diagnosed. I don’t have the figure for the proportion, but it is not zero.

The challenge posed by all synergistic processes is well known. Ecologists studying complex systems talk about a “multidimensional hyperspace”. If everything has an influence on everything else it is impossible to fully understand or even properly define the role of any single variable acting alone.

In order to keep problems tractable most higher order interactions are ignored. The simplest possible model is an additive one. This is typically used in statistical analysis, even when interactions (condition effects) are known to occuur.

Simulating some data

As we don’t (yet) have real data I have simulated some dummy data using a naive bayesian network. The term “naive” refers to a structure which reduces conditional probabilites to a minimum. The arrows on the diagram point in a non causal direction as the probbilites defined are p(illness|outcome) rather than p(outcome|illness). If p(illness|death) > p(illness|discharge) the illness has an effect on overall mortality. If If p(illness|death) = p(illness|discharge) it does not.

The advantage of defining joint conditional probabilitiies using a naive bayes format is that it reduces the ddimensionality of the problem. It also allows the overall probability of the outcome to be defined. The proportion of long hospital stays that result in death is around 2%.

Notice that without any evidence on outcome the network gives equal probabilities for all illnesses. On entering evidece these probabilities change, except in the case of Covid. In this though experiment Covid has no effect on mortality. It is a neutral factor that can combine additively with other comorbities but which does not actually add to the outcome.

Data can be simulated from the Bayesian network in the form of cases with all available combinations of states as determined by the joint conditional probability table. Take into account the following

  1. No attempt has been made to acurately parameterise the model using genuie data. This is a heuristic thought tool only. Giving names to the comorbities may have been a mistake in this context as the numbers could be misintepreted. It may have been bettwer to have called them disease A, B; C etc with Covid as a mystery disease whose identity is inferred but not stated.
  2. The effects are additive. All four illnesses can be present, but this will be rare.
  3. If the model were to be acurately parameterised it could be initiated in a naive form then modified through link reversals and additions. This is time consuming and involves mental gymnastics. Bayesian netwroks are usually parameterised for machine learning simply by learning their structure and cpts from large data tables with multiple observations.
dd<-filter(d, Outcome == "Dies")

g1 = ggplot(dd,aes(x= Heart_disease)) +geom_bar()
g2 = ggplot(dd,aes(x=Covid)) +geom_bar()
g3 = ggplot(dd,aes(x=Diabetes)) +geom_bar()
g4 = ggplot(dd,aes(x=Cancer)) +geom_bar()
qecol::multiplot(g1,g2,g3,g4,cols=2)
## Loading required package: grid

d$Outcome<-ifelse(d$Outcome == "Dies",1,0)
mod<- glm(Outcome ~ .,data=d,family="binomial")
summary(mod)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3313  -0.2198  -0.1618  -0.1067   3.3374  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.87517    0.04126 -69.685  < 2e-16 ***
## DiabetesNo_diabetes            -0.83599    0.04896 -17.074  < 2e-16 ***
## Heart_diseaseNo_heart_diesease -0.35174    0.04569  -7.699 1.37e-14 ***
## CancerNo_cancer                -1.45474    0.05707 -25.493  < 2e-16 ***
## CovidNo_covid                  -0.04775    0.04505  -1.060    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19911  on 99999  degrees of freedom
## Residual deviance: 18698  on 99995  degrees of freedom
## AIC: 18708
## 
## Number of Fisher Scoring iterations: 7
table(d$Outcome,d$Covid)
##    
##     Covid No_covid
##   0 48835    49126
##   1  1041      998
d %>% group_by(Outcome,Diabetes,Heart_disease,Covid, Cancer) %>% summarise(n=n()) ->d
## `summarise()` regrouping output by 'Outcome', 'Diabetes', 'Heart_disease', 'Covid' (override with `.groups` argument)
aqm::dt(d)
write.csv(d,"simulated_deaths.csv")