There lack of attention the issue I aim to explain here has been baffling, given that it is explicitly part of any case control study which involves confoounding variables.
I’ll try to lay out the problem as clearly as I can.
The challenge posed by all synergistic processes is well known. Ecologists studying complex systems talk about a “multidimensional hyperspace”. If everything has an influence on everything else it is impossible to fully understand or even properly define the role of any single variable acting alone.
In order to keep problems tractable most higher order interactions are ignored. The simplest possible model is an additive one. This is typically used in statistical analysis, even when interactions (condition effects) are known to occuur.
As we don’t (yet) have real data I have simulated some dummy data using a naive bayesian network. The term “naive” refers to a structure which reduces conditional probabilites to a minimum. The arrows on the diagram point in a non causal direction as the probbilites defined are p(illness|outcome) rather than p(outcome|illness). If p(illness|death) > p(illness|discharge) the illness has an effect on overall mortality. If If p(illness|death) = p(illness|discharge) it does not.
The advantage of defining joint conditional probabilitiies using a naive bayes format is that it reduces the ddimensionality of the problem. It also allows the overall probability of the outcome to be defined. The proportion of long hospital stays that result in death is around 2%.
Notice that without any evidence on outcome the network gives equal probabilities for all illnesses. On entering evidece these probabilities change, except in the case of Covid. In this though experiment Covid has no effect on mortality. It is a neutral factor that can combine additively with other comorbities but which does not actually add to the outcome.
Data can be simulated from the Bayesian network in the form of cases with all available combinations of states as determined by the joint conditional probability table. Take into account the following
dd<-filter(d, Outcome == "Dies")
g1 = ggplot(dd,aes(x= Heart_disease)) +geom_bar()
g2 = ggplot(dd,aes(x=Covid)) +geom_bar()
g3 = ggplot(dd,aes(x=Diabetes)) +geom_bar()
g4 = ggplot(dd,aes(x=Cancer)) +geom_bar()
qecol::multiplot(g1,g2,g3,g4,cols=2)
## Loading required package: grid
d$Outcome<-ifelse(d$Outcome == "Dies",1,0)
mod<- glm(Outcome ~ .,data=d,family="binomial")
summary(mod)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3313 -0.2198 -0.1618 -0.1067 3.3374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.87517 0.04126 -69.685 < 2e-16 ***
## DiabetesNo_diabetes -0.83599 0.04896 -17.074 < 2e-16 ***
## Heart_diseaseNo_heart_diesease -0.35174 0.04569 -7.699 1.37e-14 ***
## CancerNo_cancer -1.45474 0.05707 -25.493 < 2e-16 ***
## CovidNo_covid -0.04775 0.04505 -1.060 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19911 on 99999 degrees of freedom
## Residual deviance: 18698 on 99995 degrees of freedom
## AIC: 18708
##
## Number of Fisher Scoring iterations: 7
table(d$Outcome,d$Covid)
##
## Covid No_covid
## 0 48835 49126
## 1 1041 998
d %>% group_by(Outcome,Diabetes,Heart_disease,Covid, Cancer) %>% summarise(n=n()) ->d
## `summarise()` regrouping output by 'Outcome', 'Diabetes', 'Heart_disease', 'Covid' (override with `.groups` argument)
aqm::dt(d)
write.csv(d,"simulated_deaths.csv")