Informatively empty clusters with application to multigenerational studies

Recently, Kioumourtzoglou examined the impact of in-utero diethylstilbestrol (DES) exposure among nurses on attention-deficit/hyperactivity disorder (ADHD) in their children.

2019 Summary

Exposures with multigenerational effects have profound implications for public health, affecting increasingly more people as the exposed population reproduces. Multigenerational studies, however, are susceptible to informative cluster size, occurring when the number of children to a mother (the cluster size) is related to their outcomes, given covariates. A natural question then arises: what if some women bear no children at all? The impact of these potentially informative empty clusters is currently unknown.

This article first evaluates the performance of standard methods for informative cluster size when cluster size is permitted to be zero. We find that if the informative cluster size mechanism induces empty clusters, standard methods lead to biased estimates of target parameters. Joint models of outcome and size are capable of valid conditional inference as long as empty clusters are explicitly included in the analysis, but in practice empty clusters regularly go unacknowledged. In contrast, estimating equation approaches necessarily omit empty clusters and therefore yield biased estimates of marginal effects.
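To make the selection problem concrete, here is a minimal simulation sketch. It is not the authors' simulation design: the shared random intercept, the parameter names (`beta0`, `gamma0`, `gamma1`, `sigma`) and their values are all assumptions chosen for illustration. A mother-level random effect raises the risk her children would face and lowers her expected number of children, so higher-risk mothers are more likely to end up with an empty cluster.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def poisson_draw(rng, mean):
    """Draw one Poisson variate (Knuth's method; adequate for modest means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate(n_mothers=20000, beta0=-3.0, gamma0=0.5, gamma1=-0.5,
             sigma=1.5, seed=1):
    """Compare average risk over all mothers vs. mothers with >= 1 child.

    All parameter values are hypothetical, not taken from the article.
    """
    rng = random.Random(seed)
    risk_all, risk_nonempty = [], []
    for _ in range(n_mothers):
        b = rng.gauss(0.0, sigma)            # mother-level random intercept
        risk = expit(beta0 + b)              # risk each of her children would face
        size = poisson_draw(rng, math.exp(gamma0 + gamma1 * b))  # cluster size
        risk_all.append(risk)
        if size > 0:                         # empty clusters never reach an outcome-only analysis
            risk_nonempty.append(risk)
    return {
        "share_empty": 1.0 - len(risk_nonempty) / n_mothers,
        "mean_risk_all": sum(risk_all) / n_mothers,
        "mean_risk_nonempty": sum(risk_nonempty) / len(risk_nonempty),
    }


if __name__ == "__main__":
    print(simulate())
```

Under these assumed values, the average risk among mothers with at least one child is lower than the average over all mothers, because the mothers who drop out are disproportionately the high-risk ones. An analysis that only sees non-empty clusters is therefore describing a selected population.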

To resolve this, we propose a joint marginalized approach that readily incorporates empty clusters and, even in their absence, permits more intuitive interpretations of population-averaged effects than do current methods. Competing methods are compared via simulation and in a study of the impact of in-utero exposure to diethylstilbestrol on the risk of attention-deficit/hyperactivity disorder (ADHD) among 106 198 children of 47 540 nurses from the Nurses' Health Study.

Study population

The proposed methods are motivated by a study of the effect of diethylstilbestrol exposure on third-generation ADHD diagnosis in the Nurses' Health Study II. The data consist of K = 61 485 female nurses aged 25–42 in 1989 who returned a series of questionnaires in subsequent years and had no multiple same-year births. In 2005 and 2013, nurses reported whether their children had been diagnosed with ADHD, and analysis is restricted to concordant responses. The data are hierarchical in nature, with N = 106 198 children clustered within families identified by their mothers (nurses).

A key feature of the data is that cluster size (number of children) is potentially informative, as seen in Table 3: ADHD prevalence ranged from 5.62% in only-children to 3.22% in children from families of five or more children. Some of this relationship may be due to diethylstilbestrol exposure, whose rate was highest for nurses with no children (2.79%) and decreased to 1.18% for those with five or more children. Critically, 23% of nurses reported no live births and were thus excluded from previous analyses. To explore the impact of this decision on the conclusions of the analysis, we now consider the full population of nurses that met the eligibility criteria, this time including those without children.


The primary aim of the study was to quantify the effect of diethylstilbestrol on third-generation ADHD diagnosis, and we compared results of each analysis approach considered in the simulations. Logistic outcome models were adjusted for nurse’s exposure to diethylstilbestrol, smoking status, and year of birth. For the joint models, we modeled cluster size using a zero-inflated Poisson model (where the Poisson component adjusted for the same covariates and the zero inflation adjusted for exposure) in order to permit informative and non-informative emptiness. For the joint model that ignores empty clusters, we assumed a Poisson distribution, with a minimum size of one. We adopted a random intercepts model with exposure-dependent variance (as in the simulations), permitting correlation to depend on diethylstilbestrol exposure.
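As a rough illustration of the cluster-size component, the sketch below writes out a zero-inflated Poisson probability mass function. The function names, the log link for the Poisson mean, the logit link for the zero-inflation probability, and the coefficients in `zip_params` are assumptions made here for illustration; the text above only states which covariates enter each part.

```python
import math


def zip_pmf(n, lam, pi_zero):
    """P(N = n) under a zero-inflated Poisson.

    lam     -- mean of the Poisson component (> 0)
    pi_zero -- probability of a structural zero, in [0, 1]
    """
    poisson = math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
    if n == 0:
        # A zero can be structural or can come from the Poisson component.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def zip_params(exposed, gamma=(0.6, -0.1), alpha=(-2.0, 0.3)):
    """Hypothetical linear predictors (coefficients are made up).

    Poisson mean uses a log link; zero inflation uses a logit link.
    """
    lam = math.exp(gamma[0] + gamma[1] * exposed)
    pi_zero = 1.0 / (1.0 + math.exp(-(alpha[0] + alpha[1] * exposed)))
    return lam, pi_zero


if __name__ == "__main__":
    for exposed in (0, 1):
        lam, pi_zero = zip_params(exposed)
        print(exposed, round(zip_pmf(0, lam, pi_zero), 3))
```

The two sources of zeros are what allow both kinds of emptiness: one plausible reading is that zeros from the Poisson component can depend on the shared random effect (informative), while structural zeros from the inflation component need not (non-informative).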


Estimates of marginal parameters can be found in Table 4. Diethylstilbestrol had a moderate adverse population-averaged (marginal) effect on ADHD risk, and estimates varied only somewhat across analyses: the independence estimating equations odds ratio estimate was 1.46 [95% confidence interval (CI) (1.19–1.78)] and was slightly larger than the cluster size weighted estimating equations estimate of 1.39 (1.13–1.71). Because these estimates are consistent for distinct parameters only under informative cluster size, these results (in light of the large sample size) suggest weak informativeness. As such, emptiness did not seem to have a large impact here, and the joint marginalized estimate fell between those of the estimating equations [1.41; 95% CI (1.14–1.73)].
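The difference between the two estimating-equation targets can be seen in a toy prevalence calculation. This is a simplified sketch with made-up data and hypothetical function names, using intercept-only averages rather than the covariate-adjusted odds ratios reported above: the independence version weights every child equally, while the cluster-size-weighted version weights every non-empty family equally.

```python
def child_weighted_prevalence(clusters):
    """Independence-style average: every child counts equally."""
    outcomes = [y for cluster in clusters for y in cluster]
    return sum(outcomes) / len(outcomes)


def cluster_weighted_prevalence(clusters):
    """Cluster-size-weighted average: every non-empty cluster counts equally."""
    means = [sum(c) / len(c) for c in clusters if c]
    return sum(means) / len(means)


# Toy families (1 = diagnosed, 0 = not); the empty list is a mother with no children.
families = [[1], [1, 0], [0, 0, 0, 0], []]

if __name__ == "__main__":
    print(child_weighted_prevalence(families))    # 2 of 7 children
    print(cluster_weighted_prevalence(families))  # mean of 1, 0.5 and 0
```

In this toy data smaller families have higher prevalence, so the two averages disagree, which is the signature of informative cluster size. Note that the mother with no children contributes nothing to either quantity: removing the empty list leaves both results unchanged.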

Conditional parameter estimates can be found in Table 5. The cluster-specific (conditional) estimates of the exposure-ADHD odds ratio were naturally much larger, but still varied little across analyses, ranging from 2.39 (1.38–4.12) under the outcome-only GLMM to 2.33 (1.33–4.06) under the complete joint model. The other covariate-outcome associations varied negligibly across conditional analyses.

Despite discrepant levels of correlation by exposure level (σ0 and σ1 are estimated to be 2.03 and 1.66 under the joint model), the variation in exposure effects across analyses is modest. This is because although there was a strong potential for informative cluster size (see Table 3), the actual level of informativeness was low (the estimate of the scaling parameter for the unexposed was −0.01).
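To illustrate why conditional odds ratios tend to be larger than marginal ones in a random-intercept logistic model, the sketch below averages the conditional risk over a normal random intercept by numerical integration. It assumes a model of the form logit P(Y = 1 | b) = beta0 + log_or * exposure + b, with b normal with mean zero and an exposure-specific standard deviation. The intercept used in the example is hypothetical (it is not reported above), so the output is not expected to reproduce Table 4.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_prob(eta, sigma, half_width=8.0, steps=4000):
    """E[expit(eta + b)] for b ~ N(0, sigma^2), sigma > 0, by the trapezoidal rule."""
    lo = -half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        b = lo + i * h
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end points get half weight
        total += weight * expit(eta + b) * density
    return total * h


def marginal_odds_ratio(beta0, log_or, sigma0, sigma1):
    """Population-averaged odds ratio implied by a conditional log odds ratio."""
    p0 = marginal_prob(beta0, sigma0)            # unexposed
    p1 = marginal_prob(beta0 + log_or, sigma1)   # exposed
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


if __name__ == "__main__":
    beta0 = -5.0  # hypothetical intercept, for illustration only
    print(marginal_odds_ratio(beta0, math.log(2.33), 2.0, 2.0))
    print(marginal_odds_ratio(beta0, math.log(2.33), 2.03, 1.66))
```

With a common random-intercept standard deviation, the implied marginal odds ratio lies between 1 and the conditional value. When the standard deviations differ by exposure, as in the second call, the gap between conditional and marginal effects also depends on that difference and on the intercept.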


