Outcome Modelling Methods


Kern, H. L., Stuart, E. A., Hill, J., & Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness, 9(1), 103–127.

  • Randomized experiments are considered the gold standard for causal inference because they can provide unbiased estimates of treatment effects for the experimental participants. However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments because the experimental participants may be unrepresentative of the target population of interest.
  • This article examines whether generalization can be aided by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that model the outcome (both families are sketched in code after this entry). Two simulation studies and one empirical analysis investigate and compare the methods’ performance; one simulation uses purely synthetic data, while the other draws on data from an evaluation of a school-based dropout prevention program.
  • Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multisite experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind.
  • Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.
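  • Illustrative sketch: a minimal, hedged Python implementation of the two families of methods compared above, i.e., reweighting the trial toward the target population via estimated participation probabilities, and fitting an outcome model in the trial and predicting into the target. The data-generating process, variable names, and model choices are illustrative assumptions, not the article's own simulation design.

```python
# A minimal sketch (not the authors' implementation) of the two approaches:
# (1) inverse-odds-of-participation weighting, (2) outcome modelling.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# --- simulate a target population and an unrepresentative trial sample ---
n_pop = 5000
x_pop = rng.normal(0.0, 1.0, n_pop)               # covariate in the target population
p_trial = 1 / (1 + np.exp(-(1.0 * x_pop - 1.0)))  # trial over-samples high-x units
in_trial = rng.binomial(1, p_trial) == 1

x_trial = x_pop[in_trial]
t = rng.binomial(1, 0.5, x_trial.size)            # randomized treatment within the trial
tau = 1.0 + 0.5 * x_trial                         # treatment effect varies with x
y = 2.0 + x_trial + tau * t + rng.normal(0, 1, x_trial.size)

# --- (1) reweighting: inverse odds of trial participation ---
sel = LogisticRegression().fit(x_pop.reshape(-1, 1), in_trial.astype(int))
p_hat = sel.predict_proba(x_trial.reshape(-1, 1))[:, 1]
w = (1 - p_hat) / p_hat                           # weight trial units toward the non-trial target
ate_weighted = (np.average(y[t == 1], weights=w[t == 1])
                - np.average(y[t == 0], weights=w[t == 0]))

# --- (2) outcome modelling: fit E[Y | X, T] in the trial, predict into the target ---
om = LinearRegression().fit(np.column_stack([x_trial, t, x_trial * t]), y)
x_target = x_pop[~in_trial]
y1 = om.predict(np.column_stack([x_target, np.ones_like(x_target), x_target]))
y0 = om.predict(np.column_stack([x_target, np.zeros_like(x_target), np.zeros_like(x_target)]))
ate_outcome = (y1 - y0).mean()

print(f"unadjusted trial ATE: {y[t == 1].mean() - y[t == 0].mean():.2f}")
print(f"reweighted ATE: {ate_weighted:.2f}, outcome-model ATE: {ate_outcome:.2f}")
print(f"true ATE in the non-trial target: {(1.0 + 0.5 * x_pop[~in_trial]).mean():.2f}")
```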

Orr, L. L., Olsen, R. B., Bell, S. H., Schmid, I., Shivji, A., & Stuart, E. A. (2019). Using the results from rigorous multisite evaluations to inform local policy decisions. Journal of Policy Analysis and Management, 38(4), 978–1003.

  • Evidence-based policy at the local level requires predicting the impact of an intervention to inform whether it should be adopted. Increasingly, local policymakers have access to published research evaluating the effectiveness of policy interventions from national research clearinghouses that review and disseminate evidence from program evaluations. Through these evaluations, local policymakers have a wealth of evidence describing what works, but not necessarily where.
  • Multisite evaluations may produce unbiased estimates of the average impact of an intervention in the study sample and still produce inaccurate predictions of the impact for localities outside the sample for two reasons: (1) the impact of the intervention may vary across localities, and (2) the evaluation estimate is subject to sampling error. Unfortunately, there is relatively little evidence on how much the impacts of policy interventions vary from one locality to another and almost no evidence on the implications of this variation for the accuracy with which the local impact of adopting an intervention can be predicted using findings from an evaluation in other localities.
  • In this paper, we present a set of methods for quantifying the accuracy of the local predictions that can be obtained using the results of multisite randomized trials and for assessing the likelihood that prediction errors will lead to errors in local policy decisions; a stylized version of this calculation is sketched after this entry.
  • We demonstrate these methods using three evaluations of educational interventions, providing the first empirical evidence of the ability to use multisite evaluations to predict impacts in individual localities—i.e., the ability of “evidence-based policy” to improve local policy.
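  • Illustrative sketch: a stylized version of the accuracy calculation described above, assuming a simple normal model in which true local impacts vary around the multisite average with cross-site standard deviation tau and the average itself carries sampling error. All numbers are invented placeholders, not results from the paper's three evaluations.

```python
# Stylized calculation (illustrative numbers, not from the paper): how accurate is
# the multisite average impact as a prediction for a new locality, and how likely
# is a wrong-sign local decision?
import math
from scipy.stats import norm

avg_impact = 0.10   # estimated average impact across study sites (effect-size units)
se_avg = 0.04       # standard error of that average (sampling error)
tau = 0.15          # estimated cross-site standard deviation of true local impacts

# Prediction error for a new locality combines cross-site variation and sampling error.
rmse_prediction = math.sqrt(tau**2 + se_avg**2)

# Probability that a locality adopting the intervention because the average impact
# is positive in fact has a true local impact <= 0, under a normal predictive distribution.
p_wrong_decision = norm.cdf(0.0, loc=avg_impact, scale=rmse_prediction)

print(f"RMSE of the local impact prediction: {rmse_prediction:.3f}")
print(f"P(true local impact <= 0): {p_wrong_decision:.2f}")
```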

Robertson, S. E., Steingrimsson, J. A., Joyce, N. R., Stuart, E. A., & Dahabreh, I. J. (2021). Estimating subgroup effects in generalizability and transportability analyses. arXiv preprint arXiv:2109.14075.

  • Methods for extending – generalizing or transporting – inferences from a randomized trial to a target population involve conditioning on a large set of covariates that is sufficient for rendering the randomized and non-randomized groups exchangeable. Yet, decision-makers are often interested in examining treatment effects in subgroups of the target population defined in terms of only a few discrete covariates.
  • Here, we propose methods for estimating subgroup-specific potential outcome means and average treatment effects in generalizability and transportability analyses, using outcome model-based (g-formula), weighting, and augmented weighting estimators (the outcome-model approach is sketched after this entry). We consider estimating subgroup-specific average treatment effects in the target population and in its non-randomized subset, and provide methods that are appropriate for both nested and non-nested trial designs.
  • As an illustration, we apply the methods to data from the Coronary Artery Surgery Study to compare the effect of surgery plus medical therapy versus medical therapy alone for chronic coronary artery disease in subgroups defined by history of myocardial infarction.
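  • Illustrative sketch: a minimal outcome-model (g-formula) estimator for subgroup effects in a transportability analysis, under assumed data and column names rather than the authors' code or the CASS data. The steps are: fit E[Y | X, A] among randomized participants, predict both potential outcomes for the non-randomized target sample, and average the predicted contrasts within subgroups defined by a discrete covariate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def simulate(n, randomized):
    x = rng.normal(1.0 if randomized else 0.0, 1.0, n)  # continuous covariate (shifted in the trial)
    subgroup = rng.binomial(1, 0.5, n)                  # discrete covariate, e.g. history of MI
    a = rng.binomial(1, 0.5, n) if randomized else np.zeros(n, dtype=int)
    y = 1.0 + x + (0.5 + 1.0 * subgroup) * a + rng.normal(0, 1, n)
    return pd.DataFrame({"x": x, "subgroup": subgroup, "a": a, "y": y})

trial = simulate(2000, True)    # randomized participants
target = simulate(3000, False)  # non-randomized target sample (its outcomes are never used)

def design(df, a):
    # treatment, covariates, and treatment-covariate interactions
    return pd.DataFrame({"x": df.x, "subgroup": df.subgroup, "a": a,
                         "a_x": a * df.x, "a_sub": a * df.subgroup})

# step 1: outcome model fit among randomized participants
model = LinearRegression().fit(design(trial, trial.a), trial.y)

# step 2: predict both potential outcomes for every member of the target sample
target = target.assign(mu1=model.predict(design(target, 1)),
                       mu0=model.predict(design(target, 0)))

# step 3: average the predicted contrasts within each subgroup
target["effect"] = target.mu1 - target.mu0
print(target.groupby("subgroup")["effect"].mean())  # subgroup-specific transported ATEs
```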

Verde, P. E., Ohmann, C., Morbach, S., & Icks, A. (2016). Bayesian evidence synthesis for exploring generalizability of treatment effects: A case study of combining randomized and non-randomized results in diabetes. Statistics in Medicine, 35(10), 1654–1675.

  • In this paper, we present a unified modeling framework for combining aggregated data from randomized controlled trials (RCTs) with individual participant data (IPD) from observational studies. Rather than simply pooling the available evidence into an overall treatment effect adjusted for potential confounding, this work explores treatment effects in the specific patient populations reflected by the IPD. By collecting IPD, we can potentially gain new insights from RCTs’ results that cannot be seen using only a meta-analysis of RCTs.
  • We present a new Bayesian hierarchical meta-regression model that combines submodels representing the different types of data into a coherent analysis. Predictors of baseline risk are estimated from the individual data. Simultaneously, a bivariate random-effects distribution of baseline risk and treatment effects is estimated from the combined individual and aggregate data. Therefore, given a subgroup of interest, the estimated treatment effect can be calculated through its correlation with baseline risk (this prediction step is sketched after this entry).
  • We highlight different types of model parameters: those that are the focus of inference (e.g., the treatment effect in a subgroup of patients) and those that adjust for biases introduced by the data collection process (e.g., threats to internal or external validity). The model is applied to a case study in which results from RCTs investigating the efficacy of treatments for diabetic foot problems are extrapolated to groups of patients treated in routine medical practice and enrolled in a prospective cohort study.
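  • Illustrative sketch: the subgroup prediction step implied by the bivariate random-effects distribution described above, i.e., the conditional mean of the treatment effect given a subgroup's baseline risk. The posterior summaries below are invented placeholders, not estimates from the diabetic-foot case study, and the full Bayesian meta-regression is not reproduced here.

```python
# Prediction step only: conditional mean of the treatment effect (log odds ratio)
# given a subgroup's baseline risk (logit scale), using bivariate-normal algebra.
import math

def subgroup_effect(theta_sub, mu_theta, mu_delta, sd_theta, sd_delta, rho):
    """E[delta | theta = theta_sub] under a bivariate normal random-effects distribution."""
    return mu_delta + rho * (sd_delta / sd_theta) * (theta_sub - mu_theta)

# illustrative posterior summaries of the bivariate random-effects distribution
mu_theta, sd_theta = -1.0, 0.6    # mean / SD of baseline log-odds of the event
mu_delta, sd_delta = -0.4, 0.3    # mean / SD of treatment effects (log OR)
rho = -0.5                        # correlation: higher-risk groups benefit more here

# a cohort subgroup with a 40% baseline event risk, converted to the logit scale
theta_sub = math.log(0.40 / 0.60)
log_or = subgroup_effect(theta_sub, mu_theta, mu_delta, sd_theta, sd_delta, rho)
print(f"predicted subgroup log OR: {log_or:.2f} (OR = {math.exp(log_or):.2f})")
```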

Weisberg, H. I., Hayden, V. C., & Pontes, V. P. (2009). Selection criteria and generalizability within the counterfactual framework: Explaining the paradox of antidepressant-induced suicidality? Clinical Trials, 6(2), 109–118.

  • BACKGROUND: Although the superior internal validity of the randomized clinical trial (RCT) is invaluable to establish causality, generalizability is far from guaranteed. In particular, strict selection criteria intended to maximize treatment efficacy and safety can impair external validity. This problem is widely acknowledged in principle but sometimes ignored in practice, with considerable consequences for treatment options.
  • PURPOSE: We demonstrate how selection of patients for an RCT can bias the results when the treatment effect varies across individuals. Indeed, not only the magnitude, but even the direction of the causal effect found in an RCT can differ from the causal effect in the target population.
  • METHODS: A counterfactual model is developed to represent the selection process explicitly. This simple extension of the standard counterfactual model is used to explore the implications of restrictive exclusion criteria intended to eliminate high-risk individuals. The counterintuitive findings of a recent FDA meta-analysis of suicidality in pediatric populations treated with antidepressant medications are interpreted in the light of this counterfactual model.
  • RESULTS: When the causal effect of an intervention can vary across individuals, the potential for selection bias (in the sense of a threat to external validity) can be serious. In particular, we demonstrate that the stricter the inclusion/exclusion criteria, the greater the potential inflation of the relative risk (illustrated in the simulation sketched after this entry). A critical factor in determining bias is the extent to which individuals with differing types of causal effects can be distinguished prior to sampling. Furthermore, we propose methods that can sometimes be useful for identifying the existence of bias in an actual study. When applied to the FDA meta-analysis of pediatric suicidality in RCTs of modern antidepressant medications, these methods suggest that the elevated risk observed may be an artifact of selection bias.
  • LIMITATIONS: Real-life scenarios are generally more complex than the counterfactual model presented here. Future modeling efforts are needed to refine and extend our approach.
  • CONCLUSIONS: When variation of treatment effects across individuals is plausible, lack of generalizability should be a serious concern. Therefore, external validity of RCTs needs to be carefully considered in the design of an RCT and the interpretation of its results, especially when the study can influence regulatory decisions about drug safety. RCTs should not automatically be considered definitive, especially when their results conflict with those of observational studies. Whenever possible, empirical evidence of bias resulting from sample selection should be obtained and taken into account.
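  • Illustrative sketch: a small simulation of the selection mechanism analyzed above, under assumed potential-outcome distributions rather than the paper's counterfactual model. Treatment prevents adverse events among high-risk individuals but causes a few among low-risk individuals, so excluding high-risk patients from the trial can reverse the apparent direction of the effect.

```python
# Illustrative simulation (assumed distributions, not the paper's model): with
# heterogeneous causal effects, strict exclusion of high-risk patients changes the
# mix of response types in the trial, so the trial relative risk can exceed 1 even
# though treatment is protective in the full target population.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

risk = rng.uniform(0, 1, n)              # latent baseline risk, visible to trial screeners
y0 = rng.binomial(1, 0.30 * risk)        # adverse event under control
# Under treatment: events are prevented among high-risk individuals, but a small
# excess of events is caused among low-risk individuals.
y1 = np.where(risk > 0.5,
              rng.binomial(1, 0.10 * risk),
              np.maximum(y0, rng.binomial(1, 0.02, n)))

def relative_risk(y0_, y1_):
    # An RCT estimates these two potential-outcome means without bias in its own sample.
    return y1_.mean() / y0_.mean()

eligible = risk < 0.5                    # strict exclusion criteria drop high-risk patients
print(f"RR in the full target population: {relative_risk(y0, y1):.2f}")                      # < 1
print(f"RR among trial-eligible patients: {relative_risk(y0[eligible], y1[eligible]):.2f}")  # > 1
```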