Overviews or Conceptual Frameworks


Ackerman, B., Schmid, I., Rudolph, K. E., Seamans, M. J., Susukida, R., Mojtabai, R., & Stuart, E. A. (2019). Implementing statistical methods for generalizing randomized trial findings to a target population. Addictive Behaviors, 94, 124–132.

  • Randomized trials are considered the gold standard for assessing the causal effects of a drug or intervention in a study population, and their results are often utilized in the formulation of health policy. However, there is growing concern that results from trials do not necessarily generalize well to their respective target populations, in which policies are enacted, due to substantial demographic differences between study and target populations. In trials related to substance use disorders (SUDs), especially, strict exclusion criteria make it challenging to obtain study samples that are fully “representative” of the populations that policymakers may wish to generalize their results to.
  • In this paper, we provide an overview of post-trial statistical methods for assessing and improving upon the generalizability of a randomized trial to a well-defined target population. We then illustrate the different methods using a randomized trial related to methamphetamine dependence and a target population of substance abuse treatment seekers, and provide software to implement the methods in R using the “generalize” package.
  • We discuss several practical considerations for researchers who wish to utilize these tools, such as the importance of acquiring population-level data to represent the target population of interest, and the challenges of data harmonization.
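
The paper's software is the R package “generalize”; as a language-neutral illustration (not that package's interface), the following minimal Python sketch shows the core post-trial weighting step the paper describes, with hypothetical column names (`in_trial`, `treat`, `y`) and covariates:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stacked data set: trial rows (in_trial=1) followed by rows
# sampled from the target population (in_trial=0).
COVARIATES = ["age", "severity"]

def generalized_ate(df: pd.DataFrame) -> float:
    """Estimate the target-population average treatment effect by
    inverse-probability-of-sampling (odds) weighting of the trial arms,
    one of the post-trial methods the paper reviews."""
    # 1. Model the probability of trial participation given covariates.
    ps_model = LogisticRegression().fit(df[COVARIATES], df["in_trial"])
    p_trial = ps_model.predict_proba(df[COVARIATES])[:, 1]

    # 2. Odds weights make the weighted trial sample resemble the
    #    covariate distribution of the target-population sample.
    in_trial = df["in_trial"].to_numpy() == 1
    w = (1 - p_trial[in_trial]) / p_trial[in_trial]

    # 3. Weighted difference in mean outcomes between the trial arms.
    treat = df.loc[in_trial, "treat"].to_numpy()
    y = df.loc[in_trial, "y"].to_numpy()
    mu1 = np.average(y[treat == 1], weights=w[treat == 1])
    mu0 = np.average(y[treat == 0], weights=w[treat == 0])
    return mu1 - mu0
```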

Andrews, I., & Oster, E. (2019). A simple approximation for evaluating external validity bias. Economics Letters, 178, 58–62.

  • We develop a simple approximation that relates the total external validity bias in randomized trials to (i) bias from selection on observables and (ii) a measure for the role of treatment effect heterogeneity in driving selection into the experimental sample.
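
The shape of the decomposition can be previewed with an exact identity in our notation (a rough aid to intuition, not the authors' approximation), where S = 1 denotes selection into the experimental sample and tau is the unit-level treatment effect:

```latex
\underbrace{E[\tau \mid S=1] - E[\tau]}_{\text{external validity bias}}
= \underbrace{E[\tau_X \mid S=1] - E[\tau_X]}_{\text{selection on observables}}
+ \underbrace{E[\tau - \tau_X \mid S=1]}_{\text{heterogeneity-driven selection}},
\qquad \tau_X \equiv E[\tau \mid X].
```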

Bell, S. H., & Stuart, E. A. (2016). On the “where” of social experiments: The nature and extent of the generalizability problem. New Directions for Evaluation, 2016(152), 47–59.

  • Although randomized experiments are lauded for their high internal validity, they have been criticized for the limited external validity of their results. This chapter describes research strategies for investigating how much nonrepresentative site selection may limit external validity and bias impact findings. The magnitude of external validity bias is potentially much larger than what is thought of as an acceptable level of internal validity bias. The chapter argues that external validity bias should always be investigated by the best available means and addressed directly when presenting evaluation results. These observations flag the importance of making external validity a priority in evaluation planning.

Cook, T. D. (2014). Generalizing causal knowledge in the policy sciences: External validity as a task of both multi-attribute representation and multi-attribute extrapolation. Journal of Policy Analysis and Management, 33(2), 527–536.

  • I have been asked to write about methodological issues likely to be prominent in future public policy research. Many issues would deserve attention in a longer presentation, but here I want to concentrate on external validity and its links to evidence-based policy. Such policy uses social science knowledge about what has worked in the past to inform policy decisions in the future. This requires justified procedures for describing the populations of persons, settings, and times in which a given causal relationship has been demonstrated to date, and justified procedures for moving from operational details about the cause and effect to the category labels we use to designate the more general cause or effect constructs. We call this the representation function, since the need is to know what the sampling particulars represent as more general populations or categories.
  • The traditional sampling theory framing of this issue would be the following: Given the populations of persons, settings, and times, and the treatment and outcome constructs to which I want to generalize, how well do the specifics actually sampled match these populations or categories? Informing future policy decisions also requires justified procedures for extrapolating past findings to future periods when the populations of treatment providers and recipients might be different, when adaptations of a previously studied treatment might be required, when a novel outcome is targeted, when the application might be to settings different from earlier ones, and when other factors affecting the outcome are novel too. We call this the extrapolation function, since inferences are required about populations and categories that are in some ways different from the sampled study particulars.
  • Sampling theory cannot even pretend to deal with the framing of causal generalization as extrapolation, since the emphasis is on taking observed causal findings and projecting them beyond the observed sampling specifics. We argue here that both representation and extrapolation are part of a broad and useful understanding of external validity; that each has been quite neglected in the past relative to internal validity (namely, whether the link between manipulated treatments and observed effects is plausibly causal); that few practical methods exist for validly representing the populations and other constructs sampled in the existing literature; and that even fewer such methods exist for extrapolation. Yet causal extrapolation, I argue, is more important for the policy sciences than is causal representation.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin, Boston, MA.

Dahabreh, I. J., Haneuse, S. J. A., Robins, J. M., Robertson, S. E., Buchanan, A. L., Stuart, E. A., & Hernán, M. A. (2021). Study designs for extending causal inferences from a randomized trial to a target population. American Journal of Epidemiology, 190(8), 1632–1642. https://doi.org/10.1093/aje/kwaa270

  • In this article, we examine study designs for extending (generalizing or transporting) causal inferences from a randomized trial to a target population. Specifically, we consider nested trial designs, where randomized individuals are nested within a sample from the target population, and nonnested trial designs, including composite data-set designs, where observations from a randomized trial are combined with those from a separately obtained sample of nonrandomized individuals from the target population. We show that the counterfactual quantities that can be identified in each study design depend on what is known about the probability of sampling nonrandomized individuals.
  • For each study design, we examine identification of counterfactual outcome means via the g-formula and inverse probability weighting. Last, we explore the implications of the sampling properties underlying the designs for the identification and estimation of the probability of trial participation.
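
For reference, the two identification strategies take the following standard forms when transporting the mean outcome under treatment a to the non-randomized target population (S = 0); this is textbook notation, not an excerpt from the paper:

```latex
\text{g-formula:}\qquad
E[Y^{a} \mid S=0] = E\big[\, E[Y \mid X, S=1, A=a] \;\big|\; S=0 \,\big],
```

```latex
\text{IP weighting:}\qquad
E[Y^{a} \mid S=0] =
\frac{1}{\Pr(S=0)}\,
E\!\left[ \frac{I(S=1,\,A=a)\,\Pr(S=0 \mid X)}{\Pr(S=1 \mid X)\,\Pr(A=a \mid X,\,S=1)}\; Y \right].
```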

Dahabreh, I. J., & Hernán, M. A. (2019). Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34(8), 719–722.

  • In this issue, Weiss discusses “generalizing” inferences from randomized trials to other populations [1]. However, he does not explicitly define what “generalizing” means, assumes that “generalizing” the results of a randomized trial has a single goal, and reduces generalizability to a binary subjective judgment—findings are either generalizable or not generalizable. A growing literature (e.g., [1–13]) precisely defines the several meanings and goals of extending inferences from randomized trials to another population, and describes analyses whose findings go beyond simple binary judgements. Here, we provide a non-technical overview of this literature. First, we briefly review the main concepts, then we outline the available study designs and statistical approaches.

Dahabreh, I. J., Robertson, S. E., & Steingrimsson, J. A. (2022). Learning about treatment effects in a new target population under transportability assumptions for relative effect measures. arXiv preprint arXiv:2202.11622.

  • Epidemiologists and applied statisticians often believe that relative effect measures conditional on covariates, such as risk ratios and mean ratios, are “transportable” across populations. Here, we examine the identification of causal effects in a target population using an assumption that conditional relative effect measures (e.g., conditional risk ratios or mean ratios) are transportable from a trial to the target population.
  • We show that transportability for relative effect measures is largely incompatible with transportability for difference effect measures, unless the treatment has no effect on average or one is willing to make even stronger transportability assumptions, which imply the transportability of both relative and difference effect measures. We then describe how marginal causal estimands in a target population can be identified under the assumption of transportability of relative effect measures, when we are interested in the effectiveness of a new experimental treatment in a target population where the only treatment in use is the control treatment evaluated in the trial.
  • We extend these results to consider cases where the control treatment evaluated in the trial is only one of the treatments in use in the target population, under an additional partial exchangeability assumption in the target population (i.e., a partial assumption of no unmeasured confounding in the target population). We also develop identification results that allow for the covariates needed for transportability of relative effect measures to be only a small subset of the covariates needed to control confounding in the target population. Last, we propose estimators that can be easily implemented in standard statistical software.
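
In the simplest case the abstract describes, where the trial's control is the only treatment in use in the target population (S = 0), the flavor of the identification result can be sketched in our notation: the conditional mean ratio estimated in the trial rescales the observed target-population outcome regression,

```latex
MR(X) \equiv \frac{E[Y \mid X, S=1, A=1]}{E[Y \mid X, S=1, A=0]},
\qquad
E[Y^{1} \mid S=0] = E\big[\, MR(X)\, E[Y \mid X, S=0] \;\big|\; S=0 \,\big],
```

because transportability of MR(X), together with everyone in the target receiving control, gives E[Y^1 | X, S=0] = MR(X) E[Y | X, S=0].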

Dahabreh, I. J., Robins, J. M., Haneuse, S. J., Saeed, I., Robertson, S. E., Stuart, E. A., & Hernán, M. A. (2019). Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.10684.

  • Extending (generalizing or transporting) causal inferences from a randomized trial to a target population requires “generalizability” or “transportability” assumptions, which state that randomized and non-randomized individuals are exchangeable conditional on baseline covariates. These assumptions are made on the basis of background knowledge, which is often uncertain or controversial, and need to be subjected to sensitivity analysis.
  • We present simple methods for sensitivity analyses that do not require detailed background knowledge about specific unknown or unmeasured determinants of the outcome or modifiers of the treatment effect. Instead, our methods directly parameterize violations of the assumptions using bias functions.
  • We show how the methods can be applied to non-nested trial designs, where the trial data are combined with a separately obtained sample of non-randomized individuals, as well as to nested trial designs, where a clinical trial is embedded within a cohort sampled from the target population. We illustrate the methods using data from a clinical trial comparing treatments for chronic hepatitis C infection.
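
The approach can be sketched in our notation (a paraphrase of the general idea, not the paper's exact parameterization): a bias function quantifies the failure of exchangeability between randomized and non-randomized individuals, and the transported mean is corrected accordingly,

```latex
u(x) \equiv E[Y^{a} \mid X=x,\, S=1] - E[Y^{a} \mid X=x,\, S=0],
\qquad
E[Y^{a} \mid S=0] = E\big[\, E[Y \mid X, S=1, A=a] - u(X) \;\big|\; S=0 \,\big].
```

Setting u(x) = 0 for all x recovers the usual analysis; the sensitivity analysis re-estimates the effect over a grid of candidate bias functions, for example constants u(x) = δ.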

Gargani, J., & Donaldson, S. I. (2011). What Works for Whom, Where, Why, for What, and When? Using Evaluation Evidence to Take Action in Local Contexts. New Directions for Evaluation, 130, 17–30.

  • This chapter describes a concrete process that stakeholders can use to make predictions about the future performance of programs in local contexts. Within the field of evaluation, the discussion of validity as it relates to outcome evaluation seems to be focused largely on questions of internal validity (Did it work?) with less emphasis on external validity (Will it work?). However, recent debates about the credibility of evaluation evidence have called attention to how evaluations can inform predictions about future performance.
  • Using this as a starting point, we expand upon the traditional framework regarding external validity that is closely associated with Donald Campbell. The result is a process for making predictions and taking action that is collaborative, systematic, feasible, and transparent.

Gheorghe, A., Roberts, T., Hemming, K., & Calvert, M. (2015). Evaluating the Generalisability of Trial Results: Introducing a Centre- and Trial-Level Generalisability Index. PharmacoEconomics, 33(11), 1195–1214.

  • BACKGROUND: Few randomised controlled trials (RCTs) recruit centres representatively, which may limit the external validity of trial results. OBJECTIVE: The aim of this study was to propose a proof-of-concept method of assessing the generalisability of the clinical and cost-effectiveness findings of a given RCT.
  • METHODS: We developed a generalisability index (Gix), informed by centre-level characteristics, as a measure of centre and trial representativeness. The centre-level Gix quantifies how representative a centre is in relation to its jurisdiction, e.g. a country or health authority. The trial-level Gix quantifies how representative trial recruitment is in relation to clinical practice in the jurisdiction. Taking a real-world RCT as a case study and assuming trial-wide results to represent “true jurisdiction values”, we used simulation methods to recreate 5000 RCTs and investigate the relationship between trial representativeness, reflected by the standardised trial-Gix, and the deviation of simulated trial results from the “true values”.
  • RESULTS: The simulation study provides evidence that trial results (odds ratio for the primary outcome and incremental quality-adjusted life-years) were influenced by the representativeness of the sample of recruiting centres. Simulated RCTs with the closest results to the “true values” were those whose recruitment closely mirrored the jurisdiction-wide context. Results appeared robust to six alternative specifications of the Gix.
  • CONCLUSIONS: Our findings suggest that an unrepresentative selection of centres limits the external validity of trial results. The Gix may be a valuable tool to help facilitate rational selection of trial centres and ensure the generalisability of results at the jurisdiction level.

Green, L. W., & Glasgow, R. E. (2006). Evaluating the Relevance, Generalization, and Applicability of Research: Issues in External Validation and Translation Methodology. Evaluation & the Health Professions, 29(1), 126–153.

  • Starting with the proposition that “if we want more evidence-based practice, we need more practice-based evidence,” this article (a) offers questions and guides that practitioners, program planners, and policy makers can use to determine the applicability of evidence to situations and populations other than those in which the evidence was produced (generalizability), (b) suggests criteria that reviewers can use to evaluate external validity and potential for generalization, and (c) recommends procedures that practitioners and program planners can use to adapt evidence-based interventions and integrate them with evidence on the population and setting characteristics, theory, and experience into locally appropriate programs.
  • The development and application in tandem of such questions, guides, criteria, and procedures can be a step toward increasing the relevance of research for decision making and should support the creation and reporting of more practice-based research having high external validity.

Hartman, E., Grieve, R., Ramsahai, R., & Sekhon, J. S. (2015). From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(3), 757–778.

  • Randomised controlled trials (RCTs) can provide unbiased estimates of sample average treatment effects. However, a common concern is that RCTs often fail to provide unbiased estimates of population average treatment effects. We derive the assumptions for identifying population average treatment effects from RCTs. We provide a set of placebo tests, which formally follow from the identifying assumptions, that can assess whether the assumptions hold.
  • We offer new research designs for estimating population effects that use non-random studies (NRSs) to adjust the RCT data. One design does not require a selection on observables assumption. We apply our approach to a cost-effectiveness analysis of a controversial clinical intervention, Pulmonary Artery Catheterization (PAC).
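
One of the placebo tests can be paraphrased as a simple check: under the identifying assumptions, trial controls reweighted to the target population should look like the untreated individuals in the non-random study. A minimal sketch, with hypothetical inputs and a plain two-sample comparison standing in for the paper's more careful equivalence testing:

```python
import numpy as np
from scipy import stats

def placebo_check(y_rct_control, w, y_nrs_untreated):
    """Compare the reweighted RCT control-arm mean outcome with the mean
    outcome of untreated non-randomized individuals; a large discrepancy
    is evidence against the identifying assumptions."""
    w = np.asarray(w, dtype=float)
    y0 = np.asarray(y_rct_control, dtype=float)
    mu_rct = np.average(y0, weights=w)
    # Linearization standard error of the weighted (Hajek) mean.
    se_rct = np.sqrt(np.sum((w * (y0 - mu_rct)) ** 2)) / w.sum()

    y1 = np.asarray(y_nrs_untreated, dtype=float)
    mu_nrs, se_nrs = y1.mean(), y1.std(ddof=1) / np.sqrt(len(y1))

    z = (mu_rct - mu_nrs) / np.hypot(se_rct, se_nrs)
    return z, 2 * stats.norm.sf(abs(z))  # z statistic, two-sided p-value
```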

Hedges, L. V. (2018). Challenges in building usable knowledge in education. Journal of Research on Educational Effectiveness, 11(1), 1–21.

  • The scientific rigor of education research has improved dramatically since the year 2000. Much of the credit for this improvement is deserved by Institute of Education Sciences (IES) policies that helped create a demand for rigorous research; increased human capital capacity to carry out such work; provided funding for the work itself; and collected, evaluated, and made available the results of that work through the What Works Clearinghouse.
  • Major challenges still remain for education research, however. One challenge is dealing with the replication crisis that has plagued other scientific fields and is likely to be a problem in education science. A second challenge is better supporting the generalizability of education research. A third challenge is adapting our rigorous research designs to the increasing complexity of our interventions and our questions about the mechanisms by which these interventions achieve their effects. Promising approaches to meet each of these challenges are suggested.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125(1–2), 241–270.

  • We investigate the problem of predicting the average effect of a new training program using experiences with previous implementations. There are two principal complications in doing so. First, the population in which the new program will be implemented may differ from the population in which the old program was implemented. Second, the two programs may differ in the mix or nature of their components, or in their efficacy across different sub-populations. The first problem is similar to the problem of non-experimental evaluations.
  • The ability to adjust for population differences typically depends on the availability of characteristics of the two populations and the extent of overlap in their distributions. The ability to adjust for differences in the programs themselves may require more detailed data on the exact treatments received by individuals than are typically available. This problem has received less attention, although it is equally important for the prediction of the efficacy of new programs. To investigate the empirical importance of these issues, we compare four experimental Work INcentive demonstration programs implemented in the mid-1980s in different parts of the U.S.
  • We find that adjusting for pre-training earnings and individual characteristics removes many of the differences between control units that have some previous employment experience. Since the control treatment is the same in all locations, namely embargo from the program services, this suggests that differences in populations served can be adjusted for in this sub-population. We also find that adjusting for individual characteristics is more successful at removing differences between control group members in different locations that have some employment experience in the preceding four quarters than for control group members with no previous work experience. Perhaps more surprisingly, our ability to predict the outcomes of trainees after adjusting for individual characteristics is similar.
  • We surmise that differences in treatment components across training programs are not sufficiently large to lead to substantial differences in our ability to predict trainees’ post-training earnings for many of the locations in this study. However, in the sub-population with no previous work experience there is some evidence that unobserved heterogeneity leads to difficulties in our ability to predict outcomes across locations for controls. [ABSTRACT FROM AUTHOR]

Jaciw, A., & Newman, D. (2011). External Validity in the Context of RCTs: Lessons from the Causal Explanatory Tradition. Society for Research on Educational Effectiveness.

  • The purpose of the current work is to apply several main principles of the causal explanatory approach for establishing external validity to the experimental arena. By spanning the paradigm of the experimental approach and the school of program evaluation founded by Lee Cronbach and colleagues, the authors address the question of how research programs that involve experiments can be expanded to make external validity more of a priority. They bring to bear three central concerns of the causal explanatory approach on the activity of conducting randomized trials with a view to establishing external validity: (1) the role of interactions, (2) the need for ecologically relevant generalizations, and (3) the time-dependency of generalized causal inferences.

Joyce, K. E. (2019). The Key Role of Representativeness in Evidence-Based Education. Educational Research and Evaluation, 25(1–2), 43–62.

  • Within evidence-based education, results from randomised controlled trials (RCTs), and meta-analyses of them, are taken as reliable evidence for effectiveness – they speak to “what works”. Extending RCT results requires establishing that study samples and settings are representative of the intended target. Although widely recognised as important for drawing causal inferences from RCTs, claims regarding representativeness tend to be poorly evidenced. Strategies for demonstrating it typically involve comparing observable characteristics (e.g., race, gender, location) of study samples to those in the population of interest to decision makers.
  • This paper argues that these strategies provide insufficient evidence for establishing representativeness. Characteristics typically used for comparison are unlikely to be causally relevant to all educational interventions. Treating them as evidence that supports extending RCT results without providing evidence demonstrating their relevance undermines the inference. Determining what factors are causally relevant requires studying the causal mechanisms underlying the interventions in question.

Koepsell, T. D., Zatzick, D. F., & Rivara, F. P. (2011). Estimating the population impact of preventive interventions from randomized trials. American Journal of Preventive Medicine, 40(2), 191–198.

  • Growing concern about the limited generalizability of trials of preventive interventions has led to several proposals concerning the design, reporting, and interpretation of such trials.
  • This paper presents an epidemiologic framework that highlights three key determinants of population impact of many prevention programs: the proportion of the population at risk who would be candidates for a generic intervention in routine use, the proportion of those candidates who are actually intervened on through a specific program, and the reduction in incidence produced by that program among recipients. It then describes how the design of a prevention trial relates to estimating these quantities. Implications of the framework include the following: (1) reach is an attribute of a program, whereas external validity is an attribute of a trial, and the two should not be conflated; (2) specification of a defined target population at risk is essential in the long run and merits greater emphasis in the planning and interpretation of prevention trials; (3) with due attention to sampling frame and sampling method, the process of subject recruitment for a trial can yield key information about quantities that are important for assessing its potential population impact; and (4) exclusions during subject recruitment can be conceptually separated into intervention-driven, program-driven, and trial-design-driven exclusions, which have quite different implications for trial interpretation and for estimating population impact of the intervention studied.
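
Read multiplicatively, the three determinants suggest a back-of-the-envelope calculation of population impact (our notation, not a formula from the paper):

```latex
\text{events prevented} \;\approx\;
N_{\text{at risk}} \times \Pr(\text{candidate}) \times \Pr(\text{reached} \mid \text{candidate})
\times \big( I_{0} - I_{1} \big),
```

where I0 and I1 are the incidence among program recipients without and with the intervention.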

Kohler, U., Kreuter, F., & Stuart, E. A. (2019). Nonprobability sampling and causal analysis. Annual Review of Statistics and Its Application, 6, 149–172.

  • The long-standing approach of using probability samples in social science research has come under pressure through eroding survey response rates, advanced methodology, and easier access to large amounts of data. These factors, along with an increased awareness of the pitfalls of the nonequivalent comparison group design for the estimation of causal effects, have moved the attention of applied researchers away from issues of sampling and toward issues of identification. This article discusses the usability of samples with unknown selection probabilities for various research questions. In doing so, we review assumptions necessary for descriptive and causal inference and discuss research strategies developed to overcome sampling limitations.

Lesko, C. R., Ackerman, B., Webster-Clark, M., & Edwards, J. K. (2020). Target validity: Bringing treatment of external validity in line with internal validity. Current Epidemiology Reports, 7(3), 117–124.

  • PURPOSE OF REVIEW: “Target bias” is the difference between an estimate of association from a study sample and the causal effect in the target population of interest. It is the sum of internal and external bias. Given the extensive literature on internal validity, here, we review threats and methods to improve external validity.
  • RECENT FINDINGS: External bias may arise when the distribution of modifiers of the effect of treatment differs between the study sample and the target population. Methods including those based on modeling the outcome, modeling sample membership, and doubly robust methods are available, assuming data on the target population is available.
  • SUMMARY: The relevance of information for making policy decisions is dependent on both the actions that were studied and the sample in which they were evaluated. Combining methods for addressing internal and external validity can improve the policy relevance of study results.
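
The opening definition amounts to a one-line identity (our notation): with θ̂ the study estimate, θ_S the causal effect in the study sample, and θ_T the causal effect in the target population,

```latex
\underbrace{E[\hat{\theta}] - \theta_{T}}_{\text{target bias}}
=
\underbrace{\big(E[\hat{\theta}] - \theta_{S}\big)}_{\text{internal bias}}
+
\underbrace{\big(\theta_{S} - \theta_{T}\big)}_{\text{external bias}} .
```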

Lesko, C. R., Buchanan, A. L., Westreich, D., Edwards, J. K., Hudgens, M. G., & Cole, S. R. (2017). Generalizing study results: A potential outcomes perspective. Epidemiology, 28(4), 553–561.

  • Great care is taken in epidemiologic studies to ensure the internal validity of causal effect estimates; however, external validity has received considerably less attention. When the study sample is not a random sample of the target population, the sample average treatment effect, even if internally valid, cannot usually be expected to equal the average treatment effect in the target population. The utility of an effect estimate for planning purposes and decision making will depend on the degree of departure from the true causal effect in the target population due to problems with both internal and external validity.
  • Herein, we review concepts from recent literature on generalizability, one facet of external validity, using the potential outcomes framework. Identification conditions sufficient for external validity closely parallel identification conditions for internal validity, namely conditional exchangeability; positivity; the same distributions of the versions of treatment; no interference; and no measurement error. We also require correct model specification.
  • Under these conditions, we discuss how a version of direct standardization (the g-formula, adjustment formula, or transport formula) or inverse probability weighting can be used to generalize a causal effect from a study sample to a well-defined target population, and demonstrate their application in an illustrative example.

Olsen, R. B., Orr, L. L., Bell, S. H., & Stuart, E. A. (2013). External Validity in Policy Evaluations That Choose Sites Purposively. Journal of Policy Analysis and Management, 32(1), 107–121.

  • Evaluations of the impact of social programs are often carried out in multiple sites, such as school districts, housing authorities, local TANF offices, or One-Stop Career Centers. Most evaluations select sites purposively following a process that is nonrandom. Unfortunately, purposive site selection can produce a sample of sites that is not representative of the population of interest for the program. In this paper, we propose a conceptual model of purposive site selection.
  • We begin with the proposition that a purposive sample of sites can usefully be conceptualized as a random sample of sites from some well-defined population, for which the sampling probabilities are unknown and vary across sites. This proposition allows us to derive a formal, yet intuitive, mathematical expression for the bias in the pooled impact estimate when sites are selected purposively. This formula helps us to better understand the consequences of selecting sites purposively, and the factors that contribute to the bias.
  • Additional research is needed to obtain evidence on how large the bias tends to be in actual studies that select sites purposively, and to develop methods to increase the external validity of these studies.
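
Under the conceptual model just described, the structure of that expression is easy to see (a compact paraphrase in our notation, not the paper's full result): if site i has impact Δi and unknown selection probability pi, the pooled estimate converges to the p-weighted mean impact, so

```latex
\text{bias} \;=\; \frac{E[p\,\Delta]}{E[p]} - E[\Delta]
\;=\; \frac{\operatorname{Cov}(p, \Delta)}{E[p]} ,
```

which vanishes when selection probabilities are unrelated to site-level impacts and grows with their covariance.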

Pearl, J. (2015). Generalizing experimental findings. Journal of Causal Inference, 3(2), 259–266.

Pressler, T. R., & Kaizar, E. E. (2013). The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Statistics in Medicine, 32(20), 3552–3568.

  • While randomized controlled trials (RCTs) are considered the “gold standard” for clinical studies, the use of exclusion criteria may impact the external validity of the results. It is unknown whether estimators of effect size are biased by excluding a portion of the target population from enrollment.
  • We propose to use observational data to estimate the bias due to enrollment restrictions, which we term generalizability bias. In this paper we introduce a class of estimators for the generalizability bias and use simulation to study its properties in the presence of nonconstant treatment effects. We find the surprising result that our estimators can be unbiased for the true generalizability bias even when all potentially confounding variables are not measured. In addition, our proposed doubly robust estimator performs well even for mis-specified models.

Seamans, M. J., Hong, H., Ackerman, B., Schmid, I., & Stuart, E. A. (2021). Generalizability of subgroup effects. Epidemiology, 32(3), 389–392.

  • Generalizability methods are increasingly used to make inferences about the effect of interventions in target populations using a study sample. Most existing methods to generalize effects from sample to population rely on the assumption that subgroup-specific effects generalize directly. However, researchers may be concerned that in fact subgroup-specific effects differ between sample and population.
  • In this brief report, we explore the generalizability of subgroup effects. First, we derive the bias in the sample average treatment effect estimator as an estimate of the population average treatment effect when subgroup effects in the sample do not directly generalize. Next, we present a Monte Carlo simulation to explore bias due to unmeasured heterogeneity of subgroup effects across sample and population. Finally, we examine the potential for bias in an illustrative data example. Understanding the generalizability of subgroup effects may lead to increased use of these methods for making externally valid inferences of treatment effects using a study sample.
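
The source of the bias is transparent in our notation: with subgroups g having target-population shares π_g, population effects τ_g^P, and sample effects τ_g^S, a subgroup-standardized generalization targets Σ_g π_g τ_g^S, so its bias for the population average treatment effect is

```latex
\sum_{g} \pi_{g}\,\big(\tau_{g}^{S} - \tau_{g}^{P}\big),
```

which is zero exactly when subgroup-specific effects generalize directly from sample to population.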

Stuart, E. A. (2017). Generalizability of clinical trials results. In Methods in Comparative Effectiveness Research (pp. 178–199). CRC Press, Taylor & Francis Group.

  • Randomized trials are seen as the gold standard for estimating the effects of interventions because, when implemented well, they provide unbiased estimates of treatment effects in the sample at hand. However, recent years have seen an increased understanding of their limitations in providing evidence that is more broadly applicable and relevant for real-world practice, known as “generalizability.” A lack of generalizability may be a particular problem for comparative effectiveness research (CER), which aims to help clinicians and policymakers make informed decisions for individuals and populations.
  • This chapter outlines recent advances in methods to assess and enhance the generalizability of randomized trials in CER, discussing both design and analysis strategies. A case study is provided of a weighting approach that reweights the trial sample to reflect the target population with respect to effect modifiers. Recommendations for further research and the practical use of the methods discussed are also provided.

Stuart, E. A., Ackerman, B., & Westreich, D. (2018). Generalizability of randomized trial results to target populations: Design and analysis possibilities. Research on Social Work Practice, 28(5), 532–537.

  • Randomized trials play an important role in estimating the effect of a policy or social work program in a given population. While most trial designs benefit from strong internal validity, they often lack external validity, or generalizability, to the target population of interest. In other words, one can obtain an unbiased estimate of the study sample average treatment effect from a randomized trial; however, this estimate may not equal the target population average treatment effect if the study sample is not fully representative of the target population.
  • This article provides an overview of existing strategies to assess and improve upon the generalizability of randomized trials, both through statistical methods and study design, as well as recommendations on how to implement these ideas in social work research.

Stuart, E. A., Bradshaw, C. P., & Leaf, P. J. (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science, 16(3), 475–485.

  • Recent years have seen increasing interest in and attention to evidence-based practices, where the “evidence” generally comes from well-conducted randomized trials. However, while those trials yield accurate estimates of the effect of the intervention for the participants in the trial (known as “internal validity”), they do not always yield relevant information about the effects in a particular target population (known as “external validity”). This may be due to a lack of specification of a target population when designing the trial, difficulties recruiting a sample that is representative of a prespecified target population, or to interest in considering a target population somewhat different from the population directly targeted by the trial.
  • This paper first provides an overview of existing design and analysis methods for assessing and enhancing the ability of a randomized trial to estimate treatment effects in a target population. It then provides a case study using one particular method, which weights the subjects in a randomized trial to match the population on a set of observed characteristics. The case study uses data from a randomized trial of school-wide positive behavioral interventions and supports (PBIS); our interest is in generalizing the results to the state of Maryland. In the case of PBIS, after weighting, estimated effects in the target population were similar to those observed in the randomized trial.
  • The paper illustrates that statistical methods can be used to assess and enhance the external validity of randomized trials, making the results more applicable to policy and clinical questions. However, there are also many open research questions; future research should focus on questions of treatment effect heterogeneity and further developing these methods for enhancing external validity. Researchers should think carefully about the external validity of randomized trials and be cautious about extrapolating results to specific populations unless they are confident of the similarity between the trial sample and that target population.

Tipton, E. (2014). How generalizable is your experiment? Comparing a sample and population through a generalizability index. Journal of Educational and Behavioral Statistics, 39(6), 478–501.

  • Although a large-scale experiment can provide an estimate of the average causal impact for a program, the sample of sites included in the experiment is often not drawn randomly from the inference population of interest. In this article, we provide a generalizability index that can be used to assess the degree of similarity between the sample of units in an experiment and one or more inference populations on a set of selected covariates. The index takes values between 0 and 1 and indicates both when a sample is like a miniature of the population and how well reweighting methods may perform when differences exist. Results of simulation studies are provided that develop rules of thumb for interpretation as well as an example.
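
The index compares the distribution of an estimated sampling propensity score in the sample and in the inference population. A minimal Python sketch in the spirit of that construction (the binning and modeling choices here are ours, not the paper's exact procedure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def generalizability_index(X_sample, X_pop, n_bins=20):
    """Similarity index in [0, 1] between an experimental sample and an
    inference population, in the spirit of Tipton (2014): the
    Bhattacharyya coefficient of the binned distributions of an
    estimated sampling propensity score."""
    X = np.vstack([X_sample, X_pop])
    s = np.concatenate([np.ones(len(X_sample)), np.zeros(len(X_pop))])
    ps = LogisticRegression().fit(X, s).predict_proba(X)[:, 1]

    edges = np.linspace(ps.min(), ps.max(), n_bins + 1)
    f_s, _ = np.histogram(ps[s == 1], bins=edges)
    f_p, _ = np.histogram(ps[s == 0], bins=edges)
    f_s = f_s / f_s.sum()  # bin proportions, experimental sample
    f_p = f_p / f_p.sum()  # bin proportions, inference population
    # 1 when the binned distributions coincide (sample is a "miniature"
    # of the population); near 0 when they barely overlap.
    return float(np.sum(np.sqrt(f_s * f_p)))
```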

Tipton, E., & Olsen, R. B. (2018). A review of statistical methods for generalizing from evaluations of educational interventions. Educational Researcher, 47(8), 516–524.

  • School-based evaluations of interventions are increasingly common in education research. Ideally, the results of these evaluations are used to make evidence-based policy decisions for students. However, it is difficult to make generalizations from these evaluations because the types of schools included in the studies are typically not selected randomly from a target population.
  • This paper provides an overview of statistical methods for improving generalizations from intervention research in education. These are presented as a series of steps aimed at improving research design—particularly recruitment—as well as methods for assessing and summarizing generalizability and estimating treatment impacts for clearly defined target populations.

Tipton, E., & Olsen, R. B. (2022). Enhancing the Generalizability of Impact Studies in Education. Toolkit. NCEE 2022-003. National Center for Education Evaluation and Regional Assistance.

Westreich, D., Edwards, J. K., Lesko, C. R., Cole, S. R., & Stuart, E. A. (2019). Target validity and the hierarchy of study designs. American Journal of Epidemiology, 188(2), 438–443.

  • In recent years, increasing attention has been paid to problems of external validity, specifically to methodological approaches for both quantitative generalizability and transportability of study results. However, most approaches to these issues have considered external validity separately from internal validity.
  • Here we argue that considering either internal or external validity in isolation may be problematic. Further, we argue that a joint measure of the validity of an effect estimate with respect to a specific population of interest may be more useful: We call this proposed measure target validity.
  • In this work, we introduce and formally define target bias as the total difference between the true causal effect in the target population and the estimated causal effect in the study sample, and target validity as target bias = 0. We illustrate this measure with a series of examples and show how this measure may help us to think more clearly about comparisons between experimental and nonexperimental research results. Specifically, we show that even perfect internal validity does not ensure that a causal effect will be unbiased in a specific target population.

Degtiar, I., & Rose, S. (2022). A review of generalizability and transportability. Annual Review of Statistics and Its Application.

  • When assessing causal effects, determining the target population to which the results are intended to generalize is a critical decision. Randomized and observational studies each have strengths and limitations for estimating causal effects in a target population.
  • Estimates from randomized data may have internal validity but are often not representative of the target population. Observational data may better reflect the target population, and hence be more likely to have external validity, but are subject to potential bias due to unmeasured confounding.
  • While much of the causal inference literature has focused on addressing internal validity bias, both internal and external validity are necessary for unbiased estimates in a target population.
  • This article presents a framework for addressing external validity bias, including a synthesis of approaches for generalizability and transportability, and the assumptions they require, as well as tests for the heterogeneity of treatment effects and differences between study and target populations.