Empirical Evidence on Generalizability

Allcott, H. (2015). Site selection bias in program evaluation. The Quarterly Journal of Economics, 130(3), 1117–1165.

“Site selection bias” can occur when the probability that a program is adopted or evaluated is correlated with its impacts.
I test for site selection bias in the context of the Opower energy conservation programs, using 111 randomized control trials involving 8.6 million households across the United States. Predictions based on rich microdata from the first 10 replications substantially overstate efficacy in the next 101 sites. Several mechanisms caused this positive selection. For example, utilities in more environmentalist areas are more likely to adopt the program, and their customers are more responsive to the treatment. Also, because utilities initially target treatment at higher-usage consumer subpopulations, efficacy drops as the program is later expanded.
The results illustrate how program evaluations can still give systematically biased out-of-sample predictions, even after many replications.
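
The mechanism described in this abstract can be illustrated with a toy simulation. This is a minimal sketch and not Allcott's method or data: the effect distribution, the noise level, and the function name `simulate_site_selection` are all assumptions chosen for illustration. It only borrows the 111-site and 10-early-site counts from the abstract. Sites whose (hypothetical) true effect is larger tend to adopt earlier, so the average over the first sites overstates the average over the rest.

```python
import random


def simulate_site_selection(n_sites=111, n_early=10, seed=0):
    """Return (mean effect in early sites, mean effect in later sites).

    All numbers are hypothetical; only the site counts echo the abstract.
    """
    rng = random.Random(seed)
    # Assumed true treatment effect at each site.
    effects = [rng.gauss(1.0, 0.5) for _ in range(n_sites)]
    # Adoption order: sites with larger effects tend to adopt earlier,
    # but noise keeps the ordering from being perfectly sorted by effect.
    order = sorted(range(n_sites), key=lambda i: -(effects[i] + rng.gauss(0.0, 0.5)))
    early = [effects[i] for i in order[:n_early]]
    later = [effects[i] for i in order[n_early:]]
    return sum(early) / len(early), sum(later) / len(later)


early_mean, later_mean = simulate_site_selection()
print(f"early sites: {early_mean:.2f}, later sites: {later_mean:.2f}")
```

Because adoption order is correlated with the effect, a prediction built from the early sites alone would be too optimistic for the later ones, which is the pattern the abstract reports.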

Bell, S. H., Olsen, R. B., Orr, L. L., & Stuart, E. A. (2016). Estimates of external validity bias when impact evaluations select sites nonrandomly. Educational Evaluation and Policy Analysis, 38(2), 318–335.

Evaluations of educational programs or interventions are typically conducted in nonrandomly selected samples of schools or districts. Recent research has shown that nonrandom site selection can yield biased impact estimates.
To estimate the external validity bias from nonrandom site selection, we combine lists of school districts that were selected nonrandomly for 11 educational impact studies with population data on student outcomes from the Reading First program.
Our analysis finds that on average, if an impact study of Reading First were conducted in the districts from these 11 studies, the impact estimate would be biased downward. In particular, it would be 0.10 standard deviations lower than the impact in the broader population from which the samples were selected, a substantial bias based on several benchmarks of comparison.

Biener, L., DePue, J. D., Emmons, K. M., Linnan, L., & Abrams, D. B. (1994). Recruitment of work sites to a health promotion research trial. Implications for generalizability. Journal of Occupational Medicine: Official Publication of the Industrial Medical Association, 36(6), 631–636.

The characteristics of companies that either accepted or declined participation in a 5-year randomized trial of a multirisk factor health promotion intervention were compared to investigate potential limitations on the generalizability of research findings.
A representative sample of 151 manufacturing work sites in the northeast was recruited to participate. Sixty-four of the companies were determined to be eligible and 10 others, which refused to have an administrator interviewed, were presumed to be eligible. Of this group, 27 companies agreed to participate.
Work force demographics, shift structure, and prior history of health promotion offerings were not significantly different in the two groups. However, participating companies employed fewer workers and had a more favorable financial outlook than did companies that declined to participate. Implications of these findings for research on work site health promotion are discussed.

Blanco, C., Hoertel, N., Franco, S., Olfson, M., He, J.-P., López, S., González-Pinto, A., Limosin, F., & Merikangas, K. R. (2017). Generalizability of Clinical Trial Results for Adolescent Major Depressive Disorder. Pediatrics, 140(6).

BACKGROUND: Although there have been a number of clinical trials evaluating treatments for adolescents with major depressive disorder (MDD), the generalizability of those trials to samples of depressed adolescents who present for routine clinical care is unknown. Examining the generalizability of clinical trials of pharmacological and psychotherapy interventions for adolescent depression can help administrators and frontline practitioners determine the relevance of these studies for their patients and may also guide eligibility criteria for future clinical trials in this clinical population.
METHODS: Data on nationally representative adolescents were derived from the National Comorbidity Survey: Adolescent Supplement. To assess the generalizability of adolescent clinical trials for MDD, we applied a standard set of eligibility criteria representative of clinical trials to all adolescents in the National Comorbidity Survey: Adolescent Supplement with a Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition diagnosis of MDD (N = 592).
RESULTS: From the overall MDD sample, 61.9% would have been excluded from a typical pharmacological trial, whereas 42.2% would have been excluded from a psychotherapy trial. Among those who sought treatment (n = 412), the corresponding exclusion rates were 72.7% for a pharmacological trial and 52.2% for a psychotherapy trial. The criterion leading to the largest number of exclusions was “significant risk of suicide” in both pharmacological and psychotherapy trials.
CONCLUSIONS: Pharmacological and, to a lesser extent, psychotherapy clinical trials likely exclude most adolescents with MDD. Careful consideration should be given to balancing eligibility criteria and internal validity with applicability in routine clinical care while ensuring patient safety.
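
The approach in the METHODS paragraph, applying a set of trial eligibility criteria to a survey sample and counting who would be excluded, can be sketched as follows. This is an illustrative sketch only: the `Respondent` fields, the two criteria, and the toy data are hypothetical and are not the criteria or data used by Blanco et al. A real analysis of a nationally representative survey would typically also apply survey weights, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Respondent:
    # Hypothetical fields; real trial criteria are far more detailed.
    suicide_risk: bool
    substance_use_disorder: bool
    sought_treatment: bool


Criterion = Callable[[Respondent], bool]


def exclusion_rate(sample: Iterable[Respondent], criteria: List[Criterion]) -> float:
    """Share of the sample that meets at least one exclusion criterion (unweighted)."""
    people = list(sample)
    if not people:
        raise ValueError("sample is empty")
    excluded = sum(1 for r in people if any(c(r) for c in criteria))
    return excluded / len(people)


# Assumed exclusion criteria for a hypothetical trial.
CRITERIA: List[Criterion] = [
    lambda r: r.suicide_risk,
    lambda r: r.substance_use_disorder,
]

toy_sample = [
    Respondent(True, False, True),
    Respondent(False, True, True),
    Respondent(False, False, True),
    Respondent(False, False, False),
    Respondent(False, False, False),
]

overall = exclusion_rate(toy_sample, CRITERIA)
treated = exclusion_rate([r for r in toy_sample if r.sought_treatment], CRITERIA)
print(f"overall: {overall:.1%}, treatment seekers: {treated:.1%}")
```

Computing the rate separately for the treatment-seeking subgroup mirrors the comparison in the RESULTS paragraph between the overall sample and those who sought treatment.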

Braslow, J. T., Duan, N., Starks, S. L., Polo, A., Bromley, E., & Wells, K. B. (2005). Generalizability of studies on mental health treatment and outcomes, 1981 to 1996. Psychiatric Services (Washington, D.C.), 56(10), 1261–1268.

OBJECTIVE: This study operationalized and measured the external validity, or generalizability, of studies on mental health treatment and outcomes published in four journals between 1981 and 1996.
METHOD: MEDLINE was searched for articles on mental health treatment and outcomes that were published in four leading psychiatry and psychology journals between 1981 and 1996. A 156-item instrument was used to assess generalizability of study findings.
RESULTS: Of more than 9,000 citations, 414 eligible studies were identified. Inclusion of community sites and patients from racial or ethnic minority groups were documented in only 12 and 25 percent of studies, respectively. Random or systematic sampling methods were rare (3 percent), and 75 percent of studies did not explicitly address sample representativeness. Studies with funding from the National Institute of Mental Health (NIMH) were more likely than those without NIMH funding to document the inclusion of patients from minority groups (30 percent compared with 20 percent). Randomized studies were more likely than nonrandomized studies to document the inclusion of patients from minority groups (28 percent compared with 17 percent), include patients with comorbid psychiatric conditions (31 percent compared with 19 percent), and attend to sample representativeness (28 percent compared with 15 percent). Modest improvements were seen over time in inclusion of patients from minority groups, inclusion of patients with psychiatric comorbidities, and attention to sample representativeness.
CONCLUSIONS: Generalizability of studies on treatments and outcomes, whether experimental or observational, remained low and poorly documented over the 16-year period.

Canevelli, M., Trebbastoni, A., Quarata, F., D’Antonio, F., Cesari, M., de Lena, C., & Bruno, G. (2017). External Validity of Randomized Controlled Trials on Alzheimer’s Disease: The Biases of Frailty and Biological Aging. Frontiers in Neurology, 8, 628.

To date, the external validity of randomized controlled trials (RCTs) on Alzheimer’s disease (AD) has been assessed only considering monodimensional variables. Nevertheless, looking at isolated and single characteristics cannot guarantee a sufficient level of appreciation of the AD patients’ complexity. The only way to understand whether the two worlds (i.e., research and clinics) deal with the same type of patients is to adopt multidimensional approaches more holistically reflecting the biological age of the individual.
In the present study, we compared measures of frailty/biological aging [assessed by a Frailty Index (FI)] of a sample of patients with AD resulted eligible and subsequently included in phase III RCTs compared to patients referring to the same clinical service, but not considered for inclusion.
The “RCT sample” and the “real world sample” were found to be statistically similar for all the considered sociodemographic and clinical variables. Nevertheless, the “real world sample” was found to be significantly frailer compared to the “RCT sample,” as indicated by higher FI scores [0.28 (SD 0.1) vs. 0.17 (SD 0.1); p < 0.001, respectively]. Moreover, when assessing the relationship between FI and age, we found that the correlation was almost null in the “RCT sample” (Spearman’s r = 0.01; p = 0.98), while it was statistically significant in the “real world sample” (r = 0.49; p = 0.02). The application of too rigid designs may result in the poor representativeness of RCT samples. It may even imply the study of a condition biologically different from that observed in the “real world.”
The adoption of multidimensional measures capable to capture the individual’s biological age may facilitate evaluating the external validity of clinical studies, implicitly improving the interpretation of the results and their translation in the clinical arena.

Chaitoff, A., Niforatos, J. D., Gong, J., & Fischer, M. A. (2022). A Comparison of Individuals with Diabetes and EMPA-REG Trial Participants: Exploring Aspects of External Validity. Journal of General Internal Medicine.

BACKGROUND: There is increasing use of sodium glucose co-transporter 2 (SGLT2) inhibitors to treat diabetes. Since trials apply specific entry and exclusion criteria to ensure internal validity, comparisons of trial populations with nationally representative samples can inform the applicability of study findings to practice.
OBJECTIVE: To compare individuals with diabetes from a nationally representative sample to patients who underwent randomization in the EMPA-REG trial. A secondary aim was to characterize what proportion of individuals prescribed an SGLT2 inhibitor in a nationally representative sample would have been included in the EMPA-REG trial.
DESIGN: Retrospective cross-sectional study.
PARTICIPANTS: Adults with diabetes who took part in the National Health and Nutrition Examination Survey (NHANES) between 2011-2014 (primary analysis corresponding to EMPA-REG enrollment) and 2015-2018 (secondary analysis corresponding to contemporary sample).
MAIN MEASURES: The primary outcome was a comparison of demographic (age, sex, ethnicity, and pregnancy status), clinical (comorbidities and medication use), examination (weight, body mass index, and systolic and diastolic blood pressure), and laboratory (HbA1c, low- and high-density lipoprotein cholesterol, triglycerides, and estimated glomerular filtration rate) characteristics of NHANES respondents versus EMPA-REG trial participants. The secondary outcome was the proportion of NHANES respondents who had been prescribed an SGLT2 inhibitor that would have met inclusion criteria for the EMPA-REG trial.
KEY RESULTS: There were 655 and 48 respondents, representing a weighted sample of 21,849,775 and 1,062,573 individuals, included in the primary and secondary analyses, respectively. Overall, 7.6% (95% CI 4.8-10.6%) of 2011-2014 NHANES respondents would have met all EMPA-REG trial inclusion criteria. NHANES respondents and EMPA-REG participants differed across demographic, clinical, examination, and laboratory domains. Of NHANES respondents from 2015 to 2018 who were prescribed an SGLT2 inhibitor, 10.6% (95% CI <1-24.7%) would have met all inclusion criteria for the EMPA-REG trial.
CONCLUSIONS: The EMPA-REG population differed from a nationally representative sample, which could affect generalizability.

Coppock, A., Leeper, T. J., & Mullinix, K. J. (2018). Generalizability of heterogeneous treatment effect estimates across samples. Proceedings of the National Academy of Sciences of the United States of America, 115(49), 12441–12446.

The extent to which survey experiments conducted with non-representative convenience samples are generalizable to target populations depends critically on the degree of treatment effect heterogeneity.
Recent inquiries have found a strong correspondence between sample average treatment effects estimated in nationally representative experiments and in replication studies conducted with convenience samples. We consider here two possible explanations: low levels of effect heterogeneity or high levels of effect heterogeneity that are unrelated to selection into the convenience sample.
We analyze subgroup conditional average treatment effects using 27 original-replication study pairs (encompassing 101,745 individual survey responses) to assess the extent to which subgroup effect estimates generalize. While there are exceptions, the overwhelming pattern that emerges is one of treatment effect homogeneity, providing a partial explanation for strong correspondence across both unconditional and conditional average treatment effect estimates.

Daitch, V., Paul, M., Daikos, G. L., Durante-Mangoni, E., Yahav, D., Carmeli, Y., Benattar, Y. D., Skiada, A., Andini, R., Eliakim-Raz, N., Nutman, A., Zusman, O., Antoniadou, A., Cavezza, G., Adler, A., Dickstein, Y., Pavleas, I., Zampino, R., Bitterman, R., & Zayyad, H. (2021). Excluded versus included patients in a randomized controlled trial of infections caused by carbapenem-resistant Gram-negative bacteria: Relevance to external validity. BMC Infectious Diseases, 21(1), 1–9.

Background: Population external validity is the extent to which an experimental study results can be generalized from a specific sample to a defined population. In order to apply the results of a study, we should be able to assess its population external validity. We performed an investigator-initiated randomized controlled trial (RCT) (AIDA study), which compared colistin-meropenem combination therapy to colistin monotherapy in the treatment of patients infected with carbapenem-resistant Gram-negative bacteria. In order to examine the study’s population external validity and to substantiate the use of AIDA study results in clinical practice, we performed a concomitant observational trial.
Methods: The study was conducted between October 1st, 2013 and January 31st, 2017 (during the RCTs recruitment period) in Greece, Israel and Italy. Patients included in the observational arm of the study have fulfilled clinical and microbiological inclusion criteria but were excluded from the RCT due to receipt of colistin for > 96 h, refusal to participate, or prior inclusion in the RCT. Non-randomized cases were compared to randomized patients. The primary outcome was clinical failure at 14 days of infection onset.
Results: Analysis included 701 patients. Patients were infected mainly with Acinetobacter baumannii [78.2% (⁵⁴⁸⁄₇₀₁)]. The most common reason for exclusion was refusal to participate [62% (¹⁸³⁄₂₉₅)]. Non-randomized and randomized patients were similar in most of the demographic and background parameters, though randomized patients showed minor differences towards a more severe infection. Combination therapy was less common in non-randomized patients [31.9% (⁵³⁄₁₆₆) vs. 51.2% (²⁰⁸⁄₄₀₆), p = 0.000]. Randomized patients received longer treatment of colistin [13 days (IQR 10-16) vs. 8.5 days (IQR 0-15), p = 0.000]. Univariate analysis showed that non-randomized patients were more inclined to clinical failure on day 14 from infection onset [82% (²⁴²⁄₂₉₅) vs. 75.5% (³⁰⁷⁄₄₀₆), p = 0.042]. After adjusting for other variables, non-inclusion was not an independent risk factor for clinical failure at day 14.
Conclusion: The similarity between the observational arm and RCT patients has strengthened our confidence in the population external validity of the AIDA trial. Adding an observational arm to intervention studies can help increase the population external validity and improve implementation of study results in clinical practice. Trial Registration: The trial was registered with ClinicalTrials.gov, number NCT01732250 on November 22, 2012.

Franzone, A., Heg, D., Räber, L., Valgimigli, M., Piccolo, R., Zanchin, T., Yamaji, K., Stortecky, S., Blöchlinger, S., Hunziker, L., Praz, F., Jüni, P., Windecker, S., & Pilgrim, T. (2016). External validity of the “all-comers” design: Insights from the BIOSCIENCE trial. Clinical Research in Cardiology: Official Journal of the German Cardiac Society, 105(9), 744–754.

OBJECTIVES: We sought to systematically evaluate the external validity of a contemporary randomized controlled stent trial (BIOSCIENCE).
METHODS: Baseline characteristics and clinical outcomes of patients enrolled into the BIOSCIENCE trial at Bern University Hospital (n = 1216) were compared to those of patients included in the CARDIOBASE Bern PCI Registry at the same institution (n = 1045). The primary study endpoint was the rate of target lesion failure (TLF), defined as a composite of cardiac death, target vessel-myocardial infarction (MI) or target lesion revascularization (TLR), at 1 year.
RESULTS: Women were underrepresented in the RCT compared to the registry (25 vs. 29.4 %, p = 0.020). Non-participants were older compared to study participants (69.2 ± 12.4 vs. 67.0 ± 11.6, p < 0.001), and had a higher prevalence of previous cerebrovascular events (10.8 vs. 5.2 %, p < 0.001), and chronic renal failure (35.5 vs. 15.6 %, p < 0.001). ST-segment elevation myocardial infarction (STEMI) and Killip class IV at presentation were more common among non-participants than participants (30.7 vs. 21.1 %, p < 0.001 and 7.8 vs. 0.4 %, p < 0.001, respectively). At 1 year, non-participants experienced a significantly higher rate of TLF, (15.0 vs. 6.5 %, p < 0.001), and patient-oriented composite endpoint (POCE), including death, MI or any repeat revascularization (21.6 vs. 11.2 %, p < 0.001). There was a significant interaction between POCE and presence or absence of an acute coronary syndrome in participants versus non-participants, respectively (p = 0.009).
CONCLUSIONS: Non-participants of this all-comers trial had a higher risk profile and adverse prognosis compared to study participants. Further efforts are needed to improve the external validity of contemporary RCTs.

Gheorghe, A., Roberts, T., Hemming, K., & Calvert, M. (2015). Evaluating the Generalisability of Trial Results: Introducing a Centre- and Trial-Level Generalisability Index. PharmacoEconomics, 33(11), 1195–1214.

BACKGROUND: Few randomised controlled trials (RCTs) recruit centres representatively, which may limit the external validity of trial results.
OBJECTIVE: The aim of this study was to propose a proof-of-concept method of assessing the generalisability of the clinical and cost-effectiveness findings of a given RCT.
METHODS: We developed a generalisability index (Gix), informed by centre-level characteristics, as a measure of centre and trial representativeness. The centre-level Gix quantifies how representative a centre is in relation to its jurisdiction, e.g. a country or health authority. The trial-level Gix quantifies how representative trial recruitment is in relation to clinical practice in the jurisdiction. Taking a real-world RCT as a case study and assuming trial-wide results to represent “true jurisdiction values”, we used simulation methods to recreate 5000 RCTs and investigate the relationship between trial representativeness, reflected by the standardised trial-Gix, and the deviation of simulated trial results from the “true values”.
RESULTS: The simulation study provides evidence that trial results (odds ratio for the primary outcome and incremental quality-adjusted life-years) were influenced by the representativeness of the sample of recruiting centres. Simulated RCTs with the closest results to the “true values” were those whose recruitment closely mirrored the jurisdiction-wide context. Results appeared robust to six alternative specifications of the Gix.
CONCLUSIONS: Our findings suggest that an unrepresentative selection of centres limits the external validity of trial results. The Gix may be a valuable tool to help facilitate rational selection of trial centres and ensure the generalisability of results at the jurisdiction level.

Gianattasio, K. Z., Bennett, E. E., Wei, J., Mehrotra, M. L., Mosley, T., Gottesman, R. F., Wong, D. F., Stuart, E. A., Griswold, M. E., Couper, D., Glymour, M. M., & Power, M. C. (2021). Generalizability of findings from a clinical sample to a community-based sample: A comparison of ADNI and ARIC. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association, 17(8), 1265–1276.

INTRODUCTION: Clinic-based study samples, including the Alzheimer’s Disease Neuroimaging Initiative (ADNI), offer rich data, but findings may not generalize to community-based settings. We compared associations in ADNI to those in the Atherosclerosis Risk in Communities (ARIC) study to assess generalizability across the two settings.
METHODS: We estimated cohort-specific associations among risk factors, cognitive test scores, and neuroimaging outcomes to identify and quantify the extent of significant and substantively meaningful differences in associations between cohorts. We explored whether using more homogenous samples improved comparability in effect estimates.
RESULTS: The proportion of associations that differed significantly between cohorts ranged from 27% to 34% across sample subsets. Many differences were substantively meaningful (e.g., odds ratios [OR] for apolipoprotein E ε4 on amyloid positivity in ARIC: OR = 2.8, in ADNI: OR = 8.6).
DISCUSSION: A higher proportion of associations differed significantly and substantively than would be expected by chance. Findings in clinical samples should be confirmed in more representative samples.

Greenhouse, J. B., Kaizar, E. E., Kelleher, K., Seltman, H., & Gardner, W. (2008). Generalizing from clinical trial data: A case study. The risk of suicidality among pediatric antidepressant users. Statistics in Medicine, 27(11), 1801–1813.

For the results of randomized controlled clinical trials (RCTs) and related meta-analyses to be useful in practice, they must be relevant to a definable group of patients in a particular clinical setting. To the extent this is so, we say that the trial is generalizable or externally valid. Although concern about the generalizability of the results of RCTs is often discussed, there are few examples of methods for assessing the generalizability of clinical trial data.
In this paper, we describe and illustrate an approach for making what we call generalizability judgments and illustrate the approach in the context of a case study of the risk of suicidality among pediatric antidepressant users.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125(1–2), 241–270.

We investigate the problem of predicting the average effect of a new training program using experiences with previous implementations. There are two principal complications in doing so. First, the population in which the new program will be implemented may differ from the population in which the old program was implemented. Second, the two programs may differ in the mix or nature of their components, or in their efficacy across different sub-populations. The first problem is similar to the problem of non-experimental evaluations.
The ability to adjust for population differences typically depends on the availability of characteristics of the two populations and the extent of overlap in their distributions. The ability to adjust for differences in the programs themselves may require more detailed data on the exact treatments received by individuals than are typically available. This problem has received less attention, although it is equally important for the prediction of the efficacy of new programs.
To investigate the empirical importance of these issues, we compare four experimental Work INcentive demonstration programs implemented in the mid-1980s in different parts of the U.S. We find that adjusting for pre-training earnings and individual characteristics removes many of the differences between control units that have some previous employment experience. Since the control treatment is the same in all locations, namely embargo from the program services, this suggests that differences in populations served can be adjusted for in this sub-population. We also find that adjusting for individual characteristics is more successful at removing differences between control group members in different locations that have some employment experience in the preceding four quarters than for control group members with no previous work experience. Perhaps more surprisingly, our ability to predict the outcomes of trainees after adjusting for individual characteristics is similar. We surmise that differences in treatment components across training programs are not sufficiently large to lead to substantial differences in our ability to predict trainees’ post-training earnings for many of the locations in this study. However, in the sub-population with no previous work experience there is some evidence that unobserved heterogeneity leads to difficulties in our ability to predict outcomes across locations for controls.

Howard-Pitney, B., Fortmann, S. P., & Killen, J. D. (2001). Generalizability of findings from a chewing tobacco cessation clinical trial. Nicotine & Tobacco Research: Official Journal of the Society for Research on Nicotine and Tobacco, 3(4), 347–352.

This study examined selection bias by comparing characteristics of a general population sample of tobacco chewers, participants in a chewing tobacco cessation trial, and non-participants in the trial. A population-based sample of chewers (n = 155) was surveyed by telephone to assess demographics, tobacco-use patterns, and quitting history. Six months later, chewers from this same population were recruited for a cessation trial (n = 401 participants and 68 non-participants). Trial participants differed little from general population chewers on demographics, but they used more chew and were more dependent on nicotine. They were more likely to have tried to quit, received advice to quit and experienced tobacco-related health problems. Trial non-participants were virtually identical to participants on demographic and tobacco use measures.
The findings suggest that clinically tested treatments are generalizable beyond the research setting, because trial participants are demographically representative of the general population of chewing tobacco users, are not biased toward light users, and are representative of those chewers most likely to seek out community-based cessation services outside the trial context.

Hsu, S., Rosen, K. J., Cupertino, A., Temple, L., & Fleming, F. (2022). Generalizability of Randomized Controlled Trials in Rectal Cancer. Journal of Gastrointestinal Surgery, 26(2), 453–465.

Background: The generalizability of outcomes from randomized controlled trials (RCTs) in oncology is a frequent concern. Given the prevalence and multidisciplinary management of rectal cancer, understanding the generalizability of rectal cancer RCTs is critical to surgical oncologists.
Methods: An exhaustive literature review identified 100 non-metastatic rectal cancer RCTs published in English over the past 10 years investigating surgery, chemotherapy, or radiotherapy. In order to evaluate the representativeness of these RCTs compared to the USA and each continent’s rectal cancer populations, demographic characteristics were stratified by surgical versus chemoradiotherapy (CRT) trial and by continent then compared with the National Cancer Database and CANCER TODAY using chi-squared and Welch’s t-tests.
Results: Of the 100 trials identified, 65% enrolled significantly younger patients, and 38% enrolled a significantly greater proportion of males than the US rectal cancer population. These demographic differences were more prominent among CRT trials than surgical trials. Half of all trials enrolled patients who were on average more than 7 years younger and enrolled a 5% greater proportion of males than their respective continental rectal cancer populations. Patients enrolled in trials had more advanced cancers than their corresponding continental populations. Sociodemographic data was rarely reported.
Conclusion: Patients enrolled in trials were younger, predominantly male, and had advanced stage cancer when compared to the rectal cancer population. Sociodemographic variables are underreported, further limiting equal participation in clinical trials. Future rectal cancer RCTs should strive to recruit representative samples. To enhance recruitment of women and underrepresented minorities, tailored recruitment strategies must be implemented.

Humphreys, K., & Weisner, C. (2000). Use of exclusion criteria in selecting research subjects and its effect on the generalizability of alcohol treatment outcome studies. American Journal of Psychiatry, 157(4), 588–594.

OBJECTIVE: Researchers have not systematically examined how exclusion criteria used in selection of research subjects affect the generalizability of treatment outcome research. This study evaluated the use of exclusion criteria in alcohol treatment outcome research and its effects on the comparability of research subjects with real-world individuals seeking alcohol treatment.
METHOD: Eight of the most common exclusion criteria described in the alcohol treatment research literature were operationalized and applied to large, representative clinical patient samples from the public and private sectors to determine whether the hypothetical research samples differed substantially from real-world samples. Five hundred ninety-three consecutive individuals seeking alcohol treatment at one of eight treatment programs participated. A trained research technician gathered information from participants on demographic variables and on alcohol, drug, and psychiatric problems as measured by the Addiction Severity Index.
RESULTS: Large proportions of potential research subjects were excluded under most of the criteria tested. The overall pattern of results showed that African Americans, low-income individuals, and individuals who had more severe alcohol, drug, and psychiatric problems were disproportionately excluded under most criteria.
CONCLUSIONS: Exclusion criteria can result in alcohol treatment outcome research samples that are more heavily composed of white, economically stable, and higher-functioning individuals than are real-world samples of substance abuse patients seen in clinical practice, potentially compromising the generalizability of results. For both scientific and ethical reasons, in addition to studies that use exclusion criteria, outcome research that uses no or minimal exclusion criteria should be conducted so that alcohol treatment outcome research can be better generalized to vulnerable populations.

Lavergne, M. R., Johnston, G. M., Gao, J., Dumont, S., & Burge, F. I. (2011). Exploring Generalizability in a Study of Costs for Community-Based Palliative Care. Journal of Pain & Symptom Management, 41(4), 779–787.

Context: Palliative care researchers face challenges recruiting and retaining study subjects.
Objectives: This article investigates selection, study site, and participation biases to assess generalizability of a cost analysis of palliative care program (PCP) clients receiving care at home.
Methods: Study subjects’ sociodemographic, geographic, survival, disease, and treatment characteristics were compared for the same year and region with those of three populations. Comparison I was with nonstudy subjects enrolled in the PCP to assess selection bias. Comparison II was with adults who died of cancer to assess study site bias. Comparison III was with study-eligible persons who declined to participate in order to assess participation bias.
Results: Comparison I: When compared with the other 1010 PCP clients, the 50 study subjects were on average 3.6 years younger (P =0.03), enrolled 70 days longer in the PCP (P <0.001), lived 6.7km closer to the PCP (P <0.0001), and were more likely to have cancer (96.0% vs. 86.4%, P =0.05). Comparison II: Compared with all cancer decedents, the 45 study subjects who died of cancer were on average 7.0 years younger (P <0.001), lived 2.7km closer to the PCP (P <0.001), and were more likely to have had radiotherapy (62.2% vs. 33.8%, P <0.0001) and medical oncology (28.9% vs. 14.8%, P =0.01) consultations. Comparison III: The 50 study subjects lived on average 42 days longer after their diagnosis (P =0.03) and 2.6km closer to the PCP (P =0.01) than the 110 eligible persons who declined to participate.
Conclusion: If the study findings are applied to populations that differ from the study subjects, inaccurate conclusions are possible. [Copyright Elsevier]

Licht, R. W. (2002). Limits of the applicability and generalizability of drug trials in mania. Bipolar Disorders, 4, 66–68. Academic Search Alumni Edition.

During recent years, the majority of drug trials in mania have been conducted for the purpose of drug approval. On this background, this paper addresses to what extent these trials may actually provide the practising clinician with useful information. One major point is that selection prior to the point of randomization in RCTs in mania may limit the applicability of study results to patients seen in ordinary clinical practice. Limitations in study credibility and study design are also discussed. The need for large scale pragmatic studies using broad inclusion criteria, comparing the various treatments, alone or in combination, is emphasized. [ABSTRACT FROM AUTHOR]

Medical Research Council Multicentre Otitis Media Study Group. (2001). Surgery for persistent otitis media with effusion: Generalizability of results from the UK trial (TARGET). Clinical Otolaryngology & Allied Sciences, 26(5), 417–424. Academic Search Alumni Edition.

TARGET (Trial of Alternative Regimens in Glue Ear Treatment) is a multicentre UK randomized controlled trial (RCT) comparing bilateral ventilation tubes with and without adjuvant adenoidectomy against non-surgical management in children with bilateral, persistent otitis media with effusion (OME). This paper compares the recruited and randomized children with those that, although eligible, were not included in the RCT for various reasons. This is necessary to identify any potential bias in the overall estimate of treatment effectiveness. At the first visit, 1315 children with OME satisfied the criteria of age (3 years 3 months-6 years 9 months), no previous ear or adenoid surgery, tympanometric evidence of fluid (bilateral B or B + C2) and a hearing loss (conductive loss in both ears of ≥20 dBHL). Of these children, 151 (11%) were not followed up because of overriding concern and 70 (5%) because of parental refusal. Of the 506 children eligible for randomization, because of persistence over 12 weeks of watchful waiting of bilateral OME with the same criteria, 20 (4%) were not randomized because of overriding concern and 75 (15%) because of parental refusal. The distribution of the potential effect modifiers was determined for each group. At the first visit, the only significant differences (P < 0.05), comparing those not recruited because of overriding concern with those recruited, were in respect of sex (61% girls compared with 52% boys) and hearing level (34.6 compared with 33.0 dBHL). At the second visit, the only significant difference involved less frequent upper respiratory tract infections (URTIs) in children whose parents refused to allow randomization (8% compared with 18% had had episodic URTI more often than once every 3 months). It is probable that the findings from the TARGET trial will translate to the entire clinic population in this age group as long as they meet the same audiometric and tympanometric criteria. [ABSTRACT FROM AUTHOR]

Morin-Ben Abdallah, S., Dutilleul, A., Nadon, V., Yang, J. W., Marchand-Sénécal, X., Van Nguyen, P., Lamarre-Cliche, M., Wistaff, R., Kolan, C., Laskine, M., & Durand, M. (2016). Quantification of the External Validity of Randomized Controlled Trials Supporting Clinical Care Guidelines: The Case of Thromboprophylaxis. American Journal of Medicine, 129(7), 740–745. Academic Search Alumni Edition.

Background: Clinical guidelines are based on the results of several randomized controlled trials. However, due to the stringent exclusion criteria of these trials, their external validity may be low. We aimed to evaluate the external validity of the randomized controlled trials cited in the American College of Chest Physicians guidelines for the use of pharmacological thromboprophylaxis in hospitalized medical patients.
Methods: We conducted a cross-sectional, chart-review study of a random sample of patients admitted between July 1, 2013 and June 30, 2014 to the Internal Medicine ward of a large Canadian teaching university hospital. We identified the proportion of our population presenting exclusion criteria used in the randomized controlled trials cited in support of clinical care guidelines on thromboprophylaxis in the medical setting.
Results: Nine trials were identified for a total of 28,793 included patients following 23 distinct exclusion criteria. We included 429 patients. Median age was 65 years (interquartile range 51-77 years), and 236 (55%) were males. Of those not already anticoagulated at admission (n = 351), between 26% and 67% (weighted average, 51%) of our population presented at least one exclusion criterion, making them ineligible to be enrolled in randomized controlled trials. When restricting our population to patients with an indication for thromboprophylaxis based on a Padua risk score at admission ≥4, 21% to 76% (weighted average 55%) were ineligible to be enrolled in individual trials.
Conclusions: Our cross-sectional study illustrates that the external validity of randomized controlled trials cited in the guidelines was low in our population, and lower when applying the risk-stratification tool recommended by guidelines. This can bias the clinicians toward treating patients that were not represented in the supporting evidence. [ABSTRACT FROM AUTHOR]

Okuda, M., Hasin, D. S., Olfson, M., Khan, S. S., Nunes, E. V., Montoya, I., Liu, S.-M., Grant, B. F., & Blanco, C. (2010). Generalizability of clinical trials for cannabis dependence to community samples. Drug and Alcohol Dependence, 111(1–2), 177–181.

Orr, L. L., Olsen, R. B., Bell, S. H., Schmid, I., Shivji, A., & Stuart, E. A. (2019). Using the results from rigorous multisite evaluations to inform local policy decisions. Journal of Policy Analysis and Management, 38(4), 978–1003.

Evidence-based policy at the local level requires predicting the impact of an intervention to inform whether it should be adopted. Increasingly, local policymakers have access to published research evaluating the effectiveness of policy interventions from national research clearinghouses that review and disseminate evidence from program evaluations. Through these evaluations, local policymakers have a wealth of evidence describing what works, but not necessarily where. Multisite evaluations may produce unbiased estimates of the average impact of an intervention in the study sample and still produce inaccurate predictions of the impact for localities outside the sample for two reasons: (1) the impact of the intervention may vary across localities, and (2) the evaluation estimate is subject to sampling error.
Unfortunately, there is relatively little evidence on how much the impacts of policy interventions vary from one locality to another and almost no evidence on the implications of this variation for the accuracy with which the local impact of adopting an intervention can be predicted using findings from an evaluation in other localities. In this paper, we present a set of methods for quantifying the accuracy of the local predictions that can be obtained using the results of multisite randomized trials and for assessing the likelihood that prediction errors will lead to errors in local policy decisions.
We demonstrate these methods using three evaluations of educational interventions, providing the first empirical evidence of the ability to use multisite evaluations to predict impacts in individual localities—i.e., the ability of “evidence-based policy” to improve local policy.

Lee, P. H., Kang, S. H., Han, S., Ahn, J.-M., Bae, J. S., Lee, C. H., Kang, S.-J., Lee, S.-W., Kim, Y.-H., Lee, C. W., Park, S.-W., Park, D.-W., & Park, S.-J. (2017). Generalizability of EXCEL and NOBLE results to a large registry population with unprotected left main coronary artery disease. Coronary Artery Disease, 28(8), 675–682. CINAHL.

Objective: The aim of this study was to determine how trial-based findings of EXCEL and NOBLE might be interpreted and generalizable in “real-world” settings with comparison of data from the large-scaled, all-comer Interventional Research Incorporation Society-Left MAIN Revascularization (IRIS-MAIN) registry.
Patients and Methods: We compared baseline clinical and procedural characteristics and also determined how the relative treatment effect of percutaneous coronary intervention (PCI) and coronary artery bypass grafting (CABG) was different in EXCEL and NOBLE, compared with those of the multicenter, IRIS-MAIN registry (n=2481). The primary outcome for between-study comparison was a composite of death, myocardial infarction (MI), or stroke.
Results: There were between-study differences in patient risk profiles (age, BMI, diabetes, and clinical presentation), lesion complexities, and procedural characteristics (stent type, the use of off-pump surgery, and radial artery); the proportion of diabetes and acute coronary syndrome was particularly lower in NOBLE than in other studies. Although there was interstudy heterogeneity for the protocol definition of MI, the risks for serious composite outcome of death, MI, or stroke were similar between PCI and CABG in EXCEL [hazard ratio (HR): 1.00; 95% confidence interval (CI): 0.79-1.26; P=0.98] and in the matched cohort of IRIS-MAIN (HR: 1.08; 95%CI: 0.85-1.38; P=0.53), whereas it was significantly higher after PCI than after CABG in NOBLE (HR: 1.47; 95%CI: 1.06-2.05; P=0.02), which was driven by more common MI and stroke after PCI.
Conclusion: In the comparison of a large-sized, all-comer registry, the EXCEL trial might represent better generalizability with respect to baseline characteristics and observed clinical outcomes compared with the NOBLE trial.

Pruchno, R. A., Brill, J. E., Shands, Y., Gordon, J. R., Genderson, M. W., Rose, M., & Cartwright, F. (2008). Convenience Samples and Caregiving Research: How Generalizable Are the Findings? The Gerontologist, 48(6), 820–827. Psychology Database.

Purpose: We contrast characteristics of respondents recruited using convenience strategies with those of respondents recruited by random digit dial (RDD) methods. We compare sample variances, means, and interrelationships among variables generated from the convenience and RDD samples.
Design and Methods: Women aged 50 to 64 who work full time and provide care to a community-dwelling older person were recruited using either RDD (N = 55) or convenience methods (N = 87). Telephone interviews were conducted using reliable, valid measures of demographics, characteristics of the care recipient, help provided to the care recipient, evaluations of caregiver-care recipient relationship, and outcomes common to caregiving research.
Results: Convenience and RDD samples had similar variances on 68.4% of the examined variables. We found significant mean differences for 63% of the variables examined. Bivariate correlations suggest that one would reach different conclusions using the convenience and RDD sample data sets.
Implications: Researchers should use convenience samples cautiously, as they may have limited generalizability. [PUBLICATION ABSTRACT]

Rothwell, P. M. (2005). External validity of randomised controlled trials: “To whom do the results of this trial apply?” The Lancet, 365(9453), 82–93.

Savoca, M. R., Ludwig, D. A., Jones, S. T., Jason Clodfelter, K., Sloop, J. B., Bollhalter, L. Y., & Bertoni, A. G. (2017). Geographic Information Systems to Assess External Validity in Randomized Trials. American Journal of Preventive Medicine, 53(2), 252–259. Academic Search Alumni Edition.

Introduction: To support claims that RCTs can reduce health disparities (i.e., are translational), it is imperative that methodologies exist to evaluate the tenability of external validity in RCTs when probabilistic sampling of participants is not employed. Typically, attempts at establishing post hoc external validity are limited to a few comparisons across convenience variables, which must be available in both sample and population. A Type 2 diabetes RCT was used as an example of a method that uses a geographic information system to assess external validity in the absence of a priori probabilistic community-wide diabetes risk sampling strategy.
Methods: A geographic information system, 2009-2013 county death certificate records, and 2013-2014 electronic medical records were used to identify community-wide diabetes prevalence. Color-coded diabetes density maps provided visual representation of these densities. Chi-square goodness of fit statistic/analysis tested the degree to which distribution of RCT participants varied across density classes compared to what would be expected, given simple random sampling of the county population. Analyses were conducted in 2016.
Results: Diabetes prevalence areas as represented by death certificate and electronic medical records were distributed similarly. The simple random sample model was not a good fit for death certificate record (chi-square, 17.63; p=0.0001) and electronic medical record data (chi-square, 28.92; p<0.0001). Generally, RCT participants were oversampled in high-diabetes density areas.
Conclusions: Location is a highly reliable “principal variable” associated with health disparities. It serves as a directly measurable proxy for high-risk underserved communities, thus offering an effective and practical approach for examining external validity of RCTs. [ABSTRACT FROM AUTHOR]

Schmoor, C., Olschewski, M., & Schumacher, M. (1996). Randomized and non-randomized patients in clinical trials: Experiences with comprehensive cohort studies. Statistics in Medicine, 15(3), 263–271.

In clinical research, randomized trials are widely accepted as the definitive method of evaluating the efficacy of therapies. Random assignment of patients to treatment ensures internal validity of the comparison of new treatments with controls. An assessment of external validity can best be achieved by comparing the randomized study sample to the population of patients who met the eligibility criteria but did not consent to randomization.
The Comprehensive Cohort Study (CCS) is designed to recruit all patients fulfilling the clinical eligibility criteria regardless of their consent to randomization. The CCS concept was adopted in the major clinical trials of the German Breast Cancer Study Group (GBSG) conducted between 1983 and 1989. In this period 124 centres recruited 2084 patients in three clinical trials. 734 (35 per cent) of these patients accepted being randomized, while 1350 (65 per cent) chose one of the treatments under study; the randomization rates differed remarkably between trials. In this paper we examine the representativeness of the randomized patients in the three trials.
Based on a median follow-up of about 5 years we present results on the external validity of the treatment effects estimated in the randomized patients by means of Cox’s proportional hazards model and compare them between trials. We discuss advantages and disadvantages of the CCS design and conclude that its use is only justified under extraordinary circumstances.

Stirman, S. W., DeRubeis, R. J., Crits-Christoph, P., & Rothman, A. (2005). Can the randomized controlled trial literature generalize to nonrandomized patients? Journal of Consulting and Clinical Psychology, 73(1), 127.

To determine the extent to which published randomized controlled trials (RCTs) of psychotherapy can be generalized to a sample of outpatients, the authors matched information obtained from charts of patients who had been screened out of RCTs to inclusion and exclusion criteria from published RCT studies. Most of the patients in the sample who had primary diagnoses represented in the RCT literature were judged eligible for at least 1 RCT. However, many patients in the sample with substance use disorders or social anxiety disorder were not eligible for at least 2 RCTs.
Common reasons that patients did not match with at least 2 published RCTs for psychotherapy included (a) patients were in partial remission, (b) patients failed to meet minimum severity or duration criteria, (c) patients were being treated with antidepressant medication, and (d) the disorder being studied was not primary (mostly for social anxiety patients). The implications of these findings for future research and clinical practice are discussed.

Stuart, E. A., Bell, S. H., Ebnesajjad, C., Olsen, R. B., & Orr, L. L. (2017). Characteristics of School Districts That Participate in Rigorous National Educational Evaluations. Journal of Research on Educational Effectiveness, 10(1), 168–206. ERIC.

Given increasing interest in evidence-based policy, there is growing attention to how well the results from rigorous program evaluations may inform policy decisions. However, little attention has been paid to documenting the characteristics of schools or districts that participate in rigorous educational evaluations, and how they compare to potential target populations for the interventions that were evaluated. Utilizing a list of the actual districts that participated in 11 large-scale rigorous educational evaluations, we compare those districts to several different target populations of districts that could potentially be affected by policy decisions regarding the interventions under study.
We find that school districts that participated in the 11 rigorous educational evaluations differ from the interventions’ target populations in several ways, including size, student performance on state assessments, and location (urban/rural). These findings raise questions about whether, as currently implemented, the results from rigorous impact studies in education are likely to generalize to the larger set of school districts–and thus schools and students–of potential interest to policymakers, and how we can improve our study designs to retain strong internal validity while also enhancing external validity.

Susukida, R., Crum, R. M., Stuart, E. A., Ebnesajjad, C., & Mojtabai, R. (2016). Assessing sample representativeness in randomized controlled trials: Application to the National Institute of Drug Abuse Clinical Trials Network. Addiction, 111(7), 1226–1234.

Aims: To compare the characteristics of individuals participating in randomized controlled trials (RCTs) of treatments of substance use disorder (SUD) with individuals receiving treatment in usual care settings, and to provide a summary quantitative measure of differences between characteristics of these two groups of individuals using propensity score methods.
Design: Analyses using data from RCT samples from the National Institute of Drug Abuse Clinical Trials Network (CTN) and target populations of patients drawn from the Treatment Episodes Data Set—Admissions (TEDS-A).
Settings: Multiple clinical trial sites and nation-wide usual SUD treatment settings in the United States.
Participants: A total of 3592 individuals from 10 CTN samples and 1 602 226 individuals selected from TEDS-A between 2001 and 2009.
Measurements: The propensity scores for enrolling in the RCTs were computed based on the following nine observable characteristics: sex, race/ethnicity, age, education, employment status, marital status, admission to treatment through criminal justice, intravenous drug use and the number of prior treatments.
Findings: The proportion of those with ≥ 12 years of education and the proportion of those who had full-time jobs were significantly higher among RCT samples than among target populations (in seven and nine trials, respectively, at P < 0.001). The pooled difference in the mean propensity scores between the RCTs and the target population was 1.54 standard deviations and was statistically significant at P < 0.001.
Conclusions: In the United States, individuals recruited into randomized controlled trials of substance use disorder treatments appear to be very different from individuals receiving treatment in usual care settings. Notably, RCT participants tend to have more years of education and a greater likelihood of full-time work compared with people receiving care in usual care settings.

Tipton, E., Spybrook, J., Fitzgerald, K. G., Wang, Q., & Davidson, C. (2021). Toward a System of Evidence for All: Current Practices and Future Opportunities in 37 Randomized Trials. Educational Researcher, 50(3), 145–156.

As a result of the evidence-based decision-making movement, the number of randomized trials evaluating educational programs and curricula has increased dramatically over the past 20 years. Policy makers and practitioners are encouraged to use the results of these trials to inform their decision making in schools and school districts. At the same time, however, little is known about the schools taking part in these randomized trials, both regarding how and why they were recruited and how they compare to populations in need of research. In this article, we report on a study of 37 cluster randomized trials funded by the Institute of Education Sciences between 2011 and 2015.
Principal investigators of these grants were interviewed regarding the recruitment process and practices. Additionally, data on the schools included in 34 of these studies were analyzed to determine the general demographics of schools included in funded research, as well as how these samples compare to important policy relevant populations.
We show that the types of schools included in research differ in a variety of ways from these populations. Large schools from large school districts in urban areas were overrepresented, whereas schools from small school districts in rural areas and towns are underrepresented. The article concludes with a discussion of how recruitment practices might be improved in order to meet the goals of the evidence-based decision-making movement.

Topp, L., Barker, B., & Degenhardt, L. (2004). The external validity of results derived from ecstasy users recruited using purposive sampling strategies. Drug & Alcohol Dependence, 73(1), 33. Academic Search Alumni Edition.

This study sought to compare the patterns and correlates of ‘recent’ and ‘regular’ ecstasy use estimated on the basis of two datasets generated in 2001 in New South Wales, Australia, from a probability and a non-probability sample. The first was the National Drug Strategy Household Survey (NDSHS), a multistage probability sample of the general population; and the second was the Illicit Drug Reporting System (IDRS) Party Drugs Module, for which regular ecstasy users were recruited using purposive sampling strategies. NDSHS recent ecstasy users (any use in the preceding 12 months) were compared on a range of demographic and drug use variables to NDSHS regular ecstasy users (at least monthly use in the preceding 12 months) and purposively sampled regular ecstasy users (at least monthly use in the preceding 6 months). The demographic characteristics of the three samples were consistent. Among all three, the mean age was approximately 25 years, and a majority (60%) of subjects were male, relatively well-educated, and currently employed or studying. Patterns of ecstasy use were similar among the three samples, although compared to recent users, regular users were likely to report more frequent use of ecstasy. All samples were characterised by extensive polydrug use, although the two samples of regular ecstasy users reported higher rates of other illicit drug use than the sample of recent users.
The similarities between the demographic and drug use characteristics of the samples are striking, and suggest that, at least in NSW, purposive sampling that seeks to draw from a wide cross-section of users and to sample a relatively large number of individuals, can give rise to samples of ecstasy users that may be considered sufficiently representative to reasonably warrant the drawing of inferences relating to the entire population. These findings may partially offset concerns that purposive samples of ecstasy users are likely to remain a primary source of ecstasy-related information. [Copyright Elsevier]

Travers, J., Marsh, S., Williams, M., Weatherall, M., Caldwell, B., Shirtcliffe, P., Aldington, S., & Beasley, R. (2007). External validity of randomised controlled trials in asthma: To whom do the results of the trials apply? Thorax, 62(3), 219–223. Academic Search Alumni Edition.

Background: Asthma is a heterogeneous disease with a wide range of clinical phenotypes, not all of which may be encompassed in the subjects included in randomised controlled trials (RCTs). This makes it difficult for clinicians to know to what extent the evidence derived from RCTs applies to a given patient. Aim: To calculate the proportion of individuals with asthma who would have been eligible for the major asthma RCTs from the data of a random community survey of respiratory health.
Methods: A postal survey was sent to 3500 randomly selected individuals aged 25-75 years. Respondents were invited to complete a detailed respiratory questionnaire and pulmonary function testing. Participants with current asthma were assessed against the eligibility criteria of the 17 major asthma RCTs cited in the Global Initiative for Asthma (GINA) guidelines. Findings: A total of 749 participants completed the full survey, of whom 179 had current asthma. A median 4% of participants with current asthma (range 0-36%) met the eligibility criteria for the included RCTs. A median 6% (range 0-43%) of participants with current asthma on treatment met the eligibility criteria. Interpretation: This study shows that the major asthma RCTs on which the GINA guidelines are based may have limited external validity as they have been performed on highly selected patient populations. Most of the participants with current asthma on treatment in the community would not have been eligible for these RCTs. [ABSTRACT FROM AUTHOR]

Wagner, T. H., Holman, W., Lee, K., Sethi, G., Ananth, L., Thai, H., & Goldman, S. (2011). The generalizability of participants in Veterans Affairs Cooperative Studies Program 474, a multi-site randomized cardiac bypass surgery trial. Contemporary Clinical Trials, 32(2), 260–266. Academic Search Alumni Edition.

Objective: The Department of Veterans Affairs (VA) Cooperative Studies Program (CSP) initiated a multi-site randomized trial (CSP 474) to determine graft patency between radial artery or saphenous vein grafts in coronary artery bypass surgery (CABG). In this paper, we describe the study and compare participants’ baseline characteristics to non-participants who received CABG surgery in the VA.
Method: We identified our participants in the VA administrative databases along with all other CABG patients who did not have a concomitant valve procedure between FY2003 and FY2008. We extracted demographic, clinical information and organizational information at the time of the surgery from the databases. We conducted multiple logistic regression to determine characteristics associated with participation at three levels: between participants and non-participants within participating sites, between participating sites and non-participating sites, between participants and all non-participants.
Results: Enrollment ended in early 2008. Participants were similar to non-participants across many parameters. Likewise, participating sites were also quite similar to non-participating sites, although participating sites had a higher volume of CABG surgery and a lower percentage of CABG patients with a prior inpatient mental health admission than non-participating sites. After controlling for site differences, CSP 474 participants were younger and had fewer co-morbid conditions than non-participants.
Conclusions: Participants were significantly younger than non-participants. Participants also had lower rates of some cardiac-related illness, including congestive heart failure, peripheral vascular disease, and cerebrovascular disease, than non-participants. [Copyright Elsevier]

Wisniewski, S. R., Rush, A. J., Nierenberg, A. A., Gaynes, B. N., Warden, D., Luther, J. F., McGrath, P. J., Lavori, P. W., Thase, M. E., & Fava, M. (2009). Can phase III trial results of antidepressant medications be generalized to clinical practice? A STAR*D report. American Journal of Psychiatry, 166(5), 599–607.

Last updated on Jun 17, 2022

Empirical Evidence on Generalizability

Allcott, H. (2015). Site selection bias in program evaluation. The Quarterly Journal of Economics, 130(3), 1117–1165.

Bell, S. H., Olsen, R. B., Orr, L. L., & Stuart, E. A. (2016). Estimates of external validity bias when impact evaluations select sites nonrandomly. Educational Evaluation and Policy Analysis, 38(2), 318–335.

Biener, L., DePue, J. D., Emmons, K. M., Linnan, L., & Abrams, D. B. (1994). Recruitment of work sites to a health promotion research trial. Implications for generalizability. Journal of Occupational Medicine: Official Publication of the Industrial Medical Association, 36(6), 631–636.

Blanco, C., Hoertel, N., Franco, S., Olfson, M., He, J.-P., López, S., González-Pinto, A., Limosin, F., & Merikangas, K. R. (2017). Generalizability of Clinical Trial Results for Adolescent Major Depressive Disorder. Pediatrics, 140(6).

Braslow, J. T., Duan, N., Starks, S. L., Polo, A., Bromley, E., & Wells, K. B. (2005). Generalizability of studies on mental health treatment and outcomes, 1981 to 1996. Psychiatric Services (Washington, D.C.), 56(10), 1261–1268.

Canevelli, M., Trebbastoni, A., Quarata, F., D’Antonio, F., Cesari, M., de Lena, C., & Bruno, G. (2017). External Validity of Randomized Controlled Trials on Alzheimer’s Disease: The Biases of Frailty and Biological Aging. Frontiers in Neurology, 8, 628.

Chaitoff, A., Niforatos, J. D., Gong, J., & Fischer, M. A. (2022). A Comparison of Individuals with Diabetes and EMPA-REG Trial Participants: Exploring Aspects of External Validity. Journal of General Internal Medicine.

Coppock, A., Leeper, T. J., & Mullinix, K. J. (2018). Generalizability of heterogeneous treatment effect estimates across samples. Proceedings of the National Academy of Sciences of the United States of America, 115(49), 12441–12446.

Gheorghe, A., Roberts, T., Hemming, K., & Calvert, M. (2015). Evaluating the Generalisability of Trial Results: Introducing a Centre- and Trial-Level Generalisability Index. PharmacoEconomics, 33(11), 1195–1214.

Greenhouse, J. B., Kaizar, E. E., Kelleher, K., Seltman, H., & Gardner, W. (2008). Generalizing from clinical trial data: A case study. The risk of suicidality among pediatric antidepressant users. Statistics in Medicine, 27(11), 1801–1813.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125(1–2), 241–270.

Howard-Pitney, B., Fortmann, S. P., & Killen, J. D. (2001). Generalizability of findings from a chewing tobacco cessation clinical trial. Nicotine & Tobacco Research: Official Journal of the Society for Research on Nicotine and Tobacco, 3(4), 347–352.

Hsu, S., Rosen, K. J., Cupertino, A., Temple, L., & Fleming, F. (2022). Generalizability of Randomized Controlled Trials in Rectal Cancer. Journal of Gastrointestinal Surgery, 26(2), 453–465.

Humphreys, K., & Weisner, C. (2000). Use of exclusion criteria in selecting research subjects and its effect on the generalizability of alcohol treatment outcome studies. American Journal of Psychiatry, 157(4), 588–594.

Lavergne, M. R., Johnston, G. M., Gao, J., Dumont, S., & Burge, F. I. (2011). Exploring Generalizability in a Study of Costs for Community-Based Palliative Care. Journal of Pain & Symptom Management, 41(4), 779–787.

Licht, R. W. (2002). Limits of the applicability and generalizability of drug trials in mania. Bipolar Disorders, 4, 66–68.

Medical Research Council Multicentre Otitis Media Study Group. (2001). Surgery for persistent otitis media with effusion: Generalizability of results from the UK trial (TARGET). Clinical Otolaryngology & Allied Sciences, 26(5), 417–424.

Okuda, M., Hasin, D. S., Olfson, M., Khan, S. S., Nunes, E. V., Montoya, I., Liu, S.-M., Grant, B. F., & Blanco, C. (2010). Generalizability of clinical trials for cannabis dependence to community samples. Drug and Alcohol Dependence, 111(1–2), 177–181.

Orr, L. L., Olsen, R. B., Bell, S. H., Schmid, I., Shivji, A., & Stuart, E. A. (2019). Using the results from rigorous multisite evaluations to inform local policy decisions. Journal of Policy Analysis and Management, 38(4), 978–1003.

Pruchno, R. A., Brill, J. E., Shands, Y., Gordon, J. R., Genderson, M. W., Rose, M., & Cartwright, F. (2008). Convenience Samples and Caregiving Research: How Generalizable Are the Findings? The Gerontologist, 48(6), 820–827.

Rothwell, P. M. (2005). External validity of randomised controlled trials: “to whom do the results of this trial apply?” The Lancet, 365(9453), 82–93.

Savoca, M. R., Ludwig, D. A., Jones, S. T., Jason Clodfelter, K., Sloop, J. B., Bollhalter, L. Y., & Bertoni, A. G. (2017). Geographic Information Systems to Assess External Validity in Randomized Trials. American Journal of Preventive Medicine, 53(2), 252–259.

Schmoor, C., Olschewski, M., & Schumacher, M. (1996). Randomized and non-randomized patients in clinical trials: Experiences with comprehensive cohort studies. Statistics in Medicine, 15(3), 263–271.

Stirman, S. W., DeRubeis, R. J., Crits-Christoph, P., & Rothman, A. (2005). Can the randomized controlled trial literature generalize to nonrandomized patients? Journal of Consulting and Clinical Psychology, 73(1), 127.

Stuart, E. A., Bell, S. H., Ebnesajjad, C., Olsen, R. B., & Orr, L. L. (2017). Characteristics of School Districts That Participate in Rigorous National Educational Evaluations. Journal of Research on Educational Effectiveness, 10(1), 168–206.

Susukida, R., Crum, R. M., Stuart, E. A., Ebnesajjad, C., & Mojtabai, R. (2016). Assessing sample representativeness in randomized controlled trials: Application to the National Institute of Drug Abuse Clinical Trials Network. Addiction, 111(7), 1226–1234.

Tipton, E., Spybrook, J., Fitzgerald, K. G., Wang, Q., & Davidson, C. (2021). Toward a System of Evidence for All: Current Practices and Future Opportunities in 37 Randomized Trials. Educational Researcher, 50(3), 145–156.

Topp, L., Barker, B., & Degenhardt, L. (2004). The external validity of results derived from ecstasy users recruited using purposive sampling strategies. Drug & Alcohol Dependence, 73(1), 33.

Travers, J., Marsh, S., Williams, M., Weatherall, M., Caldwell, B., Shirtcliffe, P., Aldington, S., & Beasley, R. (2007). External validity of randomised controlled trials in asthma: To whom do the results of the trials apply? Thorax, 62(3), 219–223.
