Other


Augustovski, F., Iglesias, C., Manca, A., Drummond, M., Rubinstein, A., & Martí, S. G. (2009). Barriers to Generalizability of Health Economic Evaluations in Latin America and the Caribbean Region. PharmacoEconomics, 27(11), 919–929. Academic Search Alumni Edition.

  • Use and acceptance of health economic evaluations (HEEs) have been much greater in developed than in developing nations. Nevertheless, while developing countries lag behind in the development of HEE methods, they could benefit from the progress made in other countries and concentrate on ways in which existing methods can be used or would need to be modified to fulfill their specific needs. HEEs, as context-specific tools, are not easily generalizable from setting to setting. Existing studies regarding the generalizability and transferability of HEEs have primarily been conducted in developed countries. Therefore, a legitimate question for policy makers in Latin America and the Caribbean region (LAC) is to what extent HEEs conducted in industrialized economies and in LAC are generalizable to LAC (trans-regional) and to other LAC countries (intra-regional), respectively.

  • We conducted a systematic review, searching the NHS Economic Evaluation Database (NHS EED), Office of Health Economics Health Economic Evaluation Database (HEED), LILACS (Latin America health bibliographic database) and NEVALAT (Latin American Network on HEE) to identify HEEs published between 1980 and 2004. We included individual patient- and model-based HEEs (cost-effectiveness, cost-utility, cost-benefit and cost-consequences analyses) that involved at least one LAC country. Data were extracted by three independent reviewers using a checklist validated by regional and international experts.

  • From 521 studies retrieved, 72 were full HEEs (39% randomized controlled trials [RCTs], 32% models, 17% non-randomized studies and 12% mixed trial-modeling approaches). Over one-third of the identified studies did not specifically report the type of HEE. Cost-effectiveness and cost-consequence analyses accounted for almost 80% of the studies. The three Latin American countries with the highest participation in HEE studies were Brazil, Argentina and Mexico.

  • While we found relatively good standards of reporting of the study question, population, interventions, comparators and conclusions, the overall reporting was poor and showed unfamiliarity with international guidelines (e.g., absence of incremental analysis and of discounting of long-term costs and effects). Analysis or description of place-to-place variability was infrequent. Of the 49 trial-based analyses, 43% were single centre, 33% multinational and 18% multicentre national. Main reporting problems included issues related to sample representativeness, data collection and data analysis. Of the 32 model-based studies (most commonly using epidemiological models), the main problems included the inadequacy of the search strategy, range selection for sensitivity analysis and theoretical justifications. There are a number of issues associated with the reporting and methodology used in multinational and local HEE studies relevant to LAC that preclude the assessment of their generalizability and potential transferability. Although the quality of reporting and methodology in model-based HEEs was somewhat higher than in trial-based HEEs, economic evaluation methodology was usually weak and less developed than the analysis of clinical data. Improving these aspects of LAC HEE studies is paramount to maximizing their potential benefits, such as increasing the generalizability/transferability of their results.

Baker, S. G., & Kramer, B. S. (2003). Randomized trials, generalizability, and meta-analysis: Graphical insights for binary outcomes. BMC Medical Research Methodology, 3, 10.

  • BACKGROUND: Randomized trials stochastically answer the question: “What would be the effect of treatment on outcome if one turned back the clock and switched treatments in the given population?” Generalizations to other subjects are reliable only if the particular trial is performed on a random sample of the target population. By considering an unobserved binary variable, we graphically investigate how randomized trials can also stochastically answer the question: “What would be the effect of treatment on outcome in a population with a possibly different distribution of an unobserved binary baseline variable that does not interact with treatment in its effect on outcome?”

  • METHOD: For three different outcome measures, absolute difference (DIF), relative risk (RR), and odds ratio (OR), we constructed a modified BK-Plot under the assumption that treatment has the same effect on outcome if either all or no subjects had a given level of the unobserved binary variable. (A BK-Plot shows the effect of an unobserved binary covariate on a binary outcome in two treatment groups; it was originally developed to explain Simpson’s paradox.)

  • RESULTS: For DIF and RR, but not OR, the BK-Plot shows that the estimated treatment effect is invariant to the fraction of subjects with an unobserved binary variable at a given level.

  • CONCLUSION: The BK-Plot provides a simple method to understand generalizability in randomized trials. Meta-analyses of randomized trials with a binary outcome that are based on DIF or RR, but not OR, will avoid bias from an unobserved covariate that does not interact with treatment in its effect on outcome.
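  • A short derivation clarifies why DIF and RR, but not OR, are invariant (the notation below is ours, not the authors’). Let λ be the fraction of subjects with the unobserved covariate at level U = 1, let p_u be the control-arm risk at level u, and assume, as the abstract stipulates, that treatment has the same effect at both levels. For a constant risk difference δ,

      \mathrm{DIF} = \bigl[\lambda(p_1+\delta) + (1-\lambda)(p_0+\delta)\bigr] - \bigl[\lambda p_1 + (1-\lambda)p_0\bigr] = \delta,

    which is free of λ. For a constant relative risk r,

      \mathrm{RR} = \frac{\lambda r p_1 + (1-\lambda) r p_0}{\lambda p_1 + (1-\lambda) p_0} = r.

    No such cancellation occurs for the odds ratio, because odds are a nonlinear function of risk: when p_1 ≠ p_0, the marginal OR generally differs from the common stratum-specific OR (the familiar non-collapsibility of the OR).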

Baker, S. G., & Kramer, B. S. (2008). Randomized trials for the real world: Making as few and as reasonable assumptions as possible. Statistical Methods in Medical Research, 17(3), 243–252.

  • The strength of the randomized trial to yield conclusions not dependent on assumptions applies only in an ideal setting. In the real world various complications such as loss-to-follow-up, missing outcomes, noncompliance and nonrandom selection into a trial force a reliance on assumptions. To handle real world complications, it is desirable to make as few and as reasonable assumptions as possible.
  • This article reviews four techniques for using a few reasonable assumptions to design or analyse randomized trials in the presence of specific real world complications: 1) a double sampling design for survival data to avoid strong assumptions about informative censoring, 2) sensitivity analysis for partially missing binary outcomes that uses the randomization to reduce the number of parameters specified by the investigator, 3) an estimate of the effect of treatment received in the presence of all-or-none compliance that requires reasonable assumptions, and 4) statistics for binary outcomes that avoid some assumptions for generalizing results to a target population.

Bisbee, J., Dehejia, R., Pop-Eleches, C., & Samii, C. (2017). Local Instruments, Global Extrapolation: External Validity of the Labor Supply-Fertility Local Average Treatment Effect. Journal of Labor Economics, 35, S99–S147. Business Source Alumni Edition.

  • We investigate the external validity of local average treatment effects (LATEs), specifically Angrist and Evans’s use of the same sex of the first two children as an instrumental variable for the effect of fertility on labor supply. We estimate their specification in 139 country-year censuses using Integrated Public Use Microdata Sample-International data. We compare each country-year’s actual LATE to the extrapolated LATE from other country-years.
  • We find that, with a sufficiently large reference sample, we extrapolate the treatment effect reasonably well, but the degree of accuracy depends on the extent of covariate similarity between the target and reference settings.
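  • The quantity being extrapolated is the Wald/IV ratio that same-sex designs recover. A minimal sketch in Python, with simulated data standing in for a census extract; every variable name and magnitude below is illustrative, not from the paper:

      import numpy as np

      def wald_late(y, d, z):
          """Wald/IV estimate of the LATE of treatment d on outcome y,
          using a binary instrument z (here, a same-sex-siblings indicator)."""
          z = z.astype(bool)
          first_stage = d[z].mean() - d[~z].mean()    # instrument -> fertility
          reduced_form = y[z].mean() - y[~z].mean()   # instrument -> labor supply
          return reduced_form / first_stage

      rng = np.random.default_rng(0)
      n = 100_000
      z = rng.integers(0, 2, n)                            # as-if random instrument
      d = (rng.random(n) < 0.30 + 0.20 * z).astype(float)  # fertility take-up
      y = -0.5 * d + rng.normal(size=n)                    # labor-supply outcome
      print(wald_late(y, d, z))                            # approx. -0.5

    Extrapolation in the paper’s sense amounts to predicting one country-year’s wald_late from estimates in other country-years, conditioning on covariate similarity.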

Bloom, H. S., Porter, K. E., & Society for Research on Educational Effectiveness (SREE). (2012). Assessing the Generalizability of Estimates of Causal Effects from Regression Discontinuity Designs. Society for Research on Educational Effectiveness; ERIC.

  • In recent years, the regression discontinuity design (RDD) has gained widespread recognition as a quasi-experimental method that, when used correctly, can produce internally valid estimates of the causal effects of a treatment, a program or an intervention (hereafter referred to as treatment effects).
  • In an RDD study, subjects or groups of subjects (e.g. students or schools) are rated according to a numeric index (a performance indicator, poverty measure, etc.) and treatment assignment is determined by whether one’s rating falls above or below an exogenously defined cut-point value of the rating. RDDs have been used to estimate causal effects in a variety of contexts (e.g. for a list of more than 75 studies in the contexts of education, labor markets, political economy, health, crime and more see Lee & Lemieux, 2009), and research on their statistical properties has provided theoretical justification and empirical verification of their internal validity.
  • This paper explores the conditions that limit the generalizability of RDD estimates and concludes that in many cases generalizability is much greater than often believed. It also presents an empirical approach for quantifying the generalizability of RDD findings so that more information can be brought to bear on this important issue.
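  • To make the design concrete: a sharp-RDD effect can be estimated by fitting a line on each side of the cut-point within a bandwidth and differencing the fits at the cutoff. This is a minimal sketch under assumed data, not the authors’ procedure; real applications use data-driven bandwidths and kernel weights.

      import numpy as np

      def rdd_estimate(rating, outcome, cutoff, bandwidth):
          """Sharp-RDD estimate: difference of local linear fits at the cutoff."""
          left = (rating >= cutoff - bandwidth) & (rating < cutoff)
          right = (rating >= cutoff) & (rating <= cutoff + bandwidth)
          b_left = np.polyfit(rating[left], outcome[left], 1)     # line below cutoff
          b_right = np.polyfit(rating[right], outcome[right], 1)  # line above cutoff
          return np.polyval(b_right, cutoff) - np.polyval(b_left, cutoff)

      rng = np.random.default_rng(1)
      rating = rng.uniform(0, 100, 20_000)
      treated = rating >= 50.0                     # assignment by the cut-point
      outcome = 0.02 * rating + 0.5 * treated + rng.normal(0, 1, 20_000)
      print(rdd_estimate(rating, outcome, cutoff=50.0, bandwidth=10.0))  # approx. 0.5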

Bryan, C. J., Tipton, E., & Yeager, D. S. (2021). Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8), 980–989.

  • In the past decade, behavioural science has gained influence in policymaking but suffered a crisis of confidence in the replicability of its findings. Here, we describe a nascent heterogeneity revolution that we believe these twin historical trends have triggered. This revolution will be defined by the recognition that most treatment effects are heterogeneous, so the variation in effect estimates across studies that defines the replication crisis is to be expected as long as heterogeneous effects are studied without a systematic approach to sampling and moderation. When studied systematically, heterogeneity can be leveraged to build more complete theories of causal mechanism that could inform nuanced and dependable guidance to policymakers. We recommend investment in shared research infrastructure to make it feasible to study behavioural interventions in heterogeneous and generalizable samples, and suggest low-cost steps researchers can take immediately to avoid being misled by heterogeneity and begin to learn from it instead.

Chang, Y. (2013). Variable Selection via Regression Trees in the Presence of Irrelevant Variables. Communications in Statistics: Simulation & Computation, 42(8), 1703–1726. Academic Search Alumni Edition.

  • Many tree algorithms have been developed for regression problems. Although they are regarded as good algorithms, most of them suffer from loss of prediction accuracy when there are many irrelevant variables and the number of predictors exceeds the number of observations.
  • We propose the multistep regression tree with adaptive variable selection to handle this problem. The variable selection step and the fitting step comprise the multistep method. The multistep generalized unbiased interaction detection and estimation (GUIDE) with adaptive forward selection (fg) algorithm, as a variable selection tool, performs better than some of the well-known variable selection algorithms such as efficacy adaptive regression tube hunting (EARTH), FSR (false selection rate), LSCV (least squares cross-validation), and LASSO (least absolute shrinkage and selection operator) for the regression problem.
  • The results based on a simulation study show that fg outperforms the other algorithms in terms of selection results and computation time. It generally selects the important variables correctly with relatively few irrelevant variables, which gives good prediction accuracy with less computation time.
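  • GUIDE and the fg algorithm have no standard Python implementation, so the sketch below only mirrors the two-step structure the abstract describes (screen variables first, then fit a tree on the survivors), using off-the-shelf scikit-learn pieces as stand-ins:

      import numpy as np
      from sklearn.feature_selection import SelectKBest, f_regression
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(2)
      n, p = 200, 500                    # many irrelevant predictors, p >> n
      X = rng.normal(size=(n, p))
      y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)  # only 2 relevant columns

      # Step 1: variable selection (univariate screening as a placeholder
      # for GUIDE's adaptive forward selection).
      selector = SelectKBest(f_regression, k=10).fit(X, y)
      # Step 2: fit the regression tree on the selected variables only.
      tree = DecisionTreeRegressor(max_depth=3).fit(selector.transform(X), y)
      print(np.flatnonzero(selector.get_support()))  # should include 0 and 1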

Dahabreh, I. J., Petito, L. C., Robertson, S. E., Hernán, M. A., & Steingrimsson, J. A. (2020). Toward causally interpretable meta-analysis: Transporting inferences from multiple randomized trials to a new target population. Epidemiology, 31(3), 334–344.

  • We take steps toward causally interpretable meta-analysis by describing methods for transporting causal inferences from a collection of randomized trials to a new target population, one trial at a time and pooling all trials.
  • We discuss identifiability conditions for average treatment effects in the target population and provide identification results. We show that the assumptions that allow inferences to be transported from all trials in the collection to the same target population have implications for the law underlying the observed data. We propose average treatment effect estimators that rely on different working models and provide code for their implementation in statistical software.
  • We discuss how to use the data to examine whether transported inferences are homogeneous across the collection of trials, sketch approaches for sensitivity analysis to violations of the identifiability conditions, and describe extensions to address nonadherence in the trials. Last, we illustrate the proposed methods using data from the Hepatitis C Antiviral Long-Term Treatment Against Cirrhosis Trial.
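  • A common building block for this kind of transport is inverse-odds-of-participation weighting. The sketch below shows only that general idea for a single trial and target sample, with our own variable layout; it is not the authors’ estimator (the paper supplies code for its own estimators, which rely on different working models).

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def transported_ate(x_trial, t, y, x_target):
          """Transport a trial's treatment effect to a target population by
          weighting trial subjects by the odds of target membership given
          covariates. A sketch of the general idea, not the paper's method."""
          X = np.vstack([x_trial, x_target])
          s = np.r_[np.ones(len(x_trial)), np.zeros(len(x_target))]  # 1 = in trial
          p = LogisticRegression().fit(X, s).predict_proba(x_trial)[:, 1]
          w = (1.0 - p) / p                 # inverse-odds-of-participation weights
          treated, control = t == 1, t == 0
          mu1 = np.sum(w[treated] * y[treated]) / np.sum(w[treated])
          mu0 = np.sum(w[control] * y[control]) / np.sum(w[control])
          return mu1 - mu0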

Fuller, J. (2013). Rationality and the generalization of randomized controlled trial evidence. Journal of Evaluation in Clinical Practice, 19(4), 644–647. Academic Search Alumni Edition.

The author discusses the rationality and generalization of the evidence of randomized controlled trial (RCT). The author cites the systematic review conducted by P. Post and colleagues which showed that few authors have introduced an approach for generalizing the efficacy results of RTCs in the medical literature. The author concludes that the systematic review confirmed the absence of consensus for the best method of generalizing RCT efficiency results.

Green, J., Roberts, H., Petticrew, M., Steinbach, R., Goodman, A., Jones, A., & Edwards, P. (2015). Integrating quasi-experimental and inductive designs in evaluation: A case study of the impact of free bus travel on public health. Evaluation, 21(4), 391–406.

Evaluations of natural experiments in public policy are typically considered ?weak? evidence. Challenges include: making credible claims for causal inference (internal validity); generalizing beyond the case (external validity); and providing useful evidence for decision makers. In public health, where experimental evidence is encouraged by funders and enjoys a degree of rhetorical favour, in theory if not practice, current guidance for evaluating natural experiments focuses largely on methods for strengthening internal validity. Using a case study of the evaluation of free bus travel for young people in London, UK, we demonstrate a pragmatic approach to strengthening both internal and external validity in evaluations through integrating the logic of quasi-experimental methods with inductive qualitative analysis. Combining theoretical and inductive analysis in this way to address questions of policy interest through evaluations of natural experiments may be fruitful, and have methodological advantages over randomized designs. Tags: Other

Green, K. M., & Stuart, E. A. (2014). Examining moderation analyses in propensity score methods: Application to depression and substance use. Journal of Consulting and Clinical Psychology, 82(5), 773.

  • Objective: This study provides guidance on how propensity score methods can be combined with moderation analyses (i.e., effect modification) to examine subgroup differences in potential causal effects in nonexperimental studies. As a motivating example, we focus on how depression may affect subsequent substance use differently for men and women.
  • Method: Using data from a longitudinal community cohort study (N = 952) of urban African Americans with assessments in childhood, adolescence, young adulthood, and midlife, we estimate the influence of depression by young adulthood on substance use outcomes in midlife, and whether that influence varies by gender. We illustrate and compare 5 different techniques for estimating subgroup effects using propensity score methods, including separate propensity score models and matching for men and women; a joint propensity score model for men and women with matching separately and together by gender; and a joint male/female propensity score model that includes theoretically important gender interactions, with matching separately and together by gender.
  • Results: Analyses showed that estimating separate models for men and women yielded the best balance and, therefore, is a preferred technique when subgroup analyses are of interest, at least in these data. Results also showed substance use consequences of depression but no significant gender differences.
  • Conclusions: It is critical to prespecify subgroup effects before the estimation of propensity scores and to check balance within subgroups regardless of the type of propensity score model used. Results also suggest that depression may affect multiple substance use outcomes in midlife for both men and women relatively equally.
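  • A minimal sketch of the workflow these conclusions imply: fit a separate propensity model within each prespecified subgroup and check balance there. Weighting stands in for the matching used in the article, and all variable names are ours.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def weighted_smd(x, d, w):
          """Standardized mean difference of covariate x, treated vs. control,
          after applying weights w."""
          t, c = d == 1, d == 0
          diff = np.average(x[t], weights=w[t]) - np.average(x[c], weights=w[c])
          pooled_sd = np.sqrt((x[t].var(ddof=1) + x[c].var(ddof=1)) / 2)
          return diff / pooled_sd

      def subgroup_balance(X, d, group):
          """Fit a separate propensity model within each subgroup and report
          post-weighting balance there, as the article recommends."""
          for g in np.unique(group):
              m = group == g
              ps = LogisticRegression().fit(X[m], d[m]).predict_proba(X[m])[:, 1]
              w = np.where(d[m] == 1, 1.0, ps / (1.0 - ps))  # ATT odds weights
              for j in range(X.shape[1]):
                  print(f"group={g} covariate={j} "
                        f"SMD={weighted_smd(X[m][:, j], d[m], w):.2f}")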

Hanushek, E. A. (2021). Addressing cross-national generalizability in educational impact evaluation. International Journal of Educational Development, 80. Academic Search Alumni Edition.

  • Highlights: Institutional variations limit the generalizability of country-specific educational evaluations. International differences in school institutions are important for generalizability. Strong internal validity does not ensure the ability to generalize across countries.
  • Evaluation of educational programs has accelerated dramatically in the past quarter century. With this expansion has come clear methodological improvement, involving randomized control studies and other approaches for establishing causation that considerably strengthen their internal validity. Such studies are, however, conducted within individual countries with the institutional structure of their schools and national labor markets, and they are seldom replicated either within or across countries. A natural question is whether the results of an individual high-quality educational evaluation in one country can reasonably be applied in other countries. This paper focuses on existing research into differences across countries that, while generally impossible to incorporate into program evaluations, potentially have direct effects on key elements of policy and on the outcomes that can be expected. In particular, available cross-national studies on a variety of topics suggest caution when generalizing evaluation results across countries, because student results are likely to vary systematically with a number of fundamental country-level institutional characteristics that are not explicitly considered in within-country evaluation analyses. Unfortunately, there is currently too little replication of basic research studies to provide explicit guidance on when and where cross-national generalizations are possible.

Kaplan, A., Cromley, J., Perez, T., Dai, T., Mara, K., & Balsai, M. (2020). The Role of Context in Educational RCT Findings: A Call to Redefine “Evidence-Based Practice.” Grantee Submission. ERIC.

  • In this commentary, we complement other constructive critiques of educational randomized control trials (RCTs) by calling attention to the commonly ignored role of context in causal mechanisms undergirding educational phenomena. We argue that evidence for the central role of context in causal mechanisms challenges the assumption that RCT findings can be uncritically generalized across settings. Anchoring our argument with an example from our own multi-study RCT project, we argue that the scientific pursuit of causal explanation should involve the rich description of contextualized causal effects. We further call for incorporating the evidence of the integral role of context in causal mechanisms into the meaning of “evidence-based practice,” with the implication that effective implementation of practice in a new setting must involve context-oriented, evidence-focused, design-based research that attends to the emergent, complex, and dynamic nature of educational contexts. [This article was published in “Educational Researcher” v49 n4 p285-288 2020.]

Miller, L. C., Shaikh, S. J., Jeong, D. C., Wang, L., Gillig, T. K., Godoy, C. G., Appleby, P. R., Corsbie-Massay, C. L., Marsella, S., Christensen, J. L., & Read, S. J. (2019). Causal Inference in Generalizable Environments: Systematic Representative Design. Psychological Inquiry, 30(4), 173–202. Psychology Database.

  • Causal inference and generalizability both matter. Historically, systematic designs emphasize causal inference, while representative designs focus on generalizability. Here, we suggest a transformative synthesis, Systematic Representative Design (SRD), that concurrently enhances both causal inference and “built-in” generalizability by leveraging today’s intelligent-agent, virtual-environment, and other technologies. In SRD, a “default control group” (DCG) can be created in a virtual environment by representatively sampling from real-world situations. Experimental groups can be built with systematic manipulations onto the DCG base. Applying systematic design features (e.g., random assignment to DCG versus experimental groups) in SRD affords valid causal inferences. After explicating the proposed SRD synthesis, we delineate how the approach concurrently advances generalizability and robustness, cause-effect inference and precision science, and a computationally enabled cumulative psychological science that supports both “bigger theory” and concrete implementations, grappling with tough questions (e.g., what is context?) and affording rapidly scalable interventions for real-world problems.

Olsen, R. B., Bell, S. H., & Nichols, A. (2018). Using preferred applicant random assignment (PARA) to reduce randomization bias in randomized trials of discretionary programs. Journal of Policy Analysis and Management, 37(1), 167–180.

  • Randomization bias occurs when the random assignment used to estimate program effects influences the types of individuals that participate in a program. This paper focuses on a form of randomization bias called “applicant inclusion bias,” which can occur in evaluations of discretionary programs that normally choose which of the eligible applicants to serve. If this nonrandom selection process is replaced by a process that randomly assigns eligible applicants to receive the intervention or not, the types of individuals served by the program, and thus its average impact on program participants, could be affected.
  • To estimate the impact of discretionary programs for the individuals that they normally serve, we propose an experimental design called Preferred Applicant Random Assignment (PARA). Prior to random assignment, program staff would identify their “preferred applicants,” those that they would have chosen to serve. All eligible applicants are randomly assigned, but the probability of assignment to the program is set higher for preferred applicants than for the remaining applicants.
  • This paper demonstrates the feasibility of the method, the cost in terms of increased sample size requirements, and the benefit in terms of improved generalizability to the population normally served by the program.
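  • The assignment step itself is simple to sketch; the probabilities below are illustrative assumptions, not the paper’s actual parameterization, which is derived from design considerations.

      import numpy as np

      def para_assign(preferred, p_pref=0.8, p_other=0.4, seed=0):
          """Preferred Applicant Random Assignment: every eligible applicant is
          randomized, but staff-designated preferred applicants get a higher
          treatment probability. Probabilities here are illustrative only."""
          rng = np.random.default_rng(seed)
          p = np.where(preferred, p_pref, p_other)
          treated = rng.random(len(preferred)) < p
          # Keep p with the data: estimates for the normally-served population
          # reweight observations by 1/p (treated) or 1/(1 - p) (control).
          return treated, p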

Schochet, P. Z., Puma, M., Deke, J., & National Center for Education Evaluation and Regional Assistance (ED). (2014). Understanding Variation in Treatment Effects in Education Impact Evaluations: An Overview of Quantitative Methods. NCEE 2014-4017. National Center for Education Evaluation and Regional Assistance; ERIC.

  • This report summarizes the complex research literature on quantitative methods for assessing how impacts of educational interventions on instructional practices and student learning differ across students, educators, and schools. It also provides technical guidance about the use and interpretation of these methods. The research topics addressed include: subgroup (moderator) analyses based on study participants’ characteristics measured “before” the intervention is implemented; subgroup analyses based on study participants’ experiences, mediators, and outcomes measured “after” program implementation; and impact estimation when treatment effects vary. The focus is on randomized controlled trials, but the methods are also applicable to quasi-experimental designs. [This report was prepared for the Institute of Education Sciences (IES) by Decision Information Resources, Inc. under Contract ED-IES-12-C-0057, Analytic Technical Assistance and Development.]

Shaw, S. R., & D’Intino, J. (2017). Evidence-Based Practice and the Reproducibility Crisis in Psychology. Communique, 45(5), 1–21. ERIC.

  • Evidence-based practice (EBP) is the norm and expectation for providing interventions in school psychology. However, there are two major hurdles to be addressed before EBP can be a true improvement in providing educational and psychological interventions to children and families. The first challenge is that the standard of clinical research limits the generalizability or application of findings to any given situation, as nearly all research, including clinical research, is not replicated or reproduced, which limits the implementation of any intervention that might be based upon that research. A second challenge emerges from the first, as clearer standards need to be established regarding the implementation of EBP to allow for the proper integration of research, clinical judgment, client experience and need, and local or cultural constraints.
  • These challenges can be addressed through a concerted effort and partnership by researchers and clinicians with the goal of improving child and family outcomes. The first steps towards doing so are defining new standards for reproducibility and replication in clinical research, as well as EBP, which can then contribute to a balanced implementation approach.

Smith, N. L., & Caulley, D. N. (1979). Post-Evaluation Determination of a Program’s Generalizability. Evaluation and Program Planning: An International Journal, 2(4), 297–302. ERIC.

  • Literature on the generalizability of program effects focuses on the a priori development of evaluation designs which enable certain generalizations to be made. Secondary analysis procedures are suggested that can be employed, using existing evaluation data, to estimate a program’s generalizability when follow-up field studies are not feasible.

Smith Slep, A. M., Heyman, R. E., Williams, M. C., Van Dyke, C. E., & O’Leary, S. G. (2006). Using Random Telephone Sampling to Recruit Generalizable Samples for Family Violence Studies. Journal of Family Psychology, 20(4), 680. Psychology Database.

  • Convenience sampling methods predominate in recruiting for laboratory-based studies within clinical and family psychology. The authors used random digit dialing (RDD) to determine whether they could feasibly recruit generalizable samples for 2 studies (a parenting study and an intimate partner violence study).
  • The RDD screen response rate was 42-45%; demographics matched those in the 2000 U.S. Census, with small- to medium-sized differences on race, age, and income variables. RDD respondents who qualified for, but did not participate in, the laboratory study of parents showed small differences on income, couple conflicts, and corporal punishment. Time and cost are detailed, suggesting that RDD may be a feasible, effective method by which to recruit more generalizable samples for in-laboratory studies of family violence when those studies have sufficient resources.

Thompson, A. J., & Pickett, J. T. (2020). Are Relational Inferences from Crowdsourced and Opt-in Samples Generalizable? Comparing Criminal Justice Attitudes in the GSS and Five Online Samples. Journal of Quantitative Criminology, 36(4), 907–932. Academic Search Alumni Edition.

  • Objectives: Similar to researchers in other disciplines, criminologists increasingly are using online crowdsourcing and opt-in panels for sampling, because of their low cost and convenience. However, online non-probability samples’ “fitness for use” will depend on the inference type and outcome variables of interest. Many studies use these samples to analyze relationships between variables. We explain how selection bias—when selection is a collider variable—and effect heterogeneity may undermine, respectively, the internal and external validity of relational inferences from crowdsourced and opt-in samples. We then examine whether such samples yield generalizable inferences about the correlates of criminal justice attitudes specifically.
  • Methods: We compare multivariate regression results from five online non-probability samples drawn either from Amazon Mechanical Turk or an opt-in panel to those from the General Social Survey (GSS). The online samples include more than 4500 respondents nationally and four outcome variables measuring criminal justice attitudes. We estimate identical models for the online non-probability and GSS samples.
  • Results: Regression coefficients in the online samples are normally in the same direction as the GSS coefficients, especially when they are statistically significant, but they differ considerably in magnitude; more than half (54%) fall outside the GSS’s 95% confidence interval.
  • Conclusions: Online non-probability samples appear useful for estimating the direction but not the magnitude of relationships between variables, at least absent effective model-based adjustments. However, adjusting only for demographics, either through weighting or statistical control, is insufficient. We recommend that researchers conduct both a provisional generalizability check and a model-specification test before using these samples to make relational inferences.
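  • The paper’s headline check, whether an online sample’s coefficients fall inside the benchmark’s 95% confidence intervals, is easy to operationalize. A sketch with statsmodels, assuming identically coded covariate matrices in both samples and linear models for simplicity (the paper’s own specifications differ):

      import numpy as np
      import statsmodels.api as sm

      def outside_benchmark_ci(X_bench, y_bench, X_online, y_online):
          """Fit the same specification in a benchmark (GSS-like) sample and an
          online sample; flag online coefficients outside the benchmark 95% CI."""
          fit_bench = sm.OLS(y_bench, sm.add_constant(X_bench)).fit()
          fit_online = sm.OLS(y_online, sm.add_constant(X_online)).fit()
          lo, hi = fit_bench.conf_int().T   # per-coefficient 95% bounds
          return (fit_online.params < lo) | (fit_online.params > hi)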

Tipton, E. A., & Hartman, E. (Working paper). Generalizability and Transportability. In E. Stuart, P. Rosenbaum, D. Small, & J. Zubizarreta (Eds.), Handbook of Multivariate Matching and Weighting.

[No Abstract]

Tipton, E., Fellers, L., Caverly, S., Vaden-Kiernan, M., Borman, G., Sullivan, K., & Ruiz de Castillo, V. (2015). Site Selection in Experiments: A Follow-Up Evaluation of Site Recruitment in Two Scale-Up Studies. Society for Research on Educational Effectiveness.

  • Randomized experiments are commonly used to evaluate if particular interventions improve student achievement. While these experiments can establish that a treatment actually “causes” changes, typically the participants are not randomly selected from a well-defined population and therefore the results do not readily generalize.
  • Three streams of research methodologies have been developed to improve generalizations from large-scale experiments: (1) “assessing” the degree of similarity between the convenience sample of schools or districts in a completed experiment and the population (e.g., Stuart, Cole, Bradshaw, & Leaf, 2011; Olsen, Orr, Bell, & Stuart, 2013; Tipton, in press); (2) “reweighting” this convenience sample to be more similar to one or more well-defined inference populations (e.g., O’Muircheartaigh & Hedges, 2014; Tipton, 2013); and (3) improving generalizability through design and improved recruitment strategies (e.g., Tipton et al., 2014; Tipton, 2014; Roschelle et al., 2014). Tipton et al. (2014) provide a design-based approach that uses propensity score methodology to first compare an inference population to those eligible for recruitment in the experiment, and then creates strata for site selection. The goal is to help recruiters create a recruitment strategy that is targeted and that, when perfectly implemented, results in a sample of sites that is like a miniature of the inference population of interest.
  • This paper is a follow-up study to the examples proposed and carried out in Tipton et al. (2014), with the goal of evaluating the success of these methods in practice and addressing additional problems that arose in recruitment.
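  • In outline, the Tipton et al. (2014) procedure scores each population unit on its propensity to be in the recruitable pool, cuts the population into strata on that score, and allocates recruitment targets proportionally, so that a perfectly executed plan yields a miniature of the inference population. A compressed sketch, with our own variable layout and quantile strata standing in for the paper’s stratification step:

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LogisticRegression

      def recruitment_plan(pop_X, recruitable, n_sites, n_strata=5):
          """Stratified site-selection sketch: propensity-score the population,
          form score strata, and set per-stratum recruitment targets."""
          ps = LogisticRegression().fit(pop_X, recruitable).predict_proba(pop_X)[:, 1]
          strata = pd.qcut(ps, q=n_strata, labels=False)
          shares = pd.Series(strata).value_counts(normalize=True).sort_index()
          targets = (shares * n_sites).round().astype(int)
          return strata, targets   # recruit targets[s] sites from stratum s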

Tipton, E., Hallberg, K., Hedges, L. V., Chan, W., & Society for Research on Educational Effectiveness (SREE). (2015). Implications of Small Samples for Generalization: Adjustments and Rules of Thumb. Society for Research on Educational Effectiveness; ERIC.

  • Policy-makers are frequently interested in understanding how effective a particular intervention may be for a specific (and often broad) population. In many fields, particularly education and social welfare, the ideal form of these evaluations is a large-scale randomized experiment.
  • Recent research has highlighted that sites in these large-scale experiments are typically not randomly sampled from the population, making generalizations difficult. A problem not addressed by this literature is the effect of “small” sample sizes in generalization.
  • This paper addresses three questions regarding the effect of small sample sizes on: (1) assessments of generalizability; (2) rules of thumb for covariate balance; and (3) properties of estimators and estimation strategies. The authors compare results from rare-events logistic regression (RE) and standard logistic regression to determine if and when small-sample corrections matter. This study investigates these issues in relation to sample sizes that vary from 30 to 70 clusters and in studies that are cluster-randomized or multi-site (random block) in design. The data examined were drawn from a cluster randomized controlled trial (Konstantopoulos, Miller, and Van der Ploeg, 2013) that was designed to study the effect of Indiana’s benchmark assessment system on student achievement in mathematics and English Language Arts (ELA), based on annual Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) scores. Fifty-six K-8 schools volunteered to implement the system in the 2009-10 school year. Of these, 34 were randomly assigned to the state’s benchmark assessment system while 22 served as controls. Data from the experiment were supplemented by data on all of the other K-8 schools in the state of Indiana, which were used to define the inference population.
  • Based on simulation results, findings include: (1) the standardized mean differences (|SMD|) for the RE logits were typically much smaller than those for the standard logits and, more importantly, were in line with the |SMD|s for the individual covariates; (2) the degree of imbalance between a sample and a population is much larger under random sampling than would be expected by the rules of thumb commonly in place in propensity score methods; and (3) the problem of small sample sizes limiting the number of equal-population strata possible in generalization is likely to arise simply by a change in random samples. Propensity score matching methods can be used to improve the generalizability of findings from randomized experiments with non-probability samples, but adjustments and new rules of thumb are necessary when applying these methods in this context.
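  • The |SMD| diagnostic in finding (1) is straightforward to compute. A minimal sketch comparing an experimental sample with its inference population one covariate at a time; the 0.25-style cutoff mentioned in the comment is a commonly cited rule of thumb from the broader propensity score literature the paper re-examines, not a value from this study.

      import numpy as np

      def generalizability_smds(sample_X, pop_X):
          """Absolute standardized mean differences between a sample and its
          inference population, computed covariate by covariate."""
          diff = sample_X.mean(axis=0) - pop_X.mean(axis=0)
          pooled_sd = np.sqrt((sample_X.var(axis=0, ddof=1)
                               + pop_X.var(axis=0, ddof=1)) / 2)
          return np.abs(diff / pooled_sd)   # compare against, e.g., 0.25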

Tipton, E., & Hedges, L. V. (2017). The role of the sample in estimating and explaining treatment effect heterogeneity. Journal of Research on Educational Effectiveness, 10(4), 903–906.

  • The article discusses the role of the sample in estimating and explaining treatment effect heterogeneity. Topics discussed include collection of samples for analyzing population average treatment effect; analyzing samples in elementary schools; and variation in key demographics while collecting the samples.

Tipton, E., Sullivan, K., Hedges, L., Vaden-Kiernan, M., Borman, G., Caverly, S., & Society for Research on Educational Effectiveness (SREE). (2011). Designing a Sample Selection Plan to Improve Generalizations from Two Scale-Up Experiments. Society for Research on Educational Effectiveness; ERIC.

  • In this paper the authors present a new method for sample selection for scale-up experiments. This method uses propensity score matching methods to create a sample that is similar in composition to a well-defined generalization population. The method they present is flexible and practical in the sense that it identifies units to be targeted for recruitment and, when they are not available, identifies similar units for replacement. Additionally, this method helps researchers determine which areas of the population may be most difficult to recruit from, enabling resources to be allocated accordingly.

Woodhead, M. (1985). Pre-School Education Has Long-Term Effects: But Can They Be Generalised? Oxford Review of Education, 11(2), 133–155. ERIC.

  • British and U.S. preschool intervention projects are now reporting dramatic long-term follow-up findings that appear to vindicate the claim that preschool can serve as an “inoculation against failure,” especially with disadvantaged children. However, important questions remain about the generalizability of these effects in other cultural settings.