Sample Selection Methods


Olsen, R. B. (2022). Using Survey Data to Obtain More Representative Site Samples for Impact Studies. arXiv preprint arXiv:2201.05221.

  • To improve the generalizability of impact evaluations, recent research has examined statistical methods for selecting representative samples of sites. However, these methods rely on having rich data on impact moderators for all sites in the target population.
  • This paper offers a new approach to selecting sites for impact studies when rich data on impact moderators are available, but only from a survey based on a representative sample of the impact study’s target population. Survey data are used to (1) estimate the proportion of sites in the population with certain characteristics, and (2) set limits on the number of sites with different characteristics that the sample can include. The Principal Investigator enforces the limits to ensure that certain types of sites are not overrepresented in the final sample. These limits can be layered on top of site selection and recruitment approaches to improve the representativeness of the sample.
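
A minimal sketch of how such composition limits might be enforced during recruitment, assuming survey-estimated population shares and a stream of candidate sites (all names and numbers below are illustrative, not from the paper):

```python
# Minimal sketch (not Olsen's implementation): cap the share of sites with a
# given characteristic, with caps derived from survey-estimated proportions.

def make_caps(survey_proportions, target_n, slack=0.10):
    """Cap each category at its survey-estimated share of the sample,
    plus a small slack to keep recruitment feasible."""
    return {cat: int(round(target_n * min(1.0, p + slack)))
            for cat, p in survey_proportions.items()}

def recruit(candidates, target_n, caps, category_of):
    """Accept candidate sites in order, skipping any whose category
    has already hit its cap."""
    counts, sample = {}, []
    for site in candidates:
        cat = category_of(site)
        if counts.get(cat, 0) >= caps.get(cat, target_n):
            continue  # this type of site would be overrepresented
        sample.append(site)
        counts[cat] = counts.get(cat, 0) + 1
        if len(sample) == target_n:
            break
    return sample

# Example: survey says 30% of districts are urban, 70% rural; recruit 20 sites.
caps = make_caps({"urban": 0.3, "rural": 0.7}, target_n=20)
sites = [{"id": i, "type": "urban" if i % 3 == 0 else "rural"} for i in range(100)]
sample = recruit(sites, 20, caps, category_of=lambda s: s["type"])
```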

Olsen, R. B., & Orr, L. L. (2016). On the “where” of social experiments: Selecting more representative samples to inform policy. New Directions for Evaluation, 2016(152), 61–71.

[No Abstract]

Roschelle, J., Tatar, D., Hedges, L., & Shechtman, N. (2010). Two Perspectives on the Generalizability of Lessons from Scaling Up SimCalc. Society for Research on Educational Effectiveness.

  • One purpose of educational research is to provide information about the likely impact of interventions or treatments on policy-relevant populations of students. Randomized experiments are useful for estimating the causal effects of interventions on the students in schools that participate in the experiments.
  • Unfortunately, the samples of schools and students participating in experiments are typically not probability (random) samples. Thus, even well-conducted experiments may not yield results that generalize to populations of interest. In the Scaling Up SimCalc experiments, one concern about the sample is that teachers were volunteers and potentially not representative of a broader teaching population. Although the volunteer teachers were randomly assigned to condition (reducing the chance that results were due to selection bias), the properties of the volunteer pool as a whole might limit generalizability to broader or differently selected populations.
  • A second concern is that, because pragmatic issues unrelated to sampling led to recruitment in regions with high proportions of Hispanic and Caucasian students and teachers, other groups of interest, such as African-American students and teachers, were underrepresented in the studies. In light of these and other concerns, this paper examines generalizability from two complementary perspectives. First, the authors have conducted detailed analyses of the characteristics of teachers and schools participating in the sample in comparison to others in the state in which the experiments took place. Second, they present findings from a novel statistical method developed to permit principled generalization from research samples to well-defined populations. The studies took place during the 2005-06 and 2006-07 school years in 115 middle schools throughout several geographic regions across the state of Texas.
  • This research has led the authors to propose the use of complementary approaches for examining generalizability. A foundational approach is to start with the best sampling procedure possible, striving for a broad and representative sample. The authors provided their recruiters with randomized lists of schools, but found that the actual schools contacted reflected a tension between random selection and convenience. They complemented this procedure with two additional analyses. The first found some ways in which their samples do not reflect the full diversity of Texas; in particular, the sample included no large urban districts and few African American participants. With regard to many other characteristics, however, their sample is not systematically different from the full population in the state of Texas. Their hypothesis is that the second analysis will predict positive effects for all populations of interest, but with wider confidence intervals for populations that were undersampled. Thus, they should have good confidence in how their results generalize to Hispanic schools but less confidence in how they generalize to African American schools. Overall, this affects how they share the results of their research with the practitioner community.

Tipton, E. (2014). Stratified Sampling Using Cluster Analysis: A Sample Selection Strategy for Improved Generalizations From Experiments. Evaluation Review, 38(2), 109.

  • An important question in the design of experiments is how to ensure that the findings from the experiment are generalizable to a larger population. This concern with generalizability is particularly important when treatment effects are heterogeneous and when selecting units into the experiment using random sampling is not possible – two conditions commonly met in large-scale educational experiments.
  • This article introduces a model-based balanced-sampling framework for improving generalizations, with a focus on developing methods that are robust to model misspecification. Additionally, the article provides a new method for sample selection within this framework: first, units in an inference population are divided into relatively homogeneous strata using cluster analysis, and then the sample is selected using distance rankings. In order to demonstrate and evaluate the method, a reanalysis of a completed experiment is conducted. This example compares samples selected using the new method with the actual sample used in the experiment. Results indicate that even under high nonresponse, balance is better on most covariates and fewer coverage errors result. The article concludes with a discussion of additional benefits and limitations of the method.
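
A rough sketch of the two-step idea described above, clustering population units into strata and then ranking units within each stratum by distance to the stratum center, using scikit-learn; the covariates, stratum count, and proportional allocation are illustrative assumptions:

```python
# Sketch of stratified selection via cluster analysis (after Tipton, 2014):
# 1) cluster population units on covariates into homogeneous strata,
# 2) within each stratum, rank units by distance to the stratum centroid
#    and recruit down the ranked list.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))           # covariates for 500 population units (toy data)
n_strata, n_sample = 5, 50

Z = StandardScaler().fit_transform(X)   # put covariates on a common scale
km = KMeans(n_clusters=n_strata, n_init=10, random_state=0).fit(Z)

sample_idx = []
for s in range(n_strata):
    members = np.flatnonzero(km.labels_ == s)
    # allocate the sample (approximately) proportionally to stratum size
    take = max(1, round(n_sample * len(members) / len(Z)))
    dist = np.linalg.norm(Z[members] - km.cluster_centers_[s], axis=1)
    ranked = members[np.argsort(dist)]  # most "typical" units first; entries
    sample_idx.extend(ranked[:take])    # past `take` serve as ranked replacements
```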

Tipton, E. (2021). Beyond generalization of the ATE: Designing randomized trials to understand treatment effect heterogeneity. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(2), 504–521.

  • Researchers conducting randomized trials have increasingly shifted focus from the average treatment effect to understanding moderators of treatment effects. Current methods for exploring moderation focus on model selection and hypothesis tests. At the same time, recent developments in the design of randomized trials have argued for the need for population-based recruitment in order to generalize well.
  • In this paper, we show that a different population-based recruitment strategy can be implemented to increase the precision of estimates of treatment effect moderators, and we explore the trade-offs between optimal designs for the average treatment effect and moderator effects.
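
A toy simulation of the underlying design intuition, that the precision of an interaction (moderator) estimate improves when recruitment widens the spread of the moderator in the sample; the model and numbers are assumptions for illustration, not the paper's design:

```python
# Toy check: the standard error of a moderator (interaction) estimate shrinks
# when recruitment increases the variance of the moderator X in the sample.
import numpy as np

rng = np.random.default_rng(4)

def mean_interaction_se(x, reps=500):
    """Average OLS standard error of the T*X coefficient over simulated trials."""
    n, ses = len(x), []
    for _ in range(reps):
        t = rng.integers(0, 2, n)                      # randomized treatment
        y = 0.2 * t + 0.3 * x + 0.1 * t * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), t, x, t * x])
        xtx_inv = np.linalg.inv(X.T @ X)
        beta = xtx_inv @ X.T @ y
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 4)
        ses.append(np.sqrt(sigma2 * xtx_inv[3, 3]))    # SE of the interaction
    return np.mean(ses)

narrow = rng.normal(0, 0.5, 200)   # convenience-style sample: compressed X
spread = rng.normal(0, 1.5, 200)   # recruitment that targets the tails of X
print(mean_interaction_se(narrow), mean_interaction_se(spread))
```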

Tipton, E. (2022). Sample selection in randomized trials with multiple target populations. American Journal of Evaluation, 1098214020927787.

  • Practitioners and policymakers often want estimates of the effect of an intervention for their local community (e.g., region, state, or county). Ideally, these multiple population average treatment effect (ATE) estimates would be considered in the design of a single randomized trial. To date, however, methods for sample selection aimed at generalizing the sample ATE focus only on the case of a single target population.
  • In this paper, I provide a framework for sample selection in the multiple population case, including three compromise allocations. I situate the methods in an example and conclude with a discussion of the implications for the design of randomized evaluations more generally.

Tipton, E., Fellers, L., Caverly, S., Vaden-Kiernan, M., Borman, G., Sullivan, K., & Ruiz de Castilla, V. (2016). Site Selection in Experiments: An Assessment of Site Recruitment and Generalizability in Two Scale-Up Studies. Journal of Research on Educational Effectiveness, 9, 209–228.

  • Recently, statisticians have begun developing methods to improve the generalizability of results from large-scale experiments in education. This work has included the development of methods for improved site selection when random sampling is infeasible, including the use of stratification and targeted recruitment strategies. This article provides the next step in this literature: a template for assessing generalizability after a study is completed.
  • In this template, first, records from the recruitment process are analyzed, comparing those who agreed to be in the study with those who did not. Second, the final sample is compared to the original inference population and to different possible subsets, with the goal of determining where the results generalize best (and where they do not). Throughout, these methods are situated in the post hoc analysis of results from two scale-up studies. The article ends with a discussion of the use of these methods more generally when reporting results from randomized trials.
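
A compact illustration of the template's two comparisons, using standardized mean differences on toy data (this is a hedged sketch, not the authors' exact procedure):

```python
# Toy version of the two comparisons in the template: (1) sites that agreed
# vs. sites that declined, and (2) the final sample vs. the inference
# population, both summarized with standardized mean differences (SMDs).
import numpy as np

def smd(a, b):
    """Mean difference on each covariate, scaled by b's standard deviation."""
    return (a.mean(axis=0) - b.mean(axis=0)) / b.std(axis=0)

rng = np.random.default_rng(1)
population = rng.normal(size=(2000, 3))                  # toy covariate data
contacted = population[rng.choice(2000, 120, replace=False)]
agreed, declined = contacted[:40], contacted[40:]        # toy recruitment record

print(np.round(smd(agreed, declined), 2))      # comparison 1: who said yes?
print(np.round(smd(agreed, population), 2))    # comparison 2: generalizability
```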

Tipton, E., Hallberg, K., Hedges, L. V., & Chan, W. (2017). Implications of Small Samples for Generalization: Adjustments and Rules of Thumb. Evaluation Review, 41(5), 472–505.

  • Background: Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE).
  • Method: Several methods currently exist for assessing the similarity between a sample and a population, as well as for estimating the PATE. In this article, we investigate the properties of six of these methods and statistics at the small sample sizes common in education research (i.e., 10–70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case.
  • Result: In small random samples, large differences between the sample and population can arise simply by chance, and many of the statistics commonly used in generalization are functions of both the sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization.
  • Conclusion: This article implies that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.
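
The core result, that small random samples routinely show large sample-population differences by chance, can be seen in a short simulation (a toy illustration, not the authors' analysis):

```python
# Toy simulation: even with true random sampling, the largest absolute
# standardized mean difference (SMD) across covariates is often large
# when only 10-70 sites are sampled, and grows with the number of covariates.
import numpy as np

rng = np.random.default_rng(2)
n_pop, n_cov = 10_000, 8
population = rng.normal(size=(n_pop, n_cov))

for n in (10, 30, 70):
    exceed = 0
    for _ in range(2000):
        sample = population[rng.choice(n_pop, n, replace=False)]
        smd = (sample.mean(0) - population.mean(0)) / population.std(0)
        if np.abs(smd).max() > 0.25:    # a common observational rule of thumb
            exceed += 1
    print(f"n={n}: P(max |SMD| > 0.25) ~ {exceed / 2000:.2f}")
```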

Tipton, E., Hedges, L., Vaden-Kiernan, M., Borman, G., Sullivan, K., & Caverly, S. (2014). Sample selection in randomized experiments: A new method using propensity score stratified sampling. Journal of Research on Educational Effectiveness, 7(1), 114–135.

  • Randomized experiments are often seen as the “gold standard” for causal research. Despite the fact that experiments use random assignment to treatment conditions, units are seldom selected into the experiment using probability sampling. Very little research on experimental design has focused on how to make generalizations to well-defined populations or on how units should be selected into an experiment to facilitate generalization.
  • This article addresses the problem of sample selection in experiments by providing a method for selecting the sample so that the population and sample are similar in composition. The method begins by requiring that the inference population and eligibility criteria for the study are well defined before study recruitment begins. When the inference population and the population of eligible units differ, the article provides a method for sample recruitment based on stratified selection on a propensity score. The article situates the problem within the example of how to select districts for two scale-up experiments currently in recruitment.
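
A compressed sketch of the recruitment logic on toy data: fit a score distinguishing the inference population from the eligible pool, cut it into strata, and recruit eligible units within each stratum in proportion to the population (details such as the number of strata are illustrative assumptions):

```python
# Sketch (after Tipton et al., 2014): stratified recruitment on a score that
# models membership in the inference population vs. the eligible pool.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_pop = rng.normal(0.0, 1.0, size=(1000, 3))   # inference population (toy)
X_elig = rng.normal(0.3, 1.0, size=(300, 3))   # eligible units, slightly shifted

X = np.vstack([X_pop, X_elig])
y = np.r_[np.ones(len(X_pop)), np.zeros(len(X_elig))]  # 1 = population
score = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Stratify on quintiles of the score, with cut points defined on the population
edges = np.quantile(score[: len(X_pop)], [0.2, 0.4, 0.6, 0.8])
pop_stratum = np.digitize(score[: len(X_pop)], edges)
elig_stratum = np.digitize(score[len(X_pop):], edges)

n_sample, selected = 30, []
for s in range(5):
    share = np.mean(pop_stratum == s)           # population share of stratum s
    take = round(n_sample * share)              # recruit this many eligible units
    pool = np.flatnonzero(elig_stratum == s)
    selected.extend(rng.choice(pool, min(take, len(pool)), replace=False))
```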

Tipton, E., & Matlen, B. J. (2019). Improved generalizability through improved recruitment: Lessons learned from a large-scale randomized trial. American Journal of Evaluation, 40(3), 414–430.

  • Randomized controlled trials (RCTs) have long been considered the “gold standard” for evaluating the impacts of interventions. However, in most education RCTs, the sample of schools included is recruited based on convenience, potentially compromising a study’s ability to generalize to an intended population.
  • An alternative approach is to recruit schools using a stratified recruitment method developed by Tipton. Until now, however, there has been limited information available about how to implement this approach in the field.
  • In this article, we concretely illustrate each step of the stratified recruitment method in an evaluation of a college-level developmental algebra intervention. We reflect on the implementation of this process and conclude with five on-the-ground lessons regarding how to best implement this recruitment method in future studies.

Tipton, E., & Peck, L. R. (2017). A design-based approach to improve external validity in welfare policy evaluations. Evaluation Review, 41(4), 326–356.

  • Background: Large-scale randomized experiments are important for determining how policy interventions change average outcomes. Researchers have begun developing methods to improve the external validity of these experiments. One new approach is a balanced sampling method for site selection, which does not require random sampling and takes into account the practicalities of site recruitment including high nonresponse.
  • Method: The goal of balanced sampling is to develop a strategic sample selection plan that results in a sample compositionally similar to a well-defined inference population. To do so, a population frame is created and then divided into strata, which “focuses” recruiters on specific subpopulations. Units within these strata are then ranked, identifying similar “replacement” sites that can be recruited when an ideal site refuses to participate in the experiment.
  • Result: In this article, we consider how a balanced-sampling strategic site selection method might be implemented in a welfare policy evaluation.
  • Conclusion: We find that simply developing a population frame can be challenging, with three possible and reasonable options arising in the welfare policy arena. Using relevant study-specific contextual variables, we craft a recruitment plan that considers nonresponse.
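
The ranked-replacement logic is the distinctive piece: within each stratum, an ordered list of similar sites lets recruiters move to the next-best site after a refusal. A minimal sketch, with the refusal model and quotas invented for illustration:

```python
# Sketch of nonresponse-aware recruitment from ranked within-stratum lists.
import random

def recruit_with_replacements(ranked_lists, quotas, agrees):
    """ranked_lists: {stratum: [site, ...]}, best-match first.
    quotas: {stratum: number of sites needed}.
    agrees(site) -> bool models site-level nonresponse."""
    recruited = {}
    for stratum, sites in ranked_lists.items():
        got = []
        for site in sites:                  # walk down the replacement list
            if len(got) == quotas[stratum]:
                break
            if agrees(site):                # refusal -> try next-ranked site
                got.append(site)
        recruited[stratum] = got
    return recruited

# Example with a 50% refusal rate (all values illustrative)
random.seed(0)
lists = {"A": list(range(10)), "B": list(range(10, 20))}
print(recruit_with_replacements(lists, {"A": 3, "B": 3},
                                lambda s: random.random() < 0.5))
```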

Tipton, E., Yeager, D. S., Iachan, R., & Schneider, B. (2019). Designing probability samples to study treatment effect heterogeneity. In Experimental Methods in Survey Research: Techniques That Combine Random Sampling with Random Assignment (pp. 435–456).

  • This chapter explains a new approach that survey samplers can use when designing probability samples for survey experiments where there is a possibility of treatment heterogeneity. It begins by explaining why probability samples are preferred to nonprobability samples for estimating two quantities (or estimands): population average treatment effects and treatment effects within subgroups. The chapter furthermore explains why typical probability sampling methods that optimize statistical power for the average effect in a population do not necessarily optimize statistical power for the subgroup effects of interest – especially when one’s interest is in estimating effects within a rare subgroup. Next, it explains why even large, well-constructed, highly representative probability samples with randomized treatments can produce confounded analyses of differences across subgroups.
  • The chapter illustrates the proposed approach using an empirical case study of a survey-administered behavioral science intervention: The US National Study of Learning Mindsets.
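
A back-of-the-envelope illustration of the power point: under proportional allocation a rare subgroup contributes few units, so its effect estimate is far noisier than the overall average (the shares and variance below are assumed, not from the chapter):

```python
# Back-of-the-envelope: standard error of a subgroup's treatment effect under
# proportional vs. oversampled allocation (two equal arms, outcome variance 1).
import math

def se_effect(n_subgroup):
    """SE of a difference in means with n_subgroup/2 units per arm."""
    per_arm = n_subgroup / 2
    return math.sqrt(1 / per_arm + 1 / per_arm)

n_total, rare_share = 1000, 0.05
print(se_effect(n_total * rare_share))   # proportional: 50 rare-group units
print(se_effect(n_total * 0.25))         # oversampled: 250 units, SE under half
```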