
ESRA 2023 Glance Program

All time references are in CEST

Approximating Probability Samples in the Absence of Sampling Frames 3

Session Organisers Dr Carina Cornesse (German Institute for Economic Research)
Dr Mariel McKone Leonard (DeZIM Institute)
Time Thursday 20 July, 14:00 - 15:30
Room U6-22

Research shows that survey samples should be constructed using probability sampling approaches to allow valid inference to the intended target population. However, for many populations of interest high-quality probability sampling frames do not exist. This is particularly true for marginalized and hidden populations, including ethnic, religious, and sexual minorities. In the absence of sampling frames, researchers are faced with the choice to discard their research questions or to try to draw inferences from nonprobability and other less conventional samples.

For the latter, both model-based and design-based solutions have been proposed in recent years. This session focuses on data collection techniques designed to result in samples that approximate probability samples. We also invite proposals on techniques for approximating probability samples using already collected nonprobability sample data as well as by combining probability and nonprobability sample data for drawing inferences. The session scope covers but is not limited to research on hard-to-reach and hard-to-survey populations. We are particularly interested in methodological research on techniques such as:

- Respondent-driven sampling (RDS) & other network sampling techniques
- Quasi-experimental research designs
- Weighting approaches for nonprobability data (especially those that make use of probability sample reference survey data)
- Techniques for combining probability and nonprobability samples (e.g. blended calibration)

Keywords: nonprobability sample, respondent-driven sampling, blended calibration, weighting, data integration


Enhancing Model-Based Adjustments of Nonprobability Surveys: Selecting Auxiliary Variables Based on Theoretical Assumptions about their Association with Survey Participation and Variables of Interest

Ms Hannah Bucher (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Dr Joss Roßmann (GESIS - Leibniz Institute for the Social Sciences)

Nonprobability surveys have become increasingly popular for social science research based on observational data. Since nonprobability samples are based on a non-random (self-)selection of respondents, we need powerful model-based adjustments to reduce selection bias in surveys based on nonprobability samples. However, the auxiliary variables used for adjustments are frequently limited to a small subset of sociodemographic variables. Recent research suggests these variables hardly help minimize bias in nonprobability surveys. One proposed explanation for the often-poor performance of survey adjustments is the weak correlation between survey participation, variables of interest, and the auxiliary variables used for adjustments.

In our study, we argue that survey researchers should select and include questions on auxiliary variables in their surveys based on theoretical assumptions about the links to survey participation and substantive variables of interest. We conducted two (preregistered) studies with data from surveys on political attitudes and behavior in Germany. The respondents for these surveys were selected from a nonprobability opt-in online panel using quotas on sex, age, and education. We included auxiliary variables from different domains for which external benchmark data were available to compute adjustment weights that - from a theoretical perspective - are likely correlated with both 1) substantive political attitudes and behaviors and 2) survey participation. Comparing estimates of electoral behaviors to official statistics on the election outcome across different adjustments, we found that the approach is promising for reducing selection biases in nonprobability surveys.
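The abstract describes computing adjustment weights that align a nonprobability sample with external benchmark data on theoretically chosen auxiliary variables. As a rough illustration of that general idea (not the authors' actual adjustment procedure), the following sketch rakes unit weights in a toy, deliberately skewed sample to two hypothetical benchmark margins; all variables and population shares are made up:

```python
import numpy as np

def rake_weights(sample, margins, max_iter=50, tol=1e-10):
    """Iterative proportional fitting (raking): adjust unit weights so
    weighted sample margins match external benchmark shares.

    sample  -- dict of {variable: array of category codes per respondent}
    margins -- dict of {variable: {category: population share}}
    """
    n = len(next(iter(sample.values())))
    w = np.ones(n) / n
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in margins.items():
            codes = sample[var]
            for cat, share in target.items():
                mask = codes == cat
                current = w[mask].sum()
                if current > 0:
                    factor = share / current
                    w[mask] *= factor
                    max_change = max(max_change, abs(factor - 1))
        if max_change < tol:
            break
    return w / w.sum()

# Toy nonprobability sample skewed towards young, highly educated respondents
rng = np.random.default_rng(0)
sample = {
    "age": rng.choice(["young", "old"], size=1000, p=[0.7, 0.3]),
    "edu": rng.choice(["low", "high"], size=1000, p=[0.3, 0.7]),
}
margins = {"age": {"young": 0.5, "old": 0.5},
           "edu": {"low": 0.6, "high": 0.4}}
w = rake_weights(sample, margins)
print(round(w[sample["age"] == "young"].sum(), 3))  # → 0.5
```

Raking only corrects for variables included in the margins, which is exactly why the abstract argues for choosing auxiliary variables that are theoretically linked to both participation and the substantive outcomes.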

The Potential of Respondent-Driven Sampling (RDS) in Survey Practice: Who is Willing to Recruit?

Dr Carina Cornesse (German Institute for Economic Research) - Presenting Author
Dr Jean-Yves Gerlitz (University of Bremen)
Professor Olaf Groh-Samberg (University of Bremen)
Professor Sabine Zinn (German Institute for Economic Research)

RDS is a popular network sampling strategy with many advantages. Its general idea is that researchers select a set of survey respondents (the so-called “seeds”) and encourage them to recruit some of their social network members to the survey, who in turn recruit some of their network members, and so on (the so-called “referral chain”). The desired result is a large and diverse dataset in which each person can be traced back to their initial seed.
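The seed-and-referral-chain mechanism described above can be sketched in a toy simulation (coupon count, recruitment probability, and wave limit are hypothetical parameters, purely for illustration):

```python
import random

def simulate_rds(seeds, coupons=3, recruit_prob=0.5, max_waves=5, rng=None):
    """Toy simulation of RDS referral chains: each respondent receives
    `coupons` invitations for network members; each invitation converts
    into a new respondent with probability `recruit_prob`.

    Returns a list of (respondent_id, seed_id, wave) tuples, so every
    respondent can be traced back to the seed that started their chain.
    """
    rng = rng or random.Random(42)
    respondents = [(s, s, 0) for s in seeds]  # seeds form wave 0
    frontier = list(respondents)
    next_id = max(seeds) + 1
    for wave in range(1, max_waves + 1):
        new_frontier = []
        for _rid, seed, _w in frontier:
            for _ in range(coupons):
                if rng.random() < recruit_prob:
                    rec = (next_id, seed, wave)
                    respondents.append(rec)
                    new_frontier.append(rec)
                    next_id += 1
        frontier = new_frontier
        if not frontier:  # chain dies out when nobody recruits
            break
    return respondents

sample = simulate_rds(seeds=[0, 1, 2])
print(len(sample), "respondents across",
      max(w for _, _, w in sample) + 1, "waves")
```

Setting `recruit_prob` low in this sketch reproduces the failure mode discussed below: chains die out before the sample grows large or diverse enough.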

In theory, RDS has the capability of generating samples which allow valid inference to a researcher’s population of interest without requiring traditional sampling frames. In practice, however, the methodology often fails to generate large and diverse enough datasets to be able to do so. One common challenge is that many selected seeds do not even start recruiting network members, so that referral chains are not initiated. This has particularly been observed for research contexts in which seeds are selected from probability sample surveys and/or where the goal is to draw inferences to a broad and heterogeneous population.

Our study explores the potential of RDS procedures based on general population probability-based survey samples. Questions we address include: To what extent are survey respondents willing to recruit their social network members? And how do survey respondents who are willing to do so differ from everyone else? To answer these questions, we selected a random subset of respondents from the newly established German Social Cohesion Panel (n≈1,600; stratified by age) and asked them for their hypothetical willingness to recruit 3 members of their social network for a survey. Preliminary analyses suggest that for some general population subgroups (e.g. younger adults) RDS may be quite promising while other subgroups (e.g. older adults) are either undecided or hesitant to engage in RDS.

Integrating probability and non-probability samples to improve analytic inference and reduce costs

Ms Camilla Salvatore (Utrecht University) - Presenting Author
Ms Silvia Biffignandi (CESS)
Mr Joseph Sakshaug (German Institute for Employment Research (IAB))
Mr Arkadiusz Wiśniowski (University of Manchester)
Ms Bella Struminskaya (Utrecht University)

Probability sample surveys, which are the gold standard for population inference, are facing difficulties due to declining response rates and related increasing costs. Fielding large size probability samples can be cost prohibitive for many survey researchers and study sponsors. Thus, moving towards less expensive, but potentially biased, non-probability sample surveys is becoming a more common practice.
In this paper, we propose a novel methodology to combine both samples in a manner that overcomes their respective disadvantages while also reducing survey costs. The focus of our work is on analytic inference, which is a topic rarely addressed in the literature. In particular, we present a novel Bayesian approach to integrate a small probability sample with a larger online non-probability sample (possibly affected by selection bias) to improve inferences about logistic regression coefficients and reduce survey costs.
In this approach, inference is based on a (small) probability sample survey, and supplementary auxiliary information from a parallel non-probability sample survey enters naturally through the prior structure. We evaluate the performance of several strongly informative priors constructed from the non-probability sample information through a simulation study and a real-data application. We show that the proposed method reduces the mean-squared error of regression coefficients relative to analyzing the small probability sample alone. Through the analysis of real data, we show that cost savings are possible. This work is supplemented by a Shiny web app in which interactive cost analyses can be performed as guidance for survey practitioners who are interested in applying the method.
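The Bayesian logistic-regression machinery described in the abstract is beyond a short sketch, but the core intuition of feeding non-probability information in through a prior can be illustrated with a much-simplified, hypothetical normal-approximation analogue (this is not the authors' method; all estimates and standard errors below are invented):

```python
import numpy as np

def precision_weighted_posterior(theta_prob, se_prob, theta_np, se_np,
                                 discount=1.0):
    """Normal-approximation sketch: combine a noisy probability-sample
    estimate with an informative prior built from a non-probability sample.

    The prior variance is inflated by `discount` (>1 weakens the prior),
    mimicking how the strength of an informative prior can be tuned.
    Returns the posterior mean and standard error.
    """
    prec_prior = 1.0 / (discount * se_np ** 2)
    prec_like = 1.0 / se_prob ** 2
    post_var = 1.0 / (prec_prior + prec_like)
    post_mean = post_var * (prec_prior * theta_np + prec_like * theta_prob)
    return post_mean, np.sqrt(post_var)

# Small probability sample: unbiased but noisy; non-probability sample:
# precise but possibly biased (hypothetical numbers).
mean, se = precision_weighted_posterior(theta_prob=0.80, se_prob=0.20,
                                        theta_np=0.60, se_np=0.05)
print(round(mean, 3), round(se, 3))  # → 0.612 0.049
```

The posterior standard error is smaller than either input alone, which is the source of the mean-squared-error gains the abstract reports, at the price of pulling the estimate towards the possibly biased non-probability value.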

Statistical inference with non-probability survey samples with misclassification in all variables

Dr Maciej Beręsewicz (Poznań University of Economics and Business / Statistical Office in Poznań) - Presenting Author
Mr Łukasz Chrostowski (Adam Mickiewicz University, Poznań)

Non-probability samples (including big data sources) have become a hot topic in survey statistics. Researchers propose data integration methods such as propensity score weighting (PS), mass imputation (MI) or doubly robust estimators based on PS and MI. However, the majority of these studies assume that variables from non-probability samples or big data sources are measured without error. This is often not true; for instance, online job offers do not contain occupation codes (e.g. ESCO/ISCO), and background information about smartphone users is unknown, so profiling methods are applied.

To tackle this problem, we propose new estimators assuming that all variables from non-statistical data sources are misclassified. We consider two cases: 1) a validation study is available at the unit level, and 2) only estimates of misclassification matrices are available for each variable separately. Our approach is based on multiple imputation, the MCSIMEX approach and other methods proposed in the literature on validation studies. We compare our estimators with existing ones in a simulation study and a real-life application presented below.
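Multiple imputation and MCSIMEX are beyond a short example, but the role of a misclassification matrix estimated from a validation study can be illustrated with a minimal back-correction sketch (categories, accuracy rates, and counts below are hypothetical):

```python
import numpy as np

def correct_misclassification(observed_counts, M):
    """Back-correct a classified frequency distribution using an estimated
    misclassification matrix M, where M[i, j] = P(classified as i | true j).
    Solves M @ p_true = p_observed for the true category shares.
    """
    p_obs = observed_counts / observed_counts.sum()
    p_true = np.linalg.solve(M, p_obs)
    p_true = np.clip(p_true, 0, None)  # guard against small negatives
    return p_true / p_true.sum()

# Two categories, e.g. "short stay" vs "long stay" assigned by an ML
# profiler; the validation study estimates 90% / 80% classification accuracy.
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])
observed = np.array([620.0, 380.0])
print(np.round(correct_misclassification(observed, M), 3))  # → [0.6 0.4]
```

In practice the misclassification matrix is itself estimated with error, which is one reason the abstract turns to simulation-extrapolation (MCSIMEX) and multiple imputation rather than plain inversion.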

We utilize mobile big data sources to measure the length of stay of foreigners in Poland. The dataset used in the study contains data from over 30 million smartphone users in Poland (almost complete coverage) collected from advertising systems by the Selectivv company. In contrast to Call Detail Records (CDR) or signaling data from mobile network providers, Selectivv data contain socio-demographic information about phone users. However, background information regarding sex, age, nationality, or length of stay is derived by applying rule-based or machine learning (ML) algorithms, which introduces measurement error. We conducted a validation study using a random sample survey of 500 smartphone users to correct misclassification errors and then correct for selection/coverage error.