ESRA 2019 Programme at a Glance

Predictive Modeling and Machine Learning in Survey Research 3

Session Organisers: Mr Christoph Kern (University of Mannheim)
Mr Ruben Bach (University of Mannheim)
Mr Malte Schierholz (University of Mannheim, Institute for Employment Research (IAB))
Time: Thursday 18th July, 16:00 - 17:30
Room D24

Advances in the field of machine learning have created an array of flexible methods for exploring and analyzing diverse data. These methods often do not require prior knowledge about the functional form of the relationship between the outcome and its predictors, and they focus specifically on prediction performance. Machine learning tools thereby offer promising advantages for survey researchers to tackle emerging challenges in data collection and analysis, and also open up new research perspectives.

On the one hand, utilizing new forms of data gathering, e.g. via mobile web surveys, sensors or apps, often results in (para)data structures that might be difficult to handle -- or fully utilize -- with traditional modeling methods. This might also be the case for data from other sources such as panel studies, in which the wealth of information that accumulates over time induces challenging modeling tasks. In such situations, data-driven methods can help to extract recurring patterns, detect distinct subgroups or explore non-linear and non-additive effects.
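To make the subgroup-detection idea concrete, here is a minimal, purely illustrative sketch: k-means clustering applied to synthetic paradata. The two features (response time, item nonresponse count), the group structure, and the choice of k are invented for the example and are not drawn from any study in this session.

```python
# Illustrative only: detecting respondent subgroups in synthetic paradata
# with k-means. All features and parameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Two synthetic respondent groups: fast, steady responders vs. slow, erratic ones
fast = rng.normal(loc=[20.0, 2.0], scale=[5.0, 1.0], size=(300, 2))
slow = rng.normal(loc=[60.0, 10.0], scale=[10.0, 3.0], size=(300, 2))
X = np.vstack([fast, slow])  # columns: response time (s), item nonresponse count

# Cluster into k=2 subgroups; in practice k would be chosen by diagnostics
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

In a real application the number of clusters and the feature set would of course be substantive choices, not fixed in advance as here.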

On the other hand, techniques from the field of supervised learning can be used to inform or support the data collection process itself. In this context, various sources of survey errors may be thought of as constituting prediction problems which can be used to develop targeted interventions. This includes e.g. predicting noncontact, nonresponse or break-offs in surveys to inform adaptive designs that aim to prevent these outcomes. Machine learning provides suitable tools for building such prediction models.
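As a hedged illustration of framing a survey error as a prediction problem, the following sketch trains a classifier on synthetic paradata to score nonresponse risk; the features, coefficients, and intervention threshold are all invented for the example.

```python
# Hypothetical sketch: predicting nonresponse risk from paradata to
# inform an adaptive design. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic paradata: prior contact attempts, days since last wave, urbanicity
X = np.column_stack([
    rng.poisson(3, n),          # contact attempts
    rng.integers(30, 400, n),   # days since last interview
    rng.integers(0, 2, n),      # urban indicator
])
# Synthetic outcome: nonresponse more likely with many attempts and long gaps
logit = -2.0 + 0.3 * X[:, 0] + 0.004 * X[:, 1] + 0.5 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Predicted nonresponse propensities; cases above a threshold could be
# flagged for targeted interventions (e.g. incentives, mode switch)
risk = clf.predict_proba(X_te)[:, 1]
flagged = risk > 0.5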

This session welcomes contributions that utilize machine learning methods in the context of survey research. The aim of the session is to showcase the potential of machine learning techniques as a complement and extension to the survey researchers' toolkit in an era of new data sources and challenges for survey science.

Keywords: machine learning, predictive models, data science

Machine Learning in Data Analysis for Social Research

Dr Arne Bethmann (Max-Planck-Institute for Social Law and Social Policy) - Presenting Author
Dr Jonas Beste (Institute for Employment Research)
Mr Giuseppe Casalicchio (Ludwig-Maximilians-Universität München)
Professor Bernd Bischl (Ludwig-Maximilians-Universität München)

With the increasing availability of both new large-scale data sources and computational power, specific algorithms for analyzing these data have seen widespread use and many refinements. Major developments in this area stem from computer science and computational statistics and are known as machine/statistical learning approaches. While these algorithms are widely used for predictive purposes, their application to statistical inference remains rather scarce. Compared to traditional modelling approaches in the social sciences, especially (generalized) regression models, machine learning algorithms often allow for more flexible modelling strategies. Depending on the specific algorithm, there are additional desirable features, for example automatic variable selection or implicit specification of interaction effects.

Research questions in the social sciences usually aim at population inference for effects of certain variables of interest rather than at prediction. Therefore, a way to interpret the resulting machine learning models has to be developed if they are to be applied in social research. We use an approach that adapts the idea of average marginal effects in order to estimate marginal and partial associations between response and predictors from a variety of classical machine learning algorithms. Whether these can be used to draw valid statistical or even causal inferences is debatable and depends strongly on the specific data structure, modelling approach, and substantive research question at hand.

The core estimation functionality has been implemented as an R package mimicking the features of Stata's margins command, but using trained learners, e.g. from the extensive "Machine Learning in R" (mlr) package, as the basis for the underlying predictions. Confidence intervals are provided via bootstrapping. While the package is still under development, we will showcase its current functionality and provide example analyses for social research questions.
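The authors' package is in R; as a rough illustration of the underlying idea (not their implementation), the following Python sketch computes an average marginal effect from a trained learner via finite differences of its predictions, with a bootstrap confidence interval. The data, learner, and step size are assumptions; a coarse step is used because tree ensembles are piecewise constant.

```python
# Illustrative sketch: an "average marginal effect" for a trained learner,
# computed as a finite difference of predictions, with a bootstrap CI.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

def ame(model, X, j, eps=0.25):
    """Average marginal effect of feature j via central differences.

    Tree ensembles are piecewise constant, hence the coarse step eps."""
    X_hi, X_lo = X.copy(), X.copy()
    X_hi[:, j] += eps
    X_lo[:, j] -= eps
    return np.mean((model.predict(X_hi) - model.predict(X_lo)) / (2 * eps))

# Bootstrap confidence interval for the AME of the first predictor
# (its true marginal effect in the simulated data is 2.0)
boots = []
for _ in range(30):
    idx = rng.integers(0, n, n)
    m = GradientBoostingRegressor(random_state=0).fit(X[idx], y[idx])
    boots.append(ame(m, X[idx], j=0))
lo, hi = np.percentile(boots, [2.5, 97.5])
```

The finite-difference step, the learner, and the bootstrap size would all need care in a real analysis; this sketch only conveys the mechanics.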

Automated Model Selection for Imputing Missing Values in High-Dimensional (Survey) Data

Mr Micha Fischer (University of Michigan) - Presenting Author

Multiple sequential imputation is often used to impute missing values in data sets. The procedure leads to unbiased results if the missing data are missing at random and the models are correctly specified. However, in data sets where many variables are affected by missing values, properly specifying the sequential regression models can be burdensome and time-consuming, as a separate model needs to be developed by a human imputer for each variable. Even available software packages for automated imputation (e.g. MICE, IVEware) require model specifications for each variable containing missing values. Additionally, their default models can lead to biased imputed values, for example when variables are non-normally distributed.
This research aims to automate the process of sequentially imputing missing values in high-dimensional data sets consisting of potentially non-normally distributed variables with potentially complex and non-linear interactions. The proposed algorithm modifies the sequential imputation procedure. First, model specification is performed via an automated variable selection procedure (e.g. adaptive LASSO, elastic net). Second, model selection is carried out from a pool of several supervised learning techniques (parametric and non-parametric models) in each step of the sequential imputation procedure. The selected imputation model for an outcome variable is the one achieving the highest similarity between imputed and observed values after conditioning on the response propensity score for that variable.
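A minimal sketch of this kind of loop follows, under illustrative assumptions: synthetic data, two candidate learners, and cross-validated fit as the selection criterion in place of the propensity-conditioned similarity measure described in the abstract.

```python
# Illustrative sketch: one step of a sequential imputation procedure with
# automated model selection per variable. Data and candidates are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# x3 depends on a non-linear interaction, a case where defaults can mislead
df["x3"] = df["x1"] * df["x2"] + rng.normal(scale=0.3, size=n)
# Introduce missingness in x3
miss = rng.random(n) < 0.2
df.loc[miss, "x3"] = np.nan

obs = df["x3"].notna()
X_obs = df.loc[obs, ["x1", "x2"]].to_numpy()
y_obs = df.loc[obs, "x3"].to_numpy()

# Model selection: pick the candidate with the best cross-validated fit
candidates = {
    "lasso": LassoCV(cv=5),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X_obs, y_obs, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)

# Impute the missing x3 values with the selected model
model = candidates[best].fit(X_obs, y_obs)
df.loc[~obs, "x3"] = model.predict(df.loc[~obs, ["x1", "x2"]].to_numpy())
```

In the full procedure this selection would be repeated for every variable with missing values, cycling until convergence, and would use the propensity-based similarity criterion rather than plain cross-validation.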
The evaluation of the proposed method takes place in two ways: a simulation study investigates in which situations the automated procedure can outperform the usual approaches (MICE, IVEware, a human imputer). Additionally, the method is assessed on survey data linked to administrative records by comparing results from imputed survey data with gold-standard estimates from complete administrative data sources.

Using Machine Learning Algorithms to Refine Techniques for Volunteering Bias Treatment on Addiction Surveys

Mr Ramón Ferri-García (University of Granada) - Presenting Author
Professor María del Mar Rueda (University of Granada)
Dr Francisca López-Torrecillas (University of Granada)

The development of new survey data collection methods, such as online surveys, has been particularly advantageous for addiction studies. As these studies usually focus on sensitive topics and require sampling hard-to-reach individuals, the low costs and immediacy of online surveys offer critical advantages over traditional data collection methods. However, online surveys are deeply affected by non-observational bias, especially coverage and volunteering bias, and as a result the estimates they provide may be unreliable. Calibration (Deville and Särndal, 1992) and Propensity Score Adjustment, or PSA (Rosenbaum and Rubin, 1983; Lee, 2006), have been proposed as methods to remove both sources of bias in non-probabilistic surveys, although they suffer from several drawbacks. Calibration often requires population totals to be known for the auxiliary variables used in the procedure, while PSA estimates the volunteering propensity of an individual using logistic regression, which may be troublesome in the online survey context due to the high dimensionality of the data.

This study presents an application of volunteering bias correction techniques in addiction studies, using a volunteer online sample and a probabilistic reference sample drawn from a population of university students in Granada (Spain). The volunteer sample is used to estimate mobile phone addiction (nomophobia) patterns for the entire target population, adjusting it to remove volunteering bias with PSA and subsequent calibration. PSA is applied with machine learning classification algorithms as an alternative to logistic regression, and the variables to be included in the models are selected using feature selection algorithms, which likewise serve as an alternative to stepwise selection for logistic regression. Raking calibration is performed using known population totals. The resulting procedure provided more reliable estimates of nomophobia patterns for the university student population.
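The PSA step can be sketched as follows, with synthetic data and an arbitrary choice of classifier; this illustrates the general technique (stack the volunteer and reference samples, predict sample membership, reweight), not the authors' exact procedure.

```python
# Hedged sketch of Propensity Score Adjustment with an ML classifier.
# Data, the classifier, and the weighting form are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n_vol, n_ref = 800, 800
# Volunteers over-represent heavy phone users (feature 0)
X_vol = rng.normal(loc=[1.0, 0.0], size=(n_vol, 2))
X_ref = rng.normal(loc=[0.0, 0.0], size=(n_ref, 2))
X = np.vstack([X_vol, X_ref])
z = np.r_[np.ones(n_vol), np.zeros(n_ref)]  # 1 = volunteer sample

# Classifier predicts membership in the volunteer sample
clf = GradientBoostingClassifier(random_state=0).fit(X, z)
p = clf.predict_proba(X_vol)[:, 1]  # estimated volunteering propensity
w = (1 - p) / p                     # inverse-propensity weight

# Weighted mean of a volunteer-sample outcome, e.g. a nomophobia score
y_vol = 0.5 * X_vol[:, 0] + rng.normal(scale=0.2, size=n_vol)
est = np.average(y_vol, weights=w)
```

In the study itself these weights would additionally be raked to known population totals, and the classifier and feature set would be chosen by the selection algorithms described above.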

Can Gender Role Items Improve the Prediction of Income? Insight from Machine Learning

Miss Klara Raiber (University of Mannheim) - Presenting Author

An appropriate measurement of gender is lacking in social survey research, since almost all surveys measure sex (biological attributes) rather than gender (a social construct). This is of particular importance for one major outcome of gender inequality, the gender pay gap: women earn less than men even with the same set of qualifications. If we want to understand how this gap comes about, we should be careful about how we operationalize these concepts. I argue that sex is an inadequate proxy for gender because sex is a binary categorization on biological grounds, whereas gender is a non-binary expression of a social process. Additional variables that capture this process should be used when predicting income. Scholars suggest that gender role items can explain some of the variation in income (Grove, 2011; Stickney & Konrad, 2007). I investigate whether gender role items improve the understanding of gender within quantitative research and whether they are a better proxy for gender than sex is.

Applying random forests, a major machine learning algorithm, to ISSP (2012) data for 44 countries (N=40,051), I evaluate the impact of gender roles as predictors of personal income. Based on prediction performance and variable importance, I compare the outcomes of four models: an empty model with only controls; a model including the sex variable and controls; a model with gender role variables and controls; and finally, a full model with all variables. The results indicate that including gender roles does not increase the predictive power of the models. Sex, compared to gender roles, is more important in the prediction of income. Consequently, contrary to existing literature, gender role items fail to improve the prediction of income, and including a sex variable remains indispensable for survey research.
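The comparison logic can be sketched as follows, on synthetic data rather than the ISSP; the variable names, effect sizes, and the use of permutation importance are assumptions made for the illustration.

```python
# Illustrative sketch: comparing random forests with and without a predictor
# on held-out performance and permutation importance. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 2000
sex = rng.integers(0, 2, n)        # binary predictor
roles = rng.normal(size=(n, 3))    # attitude items (no true effect here)
controls = rng.normal(size=(n, 2))
income = 1.5 * sex + controls @ np.array([0.5, 0.3]) \
    + rng.normal(scale=1.0, size=n)

def fit_score(X):
    """Fit a forest and return it with its held-out R^2 and test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, income, random_state=0)
    m = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return m, m.score(X_te, y_te), (X_te, y_te)

base, r2_base, _ = fit_score(controls)                    # "empty" model
full, r2_full, (X_te, y_te) = fit_score(
    np.column_stack([controls, sex, roles]))              # full model

# Permutation importance on the held-out set ranks the predictors;
# columns: 0-1 controls, 2 sex, 3-5 gender role items
imp = permutation_importance(full, X_te, y_te, random_state=0).importances_mean
```

Whether the role items add predictive power is then read off the R^2 difference and the importance ranking, which mirrors the four-model comparison in the abstract.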