ESRA 2019 Programme at a Glance

Predictive Modeling and Machine Learning in Survey Research 2

Session Organisers Mr Christoph Kern (University of Mannheim)
Mr Ruben Bach (University of Mannheim)
Mr Malte Schierholz (University of Mannheim, Institute for Employment Research (IAB))
Time Thursday 18th July, 14:00 - 15:30
Room D24

Advances in the field of machine learning have created an array of flexible methods for exploring and analyzing diverse data. These methods often do not require prior knowledge about the functional form of the relationship between the outcome and its predictors while focusing specifically on prediction performance. Machine learning tools thereby offer promising advantages for survey researchers to tackle emerging challenges in data analysis and collection and also open up new research perspectives.

On the one hand, utilizing new forms of data gathering, e.g. via mobile web surveys, sensors or apps, often results in (para)data structures that might be difficult to handle -- or fully utilize -- with traditional modeling methods. This might also be the case for data from other sources such as panel studies, in which the wealth of information that accumulates over time induces challenging modeling tasks. In such situations, data-driven methods can help to extract recurring patterns, detect distinct subgroups or explore non-linear and non-additive effects.

On the other hand, techniques from the field of supervised learning can be used to inform or support the data collection process itself. In this context, various sources of survey errors may be thought of as constituting prediction problems which can be used to develop targeted interventions. This includes e.g. predicting noncontact, nonresponse or break-offs in surveys to inform adaptive designs that aim to prevent these outcomes. Machine learning provides suitable tools for building such prediction models.

This session welcomes contributions that utilize machine learning methods in the context of survey research. The aim of the session is to showcase the potential of machine learning techniques as a complement and extension to the survey researchers' toolkit in an era of new data sources and challenges for survey science.

Keywords: machine learning, predictive models, data science

Efficient Free Text Paradata Analysis through Machine Learning

Mr Andrew Latterner (NORC) - Presenting Author
Ms Melissa Heim Viox (NORC)
Mr Ryan Buechel (NORC)
Miss Kelly McGarry (NORC)
Dr Felicia LeClere (NORC)

By utilizing paradata, survey administrators have an increased ability to iteratively improve and manage data quality; paradata such as interviewer questionnaire comments and interviewer notes on contact attempts, or records of call, can provide key insights regarding the survey methodology, response intricacies, and even administrative process improvements. However, in order to consistently utilize these data and intelligently respond to them by modifying questions and methodologies, survey managers usually must devote extensive effort to manual review and categorization. This review process can be both expensive and time-consuming. We applied natural language processing (NLP) and machine learning (ML) modeling techniques to automate this annotation process and to reduce the time required to glean insights from these data.
The Medicare Current Beneficiary Survey (MCBS) is a continuous, multipurpose survey of a nationally representative sample of the United States Medicare population, conducted by the Centers for Medicare & Medicaid Services through a contract with NORC at the University of Chicago. Using data from the MCBS we were able to classify questionnaire comments collected from MCBS interviewers into distinct categories ranging from general respondent confusion to a need for data correction, with a strong enough reliability to greatly reduce required manual review. This automatic classification allowed us to understand potential sources of confusion or error in survey questions, as well as identify interviewer training opportunities, without the need for item-level coding of paradata. We also used NLP and ML techniques to classify records of call to better understand interviewer performance and contacting patterns.

Automatic Coding of Open-Ended Questions: Should You Double Code the Training Data?

Professor Matthias Schonlau (University of Waterloo) - Presenting Author
Ms Zhoushanyue He (University of Waterloo)

Open-ended questions in surveys are often manually coded into one of several classes. When the data are too large to manually code all texts, a statistical (or machine) learning model must be trained on the manually coded subset of texts. Uncoded texts are then coded automatically using the trained model. The quality of automatic coding depends on the trained statistical model, and the model relies on manually coded data on which it is trained. While survey scientists are acutely aware that the manual coding is not always accurate, it is not clear how double coding affects the classification errors of the statistical learning model.
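The train-then-code workflow described above can be sketched as follows. This is a minimal illustration only, assuming a simple multinomial naive Bayes classifier written in plain Python; the abstract does not say which learning algorithm the authors used, and the example texts, the code labels and the function names (tokenize, train_naive_bayes, predict) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(texts, codes):
    """Fit a multinomial naive Bayes model on the manually coded subset."""
    class_counts = Counter(codes)        # number of training texts per code
    word_counts = defaultdict(Counter)   # word frequencies per code
    vocab = set()
    for text, code in zip(texts, codes):
        tokens = tokenize(text)
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the code with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[code].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training carry no information
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[code][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical manually coded answers to an open-ended question.
coded_texts = [
    "long working hours and low pay",
    "low pay and no promotion",
    "friendly colleagues and good team",
    "good team spirit and nice colleagues",
]
manual_codes = ["conditions", "conditions", "social", "social"]

model = train_naive_bayes(coded_texts, manual_codes)

# Texts that were not manually coded are coded automatically.
uncoded_texts = ["pay is too low", "nice team"]
auto_codes = [predict(model, text) for text in uncoded_texts]
print(auto_codes)
```

Any error in manual_codes is passed on to the model, which is why the accuracy of the manual coding matters for the automatic coding.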

When there is a limited budget for manual classification, how should this budget be allocated to reduce classification errors of statistical learning algorithms? We investigate several allocation strategies: single coding vs. various options for double coding with a reduced number of training texts (because of the fixed budget).
When coding cost is no concern, we found all double-coding strategies generally outperformed single-coding. Under fixed cost, double coding improved prediction of the learning algorithm when the coding error is greater than about 20-35%, depending on the data. Among double coding strategies, paying for an expert to resolve differences performed best. When no expert is available, removing differences from the training data outperformed other double coding strategies.
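The fixed-budget trade-off and the "remove differences" strategy can be illustrated with a short sketch. It assumes that every coding of a text costs the same and that a disagreement is any pair of unequal codes; the budget figures, example data and function names (affordable_texts, drop_disagreements) are hypothetical and not taken from the study.

```python
def affordable_texts(budget, cost_per_coding, codings_per_text):
    """Number of training texts a fixed budget pays for."""
    return budget // (cost_per_coding * codings_per_text)


def drop_disagreements(texts, codes_a, codes_b):
    """Keep only the texts on which both coders assigned the same code."""
    return [
        (text, a)
        for text, a, b in zip(texts, codes_a, codes_b)
        if a == b
    ]


# With the same budget, double coding halves the number of training texts.
n_single = affordable_texts(budget=1000, cost_per_coding=1, codings_per_text=1)
n_double = affordable_texts(budget=1000, cost_per_coding=1, codings_per_text=2)
print(n_single, n_double)

# Hypothetical double-coded texts; the coders disagree on the second one.
texts = ["low pay", "long hours but nice team", "friendly colleagues"]
coder_a = ["conditions", "conditions", "social"]
coder_b = ["conditions", "social", "social"]

training_data = drop_disagreements(texts, coder_a, coder_b)
print(training_data)
```

Dropping disagreements shrinks the training set further, so this strategy trades training set size for cleaner labels. The expert-resolution strategy would instead keep all texts, at the extra cost of the expert's time.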

Data-Processing Techniques for the Extraction and Classification of Data Generated from Shopping Receipts

Mr Brendan Read (University of Essex) - Presenting Author

By collecting images of textual data, in this case receipts, it is possible to capture a rich source of information in addition to that obtained through traditional survey methods. However, the challenge such data present is extracting relevant information and curating it into a useful format. In the case of shopping receipt data, one such useful format might be the categorisation of item descriptions. Manual extraction and classification of data is a time-consuming and costly process that is not easily scalable. Automating this process may help to reduce the costs in terms of time and effort, improving scalability.

Machine learning tools offer a potential extension to traditional data processing techniques to improve upon the accuracy and ease with which data is extracted and classified. However, the performance of machine learning techniques, as well as other more traditional techniques of data processing, should be examined empirically. This research takes advantage of data from the Understanding Society Spending Study One, where captured images of receipts were manually coded and classified. This offers an opportunity to evaluate the success of machine learning tools, and other methods of automated data collection by comparing against this manually coded dataset. In addition, the fact that the Spending Study is embedded within an existing probability-based household panel study, and the rich set of covariates this provides, also allows analysis into the extent to which different approaches to automation may result in biased estimates from the data generated. This research therefore compares different data processing techniques to answer the following research questions: what is the best approach for extracting data from scanned images of receipts? What is the best approach for classifying purchased items in the extracted data? To what extent do these automated processes introduce systematic biases to estimates generated using the processed data?

Utilising Machine Learning for Making Sense of High-Dimensional Survey Data: A Case with Voting Advice Applications

Mr Guillermo Romero (University of Southampton) - Presenting Author
Mr Enrique Chueca (King's College London)
Mr Javier Padilla (The London School of Economics and Political Science)

Surveys often provide data in a space composed of many dimensions with complex interactions between them. Simple models cannot always cope with such complexity, so Machine Learning techniques are a good alternative that can efficiently capture the underlying structures. We focus on Voting Advice Applications (VAAs), which are web platforms that recommend political parties according to the user’s preferences on a set of relevant policy issues. They also contain the parties’ positions on the same set of issues and use them for computing the recommendation. Since users are also asked to state the party they plan to vote for, supervised learning techniques can be employed for making predictions on the right party to recommend. While the goal of the application is to produce a recommendation, inspecting the model learned by the Machine Learning algorithm can help in understanding which policy issues are more relevant in the voting decision process and in quantifying how users interpret their level of agreement within each policy issue. In this work, we implement a bespoke Machine Learning algorithm that learns from these data, taking special care to preserve the interpretability of the model, and we analyse how the learned model can effectively provide aggregated information about the user sample.