
ESRA 2023 Glance Program

All time references are in CEST

Combining Data Science and Survey Research to Improve (Training) Data Quality

Session Organisers Dr Christoph Kern (LMU Munich)
Dr Ruben Bach (University of Mannheim)
Mr Jacob Beck (LMU Munich)
Dr Stephanie Eckman (RTI International)
Time Wednesday 19 July, 11:00 - 12:30
Room U6-05

The data science world has made considerable progress in the development of machine learning techniques and algorithms that can make use of various forms of data. These methods have been picked up by survey researchers in various contexts to facilitate the processing of survey, text and sensor data or to combine data from different sources. Ultimately, the hope in these applications is to make use of various data sources efficiently while maintaining or improving data quality.

The data science field has paid less attention to the creation and use of high-quality training data in the development of algorithms. AI researchers, however, are increasingly realizing that insights from their models are constrained by the quality of the data the models are trained on. Selective participation or measurement error in data generation and collection can considerably affect the trained models and their downstream application. Misrepresentation of social groups during model training, for example, can result in disparate error rates across subpopulations.
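As a rough illustration of what "disparate error rates across subpopulations" means in practice, the following minimal sketch computes a classifier's error rate separately for each group. The function name `error_rate_by_group`, the group labels and the toy data are hypothetical and do not come from any of the studies in this session.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Return {group: share of misclassified cases} for each group label."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}


# Hypothetical toy data: true labels, model predictions, group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = error_rate_by_group(y_true, y_pred, groups)
print(rates)  # group B is misclassified twice as often as group A
```

In this toy example the overall error rate hides the fact that one group is misclassified twice as often as the other, which is the kind of disparity the paragraph above refers to.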

This session brings together methodological approaches from both data science and survey science that aim to improve data quality. We will discuss how both disciplines can learn from each other to improve representation and measurement in various data sources. We welcome submissions that demonstrate how data science techniques can be used to improve quality in data collection contexts, e.g. for

- utilizing information from digital trace data
- improving inference from non-probability samples
- integrating data from heterogeneous sources

The session will also feature studies that utilize the survey research toolkit for improving training data quality, e.g. to

- assess and improve the representativity of training data
- collect training data with better labels
- improve the measurement of the input used for prediction models.

Keywords: Data Science, Machine Learning, Data Quality


Improving Measurement Models using Deep Neural Networks. The case of Response Styles

Professor Artur Pokropek (IFiS PAN) - Presenting Author
Dr Tomasz Żółtak (IFiS PAN)
Dr Marek Muszyński (IFiS PAN)

Noncognitive constructs such as personality traits, attitudes, interests and reported behaviour are of great interest in every area of the social sciences. They are predominantly measured using standardised self-report questionnaires, which are usually based on rating scales (e.g. Likert-type scales). The popularity of self-report methods is due to their cost-efficiency, ease of administration and flexibility to assess a broad range of constructs. However, the use of self-report does not come without problems. There is much evidence that response styles (RS) can undermine the validity of self-report measurement by inducing skewed distributions, "theta shift", changed multivariate correlations, misrepresented factorial structure, distorted internal consistency, obscured cultural differences and inferential errors. Detecting different types of RS in measurement models is challenging, and although some analytical methods exist, their efficacy is questioned.

We present a method based on Deep Neural Networks (DNNs) that outperforms traditional approaches and detects a broad spectrum of response style tendencies. In a simulation study, the DNN-based approach was tested against the multidimensional generalised partial credit model and IRTree models. The results showed that both midpoint and extreme response patterns could be detected successfully by traditional approaches based on model comparisons as well as by the DNN-based approach. However, for other types of abnormal survey responding (e.g. diagonal lining, extreme bouncing or random responding), DNNs are unrivalled, with detection accuracy for a given behaviour ranging between 97% and 100%. The other methods compared perform much worse or are simply unable to model such response patterns. In this study, we focus on simulation studies but also present short empirical examples from real-life surveys to show the strengths of the proposed approach.
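To make the notions of midpoint and extreme response patterns concrete, here is a minimal descriptive sketch. It is not the DNN-based method from the abstract, nor one of the IRT models it is compared against. It only computes the share of extreme and midpoint answers in one respondent's answers to items on a rating scale. The function name `response_style_indices`, the 5-point scale and the example patterns are assumptions made for illustration.

```python
def response_style_indices(responses, scale_min=1, scale_max=5):
    """Share of extreme and midpoint answers in one respondent's answers."""
    if not responses:
        raise ValueError("responses must not be empty")
    midpoint = (scale_min + scale_max) / 2
    n = len(responses)
    # Extreme responding: answers at either end of the scale.
    extreme = sum(r in (scale_min, scale_max) for r in responses) / n
    # Midpoint responding: answers exactly at the scale midpoint.
    middle = sum(r == midpoint for r in responses) / n
    return {"extreme": extreme, "midpoint": middle}


# Hypothetical answers to eight items on a 5-point scale.
extreme_pattern = [1, 5, 5, 1, 5, 1, 5, 5]
midpoint_pattern = [3, 3, 3, 2, 3, 3, 4, 3]

print(response_style_indices(extreme_pattern))
print(response_style_indices(midpoint_pattern))
```

Simple proportions like these cannot separate a response style from a genuinely extreme or neutral trait level, which is one reason model-based detection is needed.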

Automatic Scoring of Cognition Drawings - Assessing the quality of machine based scores against a gold-standard

Dr Arne Bethmann (MEA-SHARE and SHARE Berlin Institute) - Presenting Author
Ms Charlotte Hunsicker (MEA-SHARE and SHARE Berlin Institute)
Ms Claudia Weileder (MEA-SHARE and SHARE Berlin Institute)
Ms Marina Aoki (MEA-SHARE and SHARE Berlin Institute)

The drawing of figures is often used as part of dementia screening protocols. SHARE has adopted three drawing tests from Addenbrooke’s Cognitive Examination III as part of its questionnaire module on cognition. While the drawings are commonly scored by trained clinicians, SHARE uses the face-to-face interviewers conducting the SHARE interviews to score the drawings in the field. This might pose a risk to data quality, since interviewers might be less consistent in their scoring and more likely to make errors, due to the lack of clinical training.

Building on preliminary results presented at the CSDI Workshop 2022, we compare the performance of deep learning models against a gold-standard (ground truth) score developed using manual re-scoring by multiple trained raters and an arbitration process that consolidates divergences into a single score. We assume that this procedure will get the error rate reasonably close to the Bayes error rate, although this certainly remains a point for further research.
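The comparison against a gold standard can be sketched as follows, under strong simplifying assumptions: each drawing has a single integer score, and performance is summarised as the share of drawings whose score differs from the gold standard. The function name `error_rate` and all scores below are hypothetical and are not SHARE data or results.

```python
def error_rate(scores, gold):
    """Share of cases where a score differs from the gold-standard score."""
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(s != g for s, g in zip(scores, gold)) / len(gold)


# Hypothetical scores for ten drawings.
gold = [2, 1, 0, 2, 1, 2, 0, 1, 2, 2]          # arbitrated gold standard
interviewer = [2, 1, 1, 2, 1, 1, 0, 1, 2, 0]   # scores assigned in the field
model = [2, 1, 0, 2, 1, 2, 0, 2, 2, 2]         # scores predicted by a model

print(error_rate(interviewer, gold))
print(error_rate(model, gold))
```

A real evaluation would use held-out drawings and might prefer an agreement measure that accounts for the ordinal nature of the scores.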

We then test several convolutional neural network architectures and optimization approaches. Regarding the training data, we assess the feasibility of using actual interviewer-provided scores as compared to using the gold standard itself as training data. Finally, we will also investigate whether a combination of both, using the more precise gold standard to refine the training on the (eventually) abundant interviewer scores, might yield an efficient trade-off between precision and scoring effort.

We are currently in the process of collecting the drawings from additional SHARE countries and waves. As soon as this is finished, we will start preparing the data for public release as a strictly anonymized dataset. The final dataset will likely contain the three drawings for close to 50,000 cases.

Improving Alignment between Survey Responses and Social Media Posts

Professor Frederick Conrad (University of Michigan) - Presenting Author

The benefits of tracking public opinion by analyzing social media posts are potentially great if such measures can sometimes augment or even stand in for corresponding survey measures. An important step in evaluating this possibility is to determine when patterns of survey responses and social media posts are most likely to be aligned, i.e., to move up and down together over time. Here we explore the factors affecting alignment between responses to 23 survey questions measuring the US public’s opinions and knowledge about the 2020 US Census (76,919 online respondents) from January to September 2020 and a corpus of tweets (n=3,499,628) from the same time period containing keywords about the US Census Bureau. We find that alignment is more likely when (1) survey responses vary enough so that there is movement for the posts to align with, i.e., the signal-to-noise ratio is high, and (2) the social media corpus is composed of content that is semantically related to each survey question. For example, for a question that asks respondents how much they trust federal statistics in the US, restricting tweets to those that concern trust in federal statistics increases alignment. Alignment is further improved by restricting the corpus to tweets that express the same opinion (stance) as the survey response whose movement is the target, not just the same topic as the question. We tested the effect of stance on alignment for one question: whether or not the census would ask about householders’ citizenship. We trained a model on example tweets that were either manually or automatically labeled; in both cases, alignment improved when only the tweets that the model classified as expressing the will-not-ask stance were compared. These results begin to point the way toward realistically using social media

Assessing the Downstream Effects of Training Data Annotation Methods on Supervised Machine Learning Models

Mr Jacob Beck (LMU Munich) - Presenting Author
Dr Stephanie Eckman (Independent)
Mr Christoph Kern (LMU Munich)
Mr Rob Chew (RTI International)
Professor Frauke Kreuter (LMU Munich)

Machine learning (ML) training datasets often rely on human-annotated data collected via online annotation instruments. These instruments have many similarities to web surveys, such as the provision of a stimulus and fixed response options. Survey methodologists know that item and response option wording and ordering, as well as annotator effects, impact survey data. Our previous research showed that these effects also occur when collecting annotations for ML model training and that small changes in the annotation instrument impacted the collected annotations. This new study builds on those results, exploring how instrument structure and annotator composition impact models trained on the resulting annotations.

Using previously annotated Twitter data on hate speech, we collect annotations with five versions of an annotation instrument, randomly assigning annotators to versions. We then train ML models on each of the five resulting datasets. By comparing model performance across the instruments, we aim to understand (1) whether the way annotations are collected impacts the predictions and errors of the trained models, and (2) which instrument version leads to the most efficient model, judged by the model learning curves. In addition, we expand upon our earlier findings that annotators' demographic characteristics impact the annotations they make. Our results emphasize the importance of careful annotation instrument design. Hate speech detection models are likely to hit a performance ceiling without increasing data quality. By paying additional attention to the training data collection process, researchers can better understand how their models perform and assess potential misalignment with the underlying concept of interest they are trying to predict.