All time references are in CEST
From surveys to digital behavior: Data quality concepts, quality indicators, and linkage errors
Session Organisers | Dr Jessica Daikeler (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany), Dr Ruben Bach (University of Mannheim, Germany), Mr Leon Fröhling (GESIS Leibniz Institute for the Social Sciences, Cologne, Germany), Dr Joachim Piepenburg (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany), Dr Henning Silber (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Time | Friday 21 July, 09:00 - 10:30
Room | U6-01a |
Traditionally, data relevant to the social sciences can be divided into three types: survey, observational, and content data (Purdam and Elliot 2015; Schnell, Hill, and Esser 2009:313). While survey data have long been the focus of quantitative social science analyses, observational and content data, although equally long-established, have gained renewed attention with the increasing popularity of computational social science approaches (Callegaro and Yang 2018; Japec et al. 2015; Jungherr et al. 2017; Stier et al. 2020). However, the sheer volume and granularity of these data can quickly mislead researchers about their quality. Even the most innovative and extensive datasets are of little use if their quality is poor.
This session is dedicated to data quality in this new era of the social sciences. It focuses on applied and theoretical data quality questions in collecting, analyzing, and documenting digital content data and other data sources. It also addresses quality aspects of data donations in surveys and of linking survey data with other data sources. Among other topics, submissions to this session may address questions on:
- Data quality frameworks for survey data, digital behavior data, and other data types
- Indicators to assess the data quality of digital behavior data
- Validity testing
- Measurement error
- Representation of digital behavior data
- Software testing
- Documentation issues
- Methods and errors of data donations and data linkage
- Linkage operations
- Consent
- Coverage bias
- Platform compliance
Keywords: data quality, data donation, data linkage, digital behavior data, survey data, digital content data, error concept, consent
Dr Jessica Daikeler (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany) - Presenting Author
Miss Indira Sen (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Mr Lukas Birkenmaier (GESIS Leibniz Institute for the Social Sciences, Cologne, Germany)
Mr Leon Fröhling (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Dr Tobias Gummer (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Dr Clemens Lechner (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Dr Henning Silber (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Dr Bernd Weiss (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Dr Katrin Weller (GESIS Leibniz Institute for the Social Sciences, Mannheim, Germany)
Relevance & Research Question: Only 30 years ago, a few could anticipate the possibilities in data collection offered by devices such as computers and smartphones. Today, new technologies allow social scientists to track “ordinary behavior” by clustering activities and opinions on online platforms (e.g., social media), and have opened new avenues for analyzing, understanding, and addressing social science research questions. To target social science data quality within this new era of computational social science it is essential to link quality concepts of the information and computer sciences with those in the social sciences. Consequently, the present study aims to systematize social science data quality concepts in the light of old and new social science research data.
To guide researchers on questions of data quality, our study aims to facilitate interdisciplinary exchange by providing a comprehensive and systematic review of existing data quality frameworks. In investigating our research question, we will provide answers to practical questions such as: Is the relationship between data quality concepts in the information and computer sciences and those in the social sciences already mapped out in existing data quality concepts? Which quality dimensions, design decisions, and quality indicators are represented in existing quality concepts, where are the conceptual gaps, and which quality concept is most appropriate given a researcher's data and research questions?
Methods: We develop and present our results with the help of a systematic review, relying on text mining methods to conduct the systematic literature search and coding. A sketch of this screening step is given below.
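As a rough illustration of text-mining-assisted screening in a systematic review, the following is a minimal sketch, not the authors' actual pipeline. The abstracts and query terms are placeholders; it assumes candidate records are available as plain text and uses standard scikit-learn components.

```python
# Minimal sketch: rank candidate abstracts by TF-IDF similarity to a
# concept query, forwarding high-scoring records to manual coding.
# The corpus and query below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "A framework for assessing the quality of social media data ...",
    "Total survey error and its extensions to digital trace data ...",
    "Deep learning architectures for image classification ...",
]

# Terms describing the concept of interest (data quality frameworks).
query = ["data quality framework error survey digital trace"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(abstracts)
query_vec = vectorizer.transform(query)

# Similarity of each abstract to the query; higher = more relevant.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for score, text in sorted(zip(scores, abstracts), reverse=True):
    print(f"{score:.2f}  {text[:60]}")
```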
Added Value: Results from our study will help identify relevant data quality frameworks for social scientists working with both traditional and new data types. Additionally, our study will facilitate interdisciplinary exchange between the computer and social sciences.
Dr Alexandru Cernat (University of Manchester) - Presenting Author
Dr Florian Keusch (University of Mannheim)
Dr Ruben Bach (University of Mannheim)
Dr Paulina Pankowska (Utrecht University)
Digital trace data are receiving increased attention as a potential way to capture human behavior. Nevertheless, these data are far from perfect and may not always provide better measurements than traditional social surveys. In this study, we use an experimental design in which we collected data on five topics relating to mobile phone use with five methods: three different survey scales and two measures derived from digital trace data. We show that survey and digital trace measures correlate only weakly with each other. We also show that all measures are far from perfect and that, while digital trace data often appear to be of better quality than surveys, this is not always the case. Finally, we find that the duration measures, in both surveys and digital trace data, have the best quality of the methods we compared.
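To make the kind of comparison concrete, here is a minimal sketch of correlating a self-reported measure with a logged one; the data frame is a toy placeholder, not the study's data, and the column names are invented for illustration.

```python
# Toy example: correlation between self-reported and logged phone use.
# Low correlations between the two sources indicate that at least one
# of the measures carries substantial error.
import pandas as pd

df = pd.DataFrame({
    "survey_hours": [1.0, 2.5, 0.5, 4.0, 3.0],  # self-reported daily use
    "logged_hours": [1.4, 2.0, 1.1, 5.2, 2.7],  # from tracking-app logs
})

print(df["survey_hours"].corr(df["logged_hours"]))  # Pearson's r
```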
Miss Indira Sen (GESIS) - Presenting Author
Social scientists, including computational social scientists, want to find manifestations of social phenomena such as sexism, workplace depression, and conspiratorial beliefs in large-scale digital trace data. While these traces hold great potential for improving social science inquiry, there are also obstacles. One challenge is that the scale of digital traces necessitates the use of automated methods such as machine learning (ML). ML methods for specific constructs like conspiratorial beliefs or workplace depression are often supervised and therefore require large labeled training datasets. Manual labeling is expensive and can lead to only a few dimensions of the construct being considered. Consequently, models end up being defined in an ad-hoc manner, without theory. In this work, I propose to leverage survey items and sentence embeddings to improve how these models are defined and operationalized. Survey items are a way of incorporating theory and preclude the need for extensive manual labeling.
Current natural language processing methods can encode text, typically sentences, into representations called sentence embeddings. These methods are trained on massive corpora of text data in an unsupervised manner, allowing them to ‘learn’ words in context and to capture semantic similarity between conceptually related sentences. Specialized methods exist for generating representations tailored to the task of semantic textual similarity, i.e., representations optimized for finding similar content. We therefore leverage these sentence embeddings to find digital traces that are related to survey items measuring certain social phenomena. We apply this method to measuring workplace depression: we adapt a work-related depression scale, the Occupational Depression Inventory (ODI), gather more than 350K employee reviews of 104 major companies, and develop a framework that scores these reviews on a composite ODI score using the similarity between the ODI items and the review sentences.
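A minimal sketch of this similarity scoring follows, using the sentence-transformers library. The model name is an assumed general-purpose choice, and the scale items below are hypothetical paraphrases in the style of a depression inventory, not the actual ODI items.

```python
# Sketch: score review sentences by cosine similarity to scale items.
# Model choice and items are illustrative assumptions, not the ODI.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

scale_items = [
    "My job made me feel exhausted and drained.",  # hypothetical item
    "I lost interest in my work.",                 # hypothetical item
]
review_sentences = [
    "I dreaded going to the office every morning.",
    "Great snacks in the break room.",
]

item_emb = model.encode(scale_items, convert_to_tensor=True)
review_emb = model.encode(review_sentences, convert_to_tensor=True)

# One similarity per (review sentence, scale item) pair; per-review
# scores can then be aggregated into a composite score.
sim = util.cos_sim(review_emb, item_emb)
print(sim)
```

A review sentence that is semantically close to many scale items receives a high composite score, which is the intuition behind scoring reviews against the inventory without manually labeled training data.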