ESRA 2019 Draft Programme at a Glance

Assessing the Quality of Survey Data 6

Session Organiser Professor Jörg Blasius (University of Bonn)
Time Thursday 18th July, 14:00 - 15:30
Room D02

This session will provide a series of original investigations on data quality in both national and international contexts. The starting premise is that all survey data contain a mixture of substantive and methodologically-induced variation. Most current work focuses primarily on random measurement error, which is usually treated as normally distributed. However, there are many different kinds of systematic measurement error, or more precisely, many different sources of methodologically-induced variation, and all of them may have a strong influence on the “substantive” solutions. These sources include response sets and response styles, misunderstandings of questions, translation and coding errors, uneven standards between the research institutes involved in the data collection (especially in cross-national research), item and unit non-response, as well as faked interviews. We consider data to be of high quality when the methodologically-induced variation is low, i.e. when the differences in responses can be interpreted based on theoretical assumptions in the given area of research. The aim of the session is to discuss different sources of methodologically-induced variation in survey research, how to detect them, and the effects they have on the substantive findings.

Keywords: Quality of data, task simplification, response styles, satisficing

Dependent Interviewing - A Remedy or a Curse for Measurement Error in Surveys?

Dr Dimitris Pavlopoulos (Vrije Universiteit Amsterdam) - Presenting Author
Ms Paulina Pankowska (Vrije Universiteit Amsterdam)
Dr Daniel Oberski (Utrecht University)
Professor Bart Bakker (Statistics Netherlands)

The aim of this paper is to assess the effect of dependent interviewing on measurement error in the employment contract type of workers in the Netherlands.

Dependent interviewing (DI) is a data collection technique which uses information from prior interviews in subsequent interview rounds. Longitudinal surveys often rely on this method to reduce respondent burden and achieve higher longitudinal consistency of responses. The latter is also supposed to reduce (random) measurement error. However, DI has also been shown to lead to cognitive satisficing. That is, when confronted with their previous responses, interviewees are tempted to confirm that no changes have occurred and that the answers provided previously still hold. Such behavior increases the probability of systematic measurement error: if a respondent made an error in the first interview round, it is highly likely that this error will be carried over to subsequent rounds.

Measurement error in longitudinal categorical data is typically studied with a special family of latent class models, Hidden Markov Models (HMMs). However, studying non-random, systematic measurement error, such as the error that DI may cause, entails relaxing the local independence assumption of HMMs, which requires measurement error to be uncorrelated over time.

Therefore, in our study, we apply an extended, two-indicator HMM to linked data from the LFS and the Dutch Employment Register. The use of two indicators allows relaxing the independence assumption while maintaining model identifiability; this enables modelling auto-correlated (systematic) errors in the LFS. We use data from periods during which DI was used fully or partially as well as periods during which it was not used. Our results show that the overall effect of DI is negligible: while, in line with theory, it lowers random error and increases systematic error, neither effect is significant.
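
As a rough illustration of why the distinction between random and carried-over error matters, the following minimal Python sketch uses a toy two-state hidden Markov chain. It is not the authors' two-indicator model: the state labels, all probabilities, and the `carry_over` parameter (the chance that a respondent simply confirms the previous answer) are assumptions invented for this example.

```python
# Toy two-state hidden Markov chain for contract type
# (0 = permanent, 1 = temporary). All numbers are invented for illustration.
initial = [0.8, 0.2]                        # P(true state at wave t)
transition = [[0.95, 0.05], [0.20, 0.80]]   # P(true state at t+1 | true state at t)
emission = [[0.97, 0.03], [0.10, 0.90]]     # P(observed answer | true state)


def true_change_rate(initial, transition):
    """Probability that the true state differs between two waves."""
    return sum(
        initial[i] * transition[i][j]
        for i in range(2) for j in range(2) if i != j
    )


def observed_change_rate(initial, transition, emission, carry_over=0.0):
    """Probability that the observed answers differ between two waves.

    With probability `carry_over` the second answer just repeats the first
    (a crude stand-in for confirming a pre-filled response); otherwise both
    answers are drawn independently given the true states.
    """
    independent = 0.0
    for i in range(2):
        for j in range(2):
            p_true = initial[i] * transition[i][j]
            p_differ = sum(
                emission[i][a] * emission[j][b]
                for a in range(2) for b in range(2) if a != b
            )
            independent += p_true * p_differ
    return (1.0 - carry_over) * independent


true_rate = true_change_rate(initial, transition)
random_error_rate = observed_change_rate(initial, transition, emission)
carried_error_rate = observed_change_rate(
    initial, transition, emission, carry_over=0.5
)

print(f"true change rate:            {true_rate:.4f}")
print(f"observed, independent error: {random_error_rate:.4f}")
print(f"observed, carried-over:      {carried_error_rate:.4f}")
```

Under these assumed numbers, independent classification error inflates the observed amount of change relative to the true amount, while carrying answers over pulls it back down. The sketch does not show whether the carried-over answers are correct, which is the question a model with correlated errors is meant to address.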

Assessing Sources of Error in Two Push-to-Web Surveys Using Linked Income Data

Mr Nicolas Pekari (FORS) - Presenting Author
Professor Boris Wernli (FORS)
Professor Georg Lutz (FORS)

We study how data linkage can be used to provide an empirical basis to assess elements of the TSE framework. A key survey item that often suffers from significant nonresponse and measurement error is income. Using validated income data, we study how data linkage can help us understand the different sources of error. We use data from two surveys: the 2015 Swiss electoral study (Selects), conducted as a CAWI-CATI mixed mode survey and the European Values Study (EVS) 2017, conducted as a CAWI-PAPI mixed mode survey. The initial register-based samples were provided by the Swiss Federal Statistical Office (FSO) and basic socio-demographic information was available for all sampled individuals and household members. This information was then linked with income data from the Swiss social security system, which provides an objective measure of household income for all sample members.
Using the TSE framework and the linked datasets, we address the following questions: 1) Do non-respondents present a different distribution of income compared to respondents, and what is the influence of income on total non-response compared to other known parameters (unit non-response)? 2) Do non-respondents to the survey income question present a specific distribution of income (item non-response)? 3) Which sociodemographic or political indicators influence the difference between declared and register income (measurement error)? 4) How do the two different push-to-web strategies differ in their ability to produce accurate data (idem)? 5) What is the influence of the difference between objective and subjective measures on substantive findings (idem)? We will present the answers to these questions using multivariate analysis to disentangle the impact of different families of explanatory factors.
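
To make the unit non-response question concrete, here is a minimal Python sketch of how register income linked to every sampled person allows non-response bias to be computed directly. The income values and response flags are hypothetical and do not come from Selects, the EVS, or the Swiss registers.

```python
from statistics import mean

# Hypothetical linked sample: (register income, responded to the survey?)
sample = [
    (32000, True), (41000, True), (55000, True), (78000, True),
    (67000, True), (24000, False), (29000, False), (36000, False),
]

all_income = [income for income, _ in sample]
respondents = [income for income, responded in sample if responded]
nonrespondents = [income for income, responded in sample if not responded]

full_mean = mean(all_income)        # benchmark: everyone in the sample
respondent_mean = mean(respondents) # what a survey-only estimate would use

# Non-response bias of the respondent mean, computed two equivalent ways.
bias_direct = respondent_mean - full_mean
nonresponse_rate = len(nonrespondents) / len(sample)
bias_identity = nonresponse_rate * (respondent_mean - mean(nonrespondents))

print(f"full-sample mean: {full_mean:.0f}")
print(f"respondent mean:  {respondent_mean:.0f}")
print(f"bias:             {bias_direct:.0f}")
```

The second formula is the standard decomposition of non-response bias into the non-response rate times the difference between respondent and non-respondent means. Without linked data for non-respondents, neither quantity on the right-hand side is observable.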

The Effect of Measurement Error on Clustering Algorithms

Ms Paulina Pankowska (Vrije Universiteit Amsterdam) - Presenting Author
Dr Dimitris Pavlopoulos (Vrije Universiteit Amsterdam)
Dr Daniel Oberski (Utrecht University)

Researchers from many disciplines often employ a variety of clustering techniques, such as K-means, DBSCAN, PAM, Ward, and Gaussian mixture models (GMMs), in order to separate survey data into interesting groups for further analysis or interpretation.

Surveys, however, are well known to contain measurement errors. Such errors may adversely affect clustering - for instance, by producing spurious clusters, or by obscuring clusters that would have been detectable without errors. Furthermore, measurement error might reduce intra-cluster homogeneity and lower the degree of inter-cluster separation. Yet, to date, the concrete effects that such errors may exert on commonly used clustering techniques have rarely been investigated. While the few existing studies in the field suggest some adaptations to specific clustering algorithms to make them "error-aware", they focus predominantly on random measurement error and make no mention of systematic errors that may exist in the data. In addition, these studies often assume that the extent of measurement error is known a priori, an assumption which is rarely fulfilled in practice.

In our simulation study, we investigate the sensitivity of commonly used model- and density-based clustering algorithms (i.e. GMMs and DBSCAN) to differing rates and magnitudes of random and systematic measurement errors. We look at the effects of the error on the number of clusters and on the similarity of these clusters to the ones obtained in the absence of measurement error. Our analysis shows that, when only one variable is affected, random error substantially biases the clustering results only in rather extreme scenarios, while even a moderate level of systematic error leads to a significant bias. When all (three) variables contain measurement error, though, both types of error lead to non-ignorable bias. We also find that, overall, GMM results are more robust to measurement error than DBSCAN results.
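
The kind of comparison described above can be sketched in a few lines of Python. This is a toy, not the authors' simulation design: the grid-shaped data, the error sizes, the `eps` and `min_pts` settings, and the small pure-Python DBSCAN re-implementation are all assumptions chosen so that the example is self-contained. GMMs are not covered here.

```python
import random
from collections import deque


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN for 2-D points; returns one label per point (-1 = noise)."""
    n, eps2 = len(points), eps * eps

    def neighbours(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2 <= eps2]

    labels = [None] * n              # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:     # not a core point (may become border later)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:      # former noise point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:   # only core points expand the cluster
                queue.extend(nb)
    return labels


def n_clusters(points, eps=0.75, min_pts=4):
    return len({lab for lab in dbscan(points, eps, min_pts) if lab != -1})


# Error-free toy data: two 5x5 grids of "respondents", far apart on x.
grid = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(5)]
clean = grid + [(x + 10.0, y) for x, y in grid]

# Random error: small zero-mean noise on one variable (y) for everyone.
rng = random.Random(0)
with_random = [(x, y + rng.uniform(-0.1, 0.1)) for x, y in clean]

# Systematic error: a subgroup (x <= 0.5) reports y shifted by a constant.
with_systematic = [(x, y + 5.0) if x <= 0.5 else (x, y) for x, y in clean]

k_clean = n_clusters(clean)
k_random = n_clusters(with_random)
k_systematic = n_clusters(with_systematic)
print(k_clean, k_random, k_systematic)
```

In this constructed case the small random error leaves the number of clusters unchanged, whereas the constant shift in one subgroup produces an additional, spurious cluster. The outcome depends entirely on the assumed error sizes and DBSCAN settings, so it illustrates the mechanism rather than reproducing the study's results.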

The Influence of Media Coverage on Political Knowledge over the Data Collection Period

Mr Michael Blohm (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Ms Oshrat Hochman (GESIS - Leibniz Institute for the Social Sciences)
Mr Sebastian Stier (GESIS - Leibniz Institute for the Social Sciences)
Ms Jessica Walter (GESIS - Leibniz Institute for the Social Sciences)

The aim of data collection is to get an understanding of what individuals think about different topics. In order to do this as well as we can, we need to reduce survey errors. One potential component of error is associated with information respondents gather from their immediate environment during the data collection period. Substantial variation in the exposure to different topics in the mass media might result in variation in political knowledge in a society. We examine the influence of information that was present in the mass media during the data collection period on respondents’ answers to political knowledge items. We discuss how responses to political knowledge items in a population are associated, across the data collection period, with the prevalence of political topics in the mass media.
We use political knowledge items and individual characteristics of the respondents, collected in CASI mode for the German General Social Survey, as well as media data (scraped from the online presences of the most important newspapers, political magazines and public broadcasters) collected in parallel to the fieldwork. Over a data collection period of 25 weeks, we examine whether the degree of political knowledge in the population remains stable over time or whether the share of correct answers varies with the salience of related topics in the mass media. Besides individual characteristics, such as socio-demographics, political interest, and media and internet usage, we analyze whether the level of difficulty of the political knowledge questions has an influence on the changes in knowledge over time.