Program at a glance 2021

Advances in modelling

Session Organiser: Dr Cecil Meeusen (University of Leuven)
Time: Friday 16 July, 13:15 - 14:45

This panel explores how advanced techniques such as machine learning, survey experiments, and the use of digital data can overcome certain shortcomings in traditional survey analysis. Specifically, the papers in the panel address issues such as outlier detection, interview fabrication, and low response rates in survey research and propose machine learning techniques to overcome these problems. Two other papers illustrate how innovative panel designs and the complementary use of digital trace data can advance our knowledge about attitude and value change.

Keywords: Survey, machine learning, panel data, digital trace data

A Privacy-Preserving and Interpretable Approach to Modelling and Predicting Web Survey Response Propensities with Time-varying Predictors (Timing, Weather and Societal Trends)

Mr Qixiang Fang (Utrecht University) - Presenting Author
Dr Joep Burger (Statistics Netherlands)
Dr Ralph Meijers (Statistics Netherlands)
Dr Kees van Berkel (Statistics Netherlands)

Despite their popularity, web surveys tend to suffer from low response rates and consequently, compromised data quality. Therefore, both knowledge about influencing factors of (non)responses and the ability to predict nonresponse in advance are important, as they can help survey researchers to design surveys and plan data collection in a way that would optimize response rates.

Response propensity models are often used for this purpose. Many have been proposed in the literature. Here, we highlight three aspects that can benefit from further methodological innovation.

First, we argue that in order for such models to be as helpful as possible for survey researchers, it is crucial to strike a good balance between model interpretability and prediction performance, which most existing models have fallen short of. Many emphasize high interpretability (e.g. simple models with a handful of predictors) but pay little attention to their generalizability to future data. Others (e.g. black-box machine learning models with a large number of predictors) focus on achieving high prediction accuracy but largely ignore the need for interpretability.

Second, prior research often uses predictors that require personal data, such as demographics, personal values, and personality traits. Given increasing public concerns and stricter legal regulations regarding data privacy, we believe that it is desirable to move away from relying on such predictors and resort to a privacy-preserving solution that makes use of only non-personal data.

Third, existing response propensity models seem to have largely ignored the fact that response propensities may vary over time due to time-varying factors like personal availability, mood, and the weather. We argue that time-varying factors can provide both researchers and prediction models with (additional) valuable information about (non)response decisions.

Our paper aims to address these three aspects in two steps. First, we present a modelling approach which not only achieves a good balance between interpretability and prediction performance but is also capable of incorporating time-varying predictors. Specifically, we used the discrete-time survival analysis framework complemented with two techniques from the statistical learning literature: lasso regularization and component-wise gradient boosting. These techniques improve both the interpretability and the prediction aspects of the original approach. Second, we demonstrate the use of non-personal time-varying predictors, such as the day of a week, public holidays, daily weather (e.g. temperature, precipitation and sunshine duration) and indicators of daily societal trends (e.g. disease outbreaks, public outdoor engagement, privacy concerns and terrorism salience).
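
The discrete-time survival framing mentioned above can be illustrated with a minimal sketch. In this framing, each sampled person contributes one row per fieldwork day until they respond, and the quantity being modelled is the daily hazard: the probability of responding on a given day, given no response so far. The function names and the hazard values below are hypothetical and not taken from the paper, and the lasso and boosting steps are not shown.

```python
from typing import List, Optional, Tuple


def person_period_rows(response_day: Optional[int], n_days: int) -> List[Tuple[int, int]]:
    """Expand one sampled person into (day, responded) rows.

    A person contributes one row per day until they respond (event = 1)
    or until fieldwork ends without a response (all events = 0).
    """
    last_day = response_day if response_day is not None else n_days
    return [(day, int(day == response_day)) for day in range(1, last_day + 1)]


def response_curves(hazards: List[float]) -> Tuple[List[float], List[float]]:
    """Turn daily hazards into daily and cumulative response rates."""
    daily, cumulative = [], []
    still_silent = 1.0  # share of the sample that has not responded yet
    for h in hazards:
        daily.append(still_silent * h)   # responds today, silent before
        still_silent *= 1.0 - h
        cumulative.append(1.0 - still_silent)
    return daily, cumulative


# Hypothetical hazards for the first four fieldwork days (illustrative only).
daily, cumulative = response_curves([0.10, 0.05, 0.05, 0.02])
rows = person_period_rows(response_day=3, n_days=4)
```

In a full analysis, the rows produced by `person_period_rows` would carry the time-varying predictors for each day (for example, weekday or weather), and the hazards would come from a fitted model rather than being fixed by hand.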

Following these two steps, we analysed the daily response behaviour observed during the web mode of the 2016 and 2017 Dutch Health Surveys. The resulting model estimates provide insight into how timing, weather, and societal trends affect response propensities, which can guide better data collection planning. We further validated the reliability of these effect estimates on independent, “future” data. Similarly, we show that our models could accurately predict both daily and cumulative response rates on unseen data.

Identifying outliers to improve survey statistics

Professor Jörg Blasius (Universität Bonn) - Presenting Author
Professor Susanne Vogl (University of Stuttgart)

A few outliers can affect the entire solution in a large data set. This is especially true when analyzing sets of variables within a multivariate approach, for example, item batteries measuring latent attitudes. In both principal component and factor analyses, rotation of results is a standard feature of multivariate data analysis. In many cases, the unrotated solutions remain unpublished. The most commonly applied rotation is Varimax, which usually allows a better interpretation of the content, since it adapts more closely to the variable clusters. In the presentation, we show that such a rotation often adapts optimally only to the outliers of a survey, i.e., respondents who are often characterized by (strong) satisficing behavior. As an empirical example, we use data from a Viennese study with 14 to 16-year-old pupils from non-advanced schools; the data were collected in 2018 in a web survey. With low educational attainment and more than half of the respondents having German only as their second language, we can show that some pupils gave arbitrary answers that affect the entire solution.

Evaluating Machine Learning Algorithms to Detect Interviewer Falsification

Ms Silvia Schwanhäuser (Institute for Employment Research (IAB)) - Presenting Author
Ms Yuliya Kosyakova (Institute for Employment Research (IAB), University of Bamberg, and University of Mannheim)
Ms Natalja Menold (University of Dresden (TU-Dresden))
Mr Joseph Sakshaug (Institute for Employment Research (IAB), University of Munich (LMU), and University of Mannheim)
Mr Peter Winker (University of Giessen)

Interviewers play a vital role for the quality of survey data, as they directly influence response rates and are responsible for appropriately administering the questionnaire. At the same time, interviewers may be enticed to intentionally deviate from the prescribed interviewing guidelines or even fabricate entire interviews. Different studies have discussed various possibilities to prevent and detect such fraudulent interviewer behavior. However, the proposed controlling procedures are often time consuming and their implementation is cumbersome and costly.

One understudied possibility to simplify and automate the controlling process is to use supervised machine learning algorithms. Even though some studies propose the use of unsupervised algorithms like cluster analysis or principal component analysis, there is hardly any literature on otherwise widespread methods like neural networks, support vector machines, decision trees, or naïve Bayes. This is mainly driven by the lack of appropriate test and training data, including sufficient numbers of falsifiers and falsified interviews to evaluate the respective algorithms.

Using data from a German experimental study, including an equal share of falsified and real interviewers, as well as real-world data from a German panel survey with fraudulent interviews in different waves, we address the question: How well do supervised machine learning algorithms discriminate between real and falsified data? To do this, we evaluate the performance of different algorithms under various scenarios. By utilizing different data sources and working with different subsets for training and testing the algorithms within and across datasets, we provide additional evidence regarding the external validity of the results. In addition, the setting allows us to draw conclusions on the different strategies and behaviors of falsifying interviewers.
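
How well a classifier discriminates between real and falsified interviews can be summarised in several ways. The abstract does not state which measures were used, so the following is only a sketch under the assumption that sensitivity, specificity, and precision are of interest; the function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def detection_metrics(is_falsified: Sequence[bool], flagged: Sequence[bool]) -> Dict[str, float]:
    """Compare true falsification status with a classifier's flags."""
    if len(is_falsified) != len(flagged):
        raise ValueError("label and prediction sequences differ in length")
    pairs = list(zip(is_falsified, flagged))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # falsified interviews caught
        "specificity": ratio(tn, tn + fp),  # real interviews left alone
        "precision": ratio(tp, tp + fp),    # flagged interviews truly falsified
    }


# Toy example: two falsified and three real interviews (illustrative only).
metrics = detection_metrics(
    is_falsified=[True, True, False, False, False],
    flagged=[True, False, False, True, False],
)
```

To mirror the within- and across-dataset scenarios described above, the same function could be applied once to held-out cases from the training source and once to cases from the other source.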

Predicting Basic Human Values of Youth from Digital Traces

Mr Mikhail Bogdanov (National Research University "Higher School of Economics") - Presenting Author

Several studies demonstrate that some human traits and attributes are predictable from digital traces on social media (Kosinski et al., 2013, 2016). Probably the most studied phenomenon is personality traits. Meta-analyses have shown that personality traits are predictable from digital traces on social media (Azucar et al., 2018; Huang, 2019; Settanni et al., 2018). However, only a few studies attempt to predict basic human values from digital traces, even though values are more socially constructed than personality traits and might therefore be reflected in people’s social media profiles and behavior.
Moreover, most studies use digital traces from globally popular social media platforms (Facebook and Twitter). Substantially fewer studies employ data from local social media platforms. In this study, we try to fill these gaps by predicting Schwartz’s Basic Human Values using digital traces from the Russian social network platform “VK” (an analogue of Facebook).
Our analysis was based on data from a nationally representative cohort panel study, “Trajectories in Education and Careers” (TrEC). This study follows the cohort of eighth graders of 2011 who participated in the international study “Trends in International Mathematics and Science Study” (TIMSS). The modal age of these respondents is now 24 years. We use survey data on Schwartz’s Basic Human Values, measured with the 21-item Portrait Values Questionnaire (PVQ) in the most recent wave, and digital traces from the social media platform “VK”. VK is a leading social media platform in Russia with over 90% penetration among youth.
We employed various machine learning techniques (elastic net, RF, SVM, etc.) to predict four Higher Order Values from subscriptions to the public pages on this platform and found that values are less predictable from digital traces than personality traits (Big-5). Correlations between predicted and true values for the best models range between 0.1 and 0.23. We also used text data from posts on VK to predict values. Models with text data from VK demonstrate slightly higher performance. We also explored text features that are highly predictive of different higher order values.
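
The performance measure reported above, the correlation between predicted and true values, can be sketched as follows. This assumes a Pearson correlation computed on held-out respondents, which the abstract does not specify; the function name and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation between observed and predicted scores."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if ss_obs == 0 or ss_pred == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_obs * ss_pred)


# Toy value scores for four held-out respondents (illustrative only).
r = pearson_r(observed=[1.0, 2.0, 3.0, 4.0], predicted=[2.0, 1.0, 4.0, 3.0])
```

Computing the correlation on respondents not used for model fitting matters here, because flexible models with many subscription or text features can fit the training data closely without generalising.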

Methodological Aspects of Measuring Stability and Change in Personal Culture: Testing the Models of Settled Dispositions and Active Updating Using a Randomized Experiment

Dr Henning Silber (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Professor Bella Struminskaya (Utrecht University)
Dr Matthias Sand (GESIS - Leibniz Institute for the Social Sciences)
Professor Michael Bosnjak (Leibniz Institute for Psychology (ZPID))
Ms Joanna Koßmann (Leibniz Institute for Psychology (ZPID))
Mr Fabienne Kraemer (GESIS - Leibniz Institute for the Social Sciences)
Dr Bernd Weiß (GESIS - Leibniz Institute for the Social Sciences)

One of the main advantages of panel data analysis is that it allows monitoring within-person stability and change over time. In a 2020 article in the American Sociological Review, Kiley and Vaisey used the panel data components of the General Social Survey from 2006 to 2014 to test 183 attitudinal and behavioral items with respect to stability or change over the course of three panel waves (using responses from previous waves as predictors of the responses in the final wave). Their analysis, which included the Bayesian Information Criterion (BIC), showed a pattern consistent with stability rather than change over time. That is, respondents show settled dispositions rather than actively updating their attitudes and behaviors when asked the same questions repeatedly. Further analyses showed that most of the observed change could be attributed to short-term attitude change or measurement error. When respondents actively updated their attitudes and behaviors between panel waves, it was more likely among younger respondents, whose value systems are thought to be most susceptible to change over time.
The present research aims to replicate and extend the analyses done by Kiley and Vaisey (2020). To achieve this, we use data from a randomized experiment carried out in a non-probability panel. Specifically, we collected six waves of panel data in a German online access panel between 2020 and 2021, which allows us to examine repeated responses to more than 40 attitudinal and behavioral questions over time. In a first step, we will replicate the previous stability and change analyses to see if we find a similar overall pattern as well as a similar pattern in the respective age groups. In a second step, we will extend the analyses of Kiley and Vaisey (2020). Our experimental study design includes groups of respondents randomly assigned to receive the target attitudinal and behavioral items a varying number of times (between two and six). This design allows us to distinguish more systematically between measurement error and stability or change than Kiley and Vaisey’s design allowed, since theirs lacked random assignment to experimental or control groups. Specifically, we can test if prior exposure to the target attitudinal and behavioral items leads to deliberation and reflection processes, which in turn may contribute to the confirmation of the settled disposition model.