All time references are in CEST
Smart surveys: Measurement, data processing and data integration 1
|Session Organisers||Dr Peter Lugtig (Utrecht University), Professor Barry Schouten (Statistics Netherlands)|
|Time||Tuesday 18 July, 11:00 - 12:30|
It is well known that surveys have trouble measuring certain topics that are of great interest to social and behavioral scientists.
In recent years, several approaches have been proposed to extend or integrate surveys with innovative data collection methods that aim to solve some of the inherent shortcomings of self-reports. One approach is to start with a survey and then, within the survey, link or collect additional data. Smartphone apps and wearable devices in particular offer a promising way to collect data through cameras, microphones, or motion and location sensors that can be integrated within an app. Another approach is to integrate survey data with external sensory, factual or behavioral data after data collection. This can, for example, be done by linking self-reported survey data about income to governmental register data, or by asking respondents to donate data such as their Google search history or WhatsApp call history. Here, data are collected separately in surveys and in other ways, and are only compared and integrated during data analysis.
We are inviting abstracts for a session focusing on the measurement, data processing and data integration of surveys. Papers can focus on, but are not limited to, one or more of the following themes:
- Examples of measurement in smart surveys using smartphone apps, where data are integrated during data collection.
- Examples of measurement in smart surveys, where survey data and external data are integrated after data collection.
- Assessment of data quality using data from multiple sources (surveys and other data sources).
- Methods to integrate or fuse survey data with other data with the goal of improving measurement.
- The effects of data integration on timeliness, costs and/or precision of survey estimates.
- The role of the respondent.
Mr Ismael Yacoubou Djima (The World Bank) - Presenting Author
Mr Marco Tiberti (The World Bank)
Mr Talip Kilic (The World Bank)
Plot-level crop yields from agricultural surveys remain a key variable in empirical analyses of the economic life of smallholder farmers. Yet crop-cut yields, an objective measure, are often missing, whether partially by design or entirely, in large-scale national surveys because of the cost of implementation. Faced with this gap in objectively measured yield data, researchers rely on subjective and potentially biased farmer-reported yields, or impute the missing crop-cut data. However, there have been few validation exercises of the imputation approach. Using data from a nationally representative survey in Mali, which collected crop-cut data alongside easily obtainable farmer-reported harvests, this paper conducts a validation exercise of the imputation approach. We take advantage of the availability of two consecutive rounds of the survey to conduct within-survey and survey-to-survey imputation exercises following a multiple imputation approach based on a machine learning predictive model. We analyze the results for several crops, which allows us to draw more general lessons on the conditions under which the approach is valid and on the importance of different predictors, including self-reported yields. Our findings are threefold: (i) farmer-reported yields are good predictors of crop-cut yields, but we note a greater predictive power of integrated geospatial variables; (ii) on average, the imputation exercises work better for crops that are more commercialized, which may be related to the accuracy in standard units of farmer-reported yields; (iii) the imputation approach provides accurate results in the within-survey framework, but less so in the survey-to-survey framework, especially when statistics are computed at disaggregated levels. These results suggest potentially important cost savings in survey operations, but also highlight the stringent survey data requirements of the method.
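A within-survey multiple-imputation workflow of the kind the abstract describes can be sketched as follows. This is a minimal illustration on synthetic data, assuming a linear relationship between crop-cut yields, farmer-reported yields, and a geospatial covariate; the variable names, the 40% missingness rate, and the use of scikit-learn's `IterativeImputer` are all assumptions for illustration, not the authors' implementation.

```python
# Sketch: multiple imputation of missing crop-cut yields from
# farmer-reported yields and a geospatial predictor (synthetic data).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic plot-level data.
n = 500
farmer_reported = rng.gamma(2.0, 400.0, n)            # kg/ha, self-reported
ndvi = rng.uniform(0.2, 0.8, n)                       # geospatial proxy
crop_cut = 0.8 * farmer_reported + 600 * ndvi + rng.normal(0, 100, n)
crop_cut[rng.random(n) < 0.4] = np.nan                # 40% missing by design

X = np.column_stack([farmer_reported, ndvi, crop_cut])

# M stochastic imputations; averaging the per-imputation estimates
# mimics Rubin's rules for a pooled point estimate.
M = 5
estimates = []
for m in range(M):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imp.fit_transform(X)
    estimates.append(completed[:, 2].mean())

print(f"pooled mean crop-cut yield: {np.mean(estimates):.1f} kg/ha")
```

In a real application the linear imputer would be replaced by the machine-learning predictive model the paper uses, and Rubin's rules would also be applied to pool variances across the M completed datasets.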
Mr Oriol J. Bosch (The London School of Economics) - Presenting Author
Dr Melanie Revilla (IBEI)
Professor Patrick Sturgis (The London School of Economics)
Professor Jouni Kuha (The London School of Economics)
Measuring what people do online is crucial across all areas of social science research. Although self-reports are still the main instrument used to measure online behaviours, there is evidence to doubt their validity. Consequently, researchers are increasingly relying on digital trace data to measure online phenomena, assuming that they will lead to higher-quality statistics. Recent evidence, nonetheless, suggests that digital trace data are also affected by measurement errors, calling their gold-standard status into question. Therefore, it is essential to understand the size of the measurement errors in digital trace data, and when it might be best to use each data source.
To this aim, we adapt the Generalised MultiTrait-MultiMethod (GMTMM) model created by Oberski et al. (2017) to simultaneously estimate the measurement errors in survey and digital trace data. The GMTMM allows both survey and digital trace data to contain random and systematic measurement errors, while accommodating the specific characteristics of digital trace data (i.e., zero-inflation).
To simultaneously assess the measurement quality of both sources of data, we use survey and digital trace data linked at the individual level (N = 1,200), collected using a metered online opt-in panel in Spain. Using these data, we fitted three separate GMTMM models focusing on the measurement quality of survey and digital trace data for three different types of online behaviours: news media exposure, online communication, and entertainment. Specifically, for each type of behaviour, we measured three simple concepts (e.g., time spent reading articles about politics and current affairs) with both survey self-reports and digital traces. For each simple concept, we present the reliability and method effects of each data source.
Results provide needed evidence about the size of digital trace data errors, as well as when the use of self-reports might be justified.
Dr Bence Ságvári (Center for Social Sciences, Corvinus University of Budapest) - Presenting Author
Dr Bence Kollányi (Center for Social Sciences)
The study investigates the mobile device use and online behaviour of 8- to 15-year-old children in Hungary using a complex mixed-methods design. A unique feature of the study is that it combines automated software-based data collection, used to assess app usage patterns, with survey data collected from participating children and their parents. The study, conducted in 2022, involved 100 households with school-age children from all over Hungary.
The questionnaire for parents included questions about the parents' educational and professional background, digital literacy and attitudes towards technology. The survey also included questions about children's internet and mobile phone use, including social media use, and questions about parental control and attitudes. The questionnaire for children contained questions on device use, digital literacy, digital education and social media use. Both questionnaires contained a number of questions whose answers could be compared with the data collected from the smartphones and tablets of the participating children.
The app recorded the number of screen views, the names of the apps installed on each device and the exact time each app was used. Personal data and other information from the installed apps was not collected. The duration of data collection averaged 30 days per device. We collected 943,892 data points for 1,421 different apps from 75 devices. (No data was collected from 25 devices due to technical and other issues.)
The study contributes to the literature on digital trace collection in two ways. First, it describes a method for collecting data on mobile app use and discusses the methodological challenges of software-based data collection. Second, by combining and comparing automatically collected device usage data with survey data, it assesses both the reliability of self-reported usage behaviour and the usage patterns that emerge from the automatically collected data.
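The comparison described above, validating self-reported usage against app-logged usage, can be sketched in a few lines. This is an illustrative example on synthetic data: the under-reporting factor, noise level, and the use of a simple correlation and mean gap as quality indicators are assumptions, not the study's actual analysis.

```python
# Sketch: comparing self-reported daily screen time with app-logged
# usage minutes (synthetic data, not the study's pipeline).
import numpy as np

rng = np.random.default_rng(1)

n = 75  # devices with usable log data, as in the study
logged_minutes = rng.gamma(3.0, 60.0, n)              # app-log measure
# Self-reports: assume systematic under-reporting plus random noise.
self_report = 0.7 * logged_minutes + rng.normal(0, 40, n)

# Pearson correlation as a simple reliability indicator, and the
# mean gap as an indicator of reporting bias.
r = np.corrcoef(logged_minutes, self_report)[0, 1]
bias = (self_report - logged_minutes).mean()
print(f"correlation: {r:.2f}, mean reporting gap: {bias:.0f} min/day")
```

A negative mean gap here would indicate under-reporting relative to the logs, which is the kind of discrepancy such a combined design is able to detect.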
Ms Camilla Salvatore (University of Amsterdam) - Presenting Author
Ms Silvia Biffignandi (CESS)
Ms Annamaria Bianchi (University of Bergamo)
Probability sample surveys have been considered the gold standard for inference for many years, but they are facing difficulties, related mainly to declining response rates and the associated increasing costs.
At the same time, an acceleration of technological advances has occurred, with the use of mobile phones and online social networks, specifically social media (SM), leading to the availability of vast amounts of new data. This is coupled with the development of new tools by computational social scientists to collect, process, and analyse digital trace data.
This article provides an overview of the roles of social media in survey research (as a substitute, as a supplement, and to improve survey estimates) and in the production of smart statistics. We then introduce a general modular framework for producing smart statistics that takes advantage of the two data sources. This modular framework can be used and adapted by researchers in different contexts. We demonstrate its applicability through a case study. Finally, we highlight important questions for future research.
Dr Nina Berg (Statistics Norway (SSB))
Dr Gezim Seferi (Data Collection at Statistics Norway (SSB)) - Presenting Author
Statistics Norway would like to share our first experiences with the use of smart surveys in official statistics.
The data collection for the Household Budget Survey 2022 has just been completed, and the Time Use Survey recently finished a follow-up study of the first participants after the field start in Q4 2022. Both surveys use a PWA (Progressive Web App) for data collection, with “smart” interactions that are new features in questionnaire design in official statistics. The Household Budget Survey also uses optical character recognition (scanning), which is based on sensor technology in digital devices such as mobile phones and is often referred to as “smart” technology. We would like to share our experiences from these two surveys and describe how participants use and interact with “smart” devices and “smart” features, how they assess the plausibility of the data they have registered, and how they perceive the response burden.
From both surveys we have gathered insights on topics such as participation, privacy, and the use and quality of “smart” and “non-smart” features, which we will share. We have also gained an understanding of different subgroups that are either hard to recruit or struggle with a digital format. Our data are based on qualitative observations from numerous user tests over the last couple of years, as well as some quantitative figures on sample bias and the use of different devices and features.
We hope that by sharing what we have learned about the user experience, we can raise awareness of the importance of the respondent’s perspective in achieving good quality in official statistics.