ESRA 2023 Glance Program

All time references are in CEST

Evaluating Quality of Alternative Data Sources to Surveys 2

Session Organiser Dr Paul Beatty (U.S. Census Bureau)
Time Wednesday 19 July, 11:00 - 12:30
Room U6-28

Alternative data sources (including but not limited to administrative records, passively collected data, and social media) are increasingly seen as potential complements to, and in some cases replacements for, self-reported survey data. These alternatives are especially appealing given rising costs and declining response rates of traditional surveys. Incorporating them into datasets may improve completeness and timeliness of data production, reduce response burden for respondents, and in some cases provide more accurate measurements than self-reports.

Nevertheless, the sources and extent of measurement errors within these alternative data sources, as well as reasons for missingness, are not always well understood. This can create challenges when combining survey and alternative data, or when substituting one for the other. For example, apparent changes over time could be artifacts of record system characteristics rather than substantive shifts, and reasons for exclusion from administrative records may differ from reasons to decline survey participation. A clear understanding of the various errors and their consequences is important for ensuring validity of measurement.

This session welcomes papers on methodology for evaluating quality of alternative data sources, either qualitative or quantitative; that explore the causes, extent, and consequences of errors in alternative data sources; or that describe case studies illustrating challenges or successes in producing valid data sets that combine survey and alternative data sources.

Keywords: alternative data sources, administrative records, data quality


Measuring Long COVID in Electronic Health Records: The Truth is Out There in Organic Data

Dr Minjung Han (Seoul National University)
Professor Asaph Young Chun (Seoul National University) - Presenting Author
Miss Seyoung Kim (Seoul National University)
Dr Su Jin Kang (Seoul National University)

Electronic Health Records (EHR) are data collected primarily for the administrative purpose of patient monitoring by hospitals and clinics. The utility of EHR as administrative data has been widely discussed for decades to supplement complex health surveys and censuses across the Atlantic (Chun, Larsen, Durrant, Reiter, 2021). The quality of organic data based on EHR is central to making decisions regarding their use over the survey life cycle, from conceptualization to data collection to model-based analysis.

This paper presents a framework for evaluating the quality of organic data from EHR in South Korea. We use organic data from the Korean National Health Insurance Services (NHIS) and the Health Insurance Review and Assessment Service (HIRA) to define, classify, and assess Long COVID, namely the post-acute effects of COVID-19 infection that include physical and mental symptoms, particularly among children and adolescents. Launched in 2000 by the Korean government, the NHIS has been in place to reduce disease burden and improve coverage for physical and mental health of Koreans. The NHIS provides comprehensive socio-demographic information, clinical diagnosis, prescription medication and history of the health records with 97% coverage of the Korean population. We discuss the practical implications of using organic data from EHR to advance survey methodology and build evidence in health policymaking in Korea.

Predicting COVID-19 hospitalizations from survey, sensor, and sewage data: an international comparison

Dr Jonas Klingwort (Statistics Netherlands) - Presenting Author
Dr Joep Burger (Statistics Netherlands)
Professor Jan van den Brakel (Statistics Netherlands)

Throughout the COVID-19 pandemic, a major objective was to keep the number of hospitalizations low to prevent the healthcare infrastructure from collapsing. Corona Dashboards were filled with several indicators that were hypothesized to predict hospitalization. In this paper, the relation between the weekly number of COVID-19 hospitalizations and indicators from surveys, fixed and mobile sensors, and wastewater treatment plants (WTPs) is modeled for the Netherlands and Germany (2020-2022). The surveys provide data on reported behavior and opinions and are based on both probability and non-probability panels. The fixed sensors record pedestrian flows within metropolitan areas. The mobile sensors record population mobility with mobile phones (Google Mobility Reports). Both sensor systems provide non-probability data. All WTPs provide data on viral load (census).

Using structural time series modeling, the effect of the survey, sensor, and sewage data on weekly COVID-19 hospitalization frequency is estimated using a dynamic regression model. In this way, the regression coefficients that describe the relationship between the data sources and hospitalization frequency are time-dependent and describe how these relations evolve during the COVID-19 pandemic. The models are fitted using the Kalman filter, after expressing them in state space form.
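The abstract does not give the model specification, so the following is only a minimal sketch of the general idea: a regression whose coefficients follow a random walk, written in state space form and filtered with the Kalman filter. The function name `kalman_dynamic_regression`, the random-walk assumption for the coefficients, and the treatment of the variances `obs_var` and `state_var` as known are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def kalman_dynamic_regression(y, X, obs_var, state_var, init_var=1e6):
    """Filter time-varying coefficients beta_t (hypothetical helper).

    Assumed model (not taken from the abstract):
        y_t    = x_t' beta_t + eps_t,   eps_t ~ N(0, obs_var)
        beta_t = beta_{t-1} + eta_t,    eta_t ~ N(0, state_var * I)
    Returns an (n, k) array with the filtered coefficient means.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape

    beta = np.zeros(k)            # state mean
    P = np.eye(k) * init_var      # large initial variance: vague prior
    Q = np.eye(k) * state_var     # random-walk innovation covariance
    filtered = np.empty((n, k))

    for t in range(n):
        x = X[t]
        # Prediction step: a random walk keeps the mean, inflates the variance.
        P = P + Q
        # Update step with the observation y_t.
        v = y[t] - x @ beta       # one-step-ahead prediction error
        F = x @ P @ x + obs_var   # variance of the prediction error
        K = P @ x / F             # Kalman gain
        beta = beta + K * v
        P = P - np.outer(K, x @ P)
        filtered[t] = beta

    return filtered
```

With `state_var=0` the coefficients are constant and the filter reduces to recursive least squares; a positive `state_var` lets the estimated relation between an indicator and the hospitalization series drift over time. In practice the variances would be estimated (for example by maximum likelihood) rather than fixed in advance.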

Two independent models will be presented, reporting results for the Netherlands and Germany. In particular, effects will be compared between a) data sources (surveys, sensors, WTPs), b) probability and non-probability sample surveys, c) surveys and sensors, d) fixed and mobile sensors, and e) countries.

We provide evidence on which indicators might predict hospitalizations during a future pandemic or in other scenarios in which low hospitalization frequency is essential. This will shed light on the relevance of surveys compared with other data sources. We therefore consider these results of high practical importance for survey practitioners and policymakers.

Understanding Measurement Error in Crime Data Measured at Multiple Scales: Applying a Novel Meta Multi-Trait Multi-Method Model to Police Recorded Crime and Survey Data

Dr Alexandru Cernat (University of Manchester) - Presenting Author
Professor Ian Brunton-Smith (Surrey University)
Dr Jose Pina Sanchez (University of Leeds)
Dr David Buil Gil (University of Manchester)

Getting an accurate picture of the true extent of crime is an essential task, as crime counts are used to determine the costs of crime, which are in turn used to allocate resources to the police and public services, performance manage the police, and evaluate crime reduction initiatives. In the UK, two sources of crime data form the basis for these counts: police recorded crime figures and victim data from the annual Crime Survey for England and Wales. But the veracity of these data sources has been repeatedly questioned. In this paper we build on recent developments in the study of measurement error to better understand the flaws in the measurement of crime. We use a modified Multi-Trait Multi-Method (MTMM) model to augment police recorded crime data with individual-level survey data and estimate the ‘true’ extent of crime. Our modified model deals appropriately with the additional sampling errors inherent in the survey data but not in police records, allowing us to more accurately quantify the extent and form of measurement error affecting each crime source. We find that police recorded crime underestimates true crime measures but is more reliable than surveys. Personal crimes are least accurately measured, although this may be changing over time.

The use of register data in the provision of high-quality survey data – The use case of the Austrian Eurograduate survey

Mr Franz Astleithner (Statistics Austria)
Dr Lena Seewann (Statistics Austria) - Presenting Author

The Eurograduate survey 2022 of higher education graduates carried out by Statistics Austria is an inspiring example of how to fruitfully combine register with survey data: 1) The sample was drawn from the official database of educational pathways at Statistics Austria, which comprises all students and graduates for every year. 2) Contact data of these graduates, derived from the higher education institutions (or overarching service institutions), was validated with data from the Central Population Register. 3) We can make use of all pseudonymized information of the registers hosted at Statistics Austria at individual level for purposes of weighting. In addition to the database of educational pathways, we exploit data about employment status for the weighting. 4) Finally, the processed and pseudonymized data will be hosted at the Austrian Micro Data Centre (AMDC) at Statistics Austria, where research institutions can link the survey data at pseudonymized individual level to all other data sources that include the relevant pseudonymized personal identifier. A well-designed data-privacy strategy allows detailed analysis but guarantees that no identification of persons is possible.
With this close interplay of registers and survey data, we were able to achieve a return rate above 55 per cent. Furthermore, we have well-calibrated data due to a precise weighting database and the potential for analyses far beyond the scope of the survey.
In our presentation, we give an overview of the whole process of data generation, share insights on the potential challenges of linking various data sources, and provide analysis of deviations between survey and register data. Overall, our example shows the potential that National Statistical Institutes have in the provision of comprehensive high-quality databases for scientific purposes due to the combination of register and survey data while meeting the highest standards in data protection.