All time references are in CEST
Evaluating Quality of Alternative Data Sources to Surveys 1
Session Organiser: Dr Paul Beatty (U.S. Census Bureau)
Time: Wednesday 19 July, 09:00 - 10:30
Alternative data sources (including but not limited to administrative records, passively collected data, and social media) are increasingly seen as potential complements to, and in some cases replacements for, self-reported survey data. These alternatives are especially appealing given rising costs and declining response rates of traditional surveys. Incorporating them into datasets may improve completeness and timeliness of data production, reduce response burden for respondents, and in some cases provide more accurate measurements than self-reports.
Nevertheless, the sources and extent of measurement errors within these alternative data sources, as well as reasons for missingness, are not always well understood. This can create challenges when combining survey and alternative data, or when substituting one for the other. For example, apparent changes over time could be artifacts of record system characteristics rather than substantive shifts, and reasons for exclusion from administrative records may differ from reasons to decline survey participation. A clear understanding of the various errors and their consequences is important for ensuring validity of measurement.
This session welcomes papers on methodology for evaluating quality of alternative data sources, either qualitative or quantitative; that explore the causes, extent, and consequences of errors in alternative data sources; or that describe case studies illustrating challenges or successes in producing valid data sets that combine survey and alternative data sources.
Keywords: alternative data sources, administrative records, data quality
Dr Michael Ochsner (FORS) - Presenting Author
The rise of social media and advances in high-dimensional analysis have enabled researchers to use big data to analyse social phenomena. Posts and tweets can be analysed with different methods to study not only networks and communication structures but also, using text analysis tools such as sentiment analysis, topics that fall broadly under public opinion. Twitter is used to predict voting outcomes, gender attitudes, the spread of news and, especially in the context of the Covid-19 pandemic, vaccine uptake, the spread of conspiracy theories, and much more. Twitter is also increasingly used as a source by journalists, and even to evaluate the societal impact of research or journal articles.
However, while social media data and the associated techniques offer many opportunities and make for an interesting field of study, whenever such data are used to describe public opinion a question becomes relevant: who uses Twitter, who posts on Twitter, and how do these groups relate to the general population? Surprisingly few studies have addressed this question. The existing studies use online panel data to investigate how widespread use is, and provide only limited data to describe the users, such as age and gender.
In this presentation, I tackle the question of how well Twitter and other social media users represent the general population. Using data from MOSAiCH2021, a national register-based high-quality web survey in Switzerland including the ISSP and Swiss-specific questions, I use an encompassing framework for the analysis of representation bias to test whether Twitter and other social media users differ from the general population regarding several health-related and political topics. The analysis of potential representation bias focuses on real-life research questions, taking advantage of the ability to distinguish between Twitter users and non-users.
Mr Luke Taylor (Kantar Public) - Presenting Author
Administrative data can be very valuable sources of information for researchers. However, it is important to be aware of potential limitations.
This paper explores the different data quality issues that can affect administrative data and compares them with the quality of data that can be generated from survey samples. Energy Performance Certificate (EPC) data for domestic buildings in England and Wales is used to illustrate potential problems that may affect all administrative data and to discuss possible mitigations.
As part of the EPC scheme, professional assessors visit properties that are being constructed, sold, or let to conduct an energy efficiency assessment. They record information about the property, the heating system(s) used, the current energy efficiency, and recommendations for improvements. The data is open access to researchers (https://epc.opendatacommunities.org/).
In evaluating quality, I focus on three potential sources of bias.
Coverage: what risk of bias is there if the administrative data does not cover the whole population? How do we quantify this risk and are there any mitigations that we can take? EPC data is evaluated against robust external benchmarks to illustrate how these research questions can be addressed.
Accuracy: How accurate is the information available from administrative databases? To explore this issue, EPC data has been linked to survey data to analyse the consistency of information on common variables (property type and mains gas connection). Where there are inconsistencies, open access data has been used to further assess the accuracy of each source.
Timeliness: How timely is the data and what risk does this pose to inferences? Does this vary between variables? The issue date of EPC certificates is used to quantify how up to date the information is likely to be. I explore whether limiting analysis to recent records improves accuracy, and consider the impact on coverage.
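The trade-off between timeliness and coverage described above can be illustrated with a minimal Python sketch. It is not the author's method: the records, the benchmark population size, the reference date, and the function name `coverage_by_max_age` are all hypothetical, chosen only to show how tightening a recency cut-off reduces the share of the population covered.

```python
from datetime import date

# Hypothetical records: (property id, certificate issue date). Not real EPC data.
records = [
    ("A", date(2012, 5, 1)),
    ("B", date(2016, 3, 15)),
    ("C", date(2019, 8, 30)),
    ("D", date(2021, 1, 10)),
    ("E", date(2022, 11, 5)),
]
POPULATION_SIZE = 10               # assumed external benchmark count of properties
REFERENCE_DATE = date(2023, 1, 1)  # assumed date of analysis


def coverage_by_max_age(records, population_size, reference_date, max_age_years):
    """Keep certificates no older than max_age_years at reference_date.

    Returns (number of records kept, share of the benchmark population covered).
    """
    cutoff_days = max_age_years * 365.25
    kept = [
        pid for pid, issued in records
        if (reference_date - issued).days <= cutoff_days
    ]
    return len(kept), len(kept) / population_size


# Stricter recency limits keep fewer records and so cover less of the population.
for max_age in (2, 5, 10, 15):
    n, cov = coverage_by_max_age(records, POPULATION_SIZE, REFERENCE_DATE, max_age)
    print(f"max age {max_age:>2} years: {n} records, coverage {cov:.0%}")
```

In a real evaluation the accuracy of the retained records would be assessed alongside coverage, since the two may move in opposite directions.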
Miss Deji Suolang (University of Michigan) - Presenting Author
Inflation expectations and consumer confidence are at the heart of the economy, and timely knowledge of inflation expectation dynamics is paramount for monetary policy. This study explores whether Twitter data can supplement, and provide preliminary estimates for, more costly and time-consuming nationally representative surveys. I identified relevant keywords and scraped around 1.4 million tweets generated between November 2019 and November 2022. The tweets cover a three-year time frame starting a few months before the COVID-19 pandemic. Data mining techniques such as frequent N-grams, sentiment analysis, time series analysis, and regression models are employed to generate a set of daily Twitter-based indexes for consumers' sentiments, price dynamics, and inflation expectations. The estimates from Twitter data are then compared to the analogous monthly index in the Survey of Consumers, a nationally representative survey conducted by the Survey Research Center at the University of Michigan. The results indicate that Twitter data provide direct measures of consumers' economic confidence. Similar trends and directions of change are reflected in the survey self-reports, but the magnitude of the changes is different. The Twitter-based price dynamics index echoes the monthly survey index. Finally, the Twitter-based indexes are combined in a polynomial regression model estimating changes in inflation expectations in the reference surveys, and predictions are made for the next year based on the same model. The results suggest that Twitter-based estimates can potentially be a much cheaper and timelier data source. The study concludes with recommendations for future research and practice for similar studies evaluating the quality of alternative data sources to surveys.
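One way to picture the comparison between a daily Twitter-based index and a monthly survey index is sketched below in Python. This is an illustration under stated assumptions, not the study's procedure: all values are invented, the helper names (`monthly_means`, `changes`, `direction_agreement`) are hypothetical, and agreement in the direction of month-over-month change is just one simple summary that ignores magnitude.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily index values keyed by ISO date; not the study's data.
daily_index = {
    "2020-01-05": 0.10, "2020-01-20": 0.30,
    "2020-02-03": 0.00, "2020-02-18": -0.20,
    "2020-03-02": -0.50, "2020-03-25": -0.70,
    "2020-04-10": -0.30, "2020-04-22": -0.10,
}
# Hypothetical monthly survey index, measured on its own scale.
survey_index = {"2020-01": 99.0, "2020-02": 101.0, "2020-03": 89.0, "2020-04": 92.0}


def monthly_means(daily):
    """Average daily values within each calendar month (key 'YYYY-MM')."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[day[:7]].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def changes(series):
    """Month-over-month differences, keyed by the later month."""
    months = sorted(series)
    return {m: series[m] - series[prev] for prev, m in zip(months, months[1:])}


def direction_agreement(a, b):
    """Share of shared months in which both series rise, or both do not rise."""
    shared = sorted(set(a) & set(b))
    same = [m for m in shared if (a[m] > 0) == (b[m] > 0)]
    return len(same) / len(shared)


twitter_changes = changes(monthly_means(daily_index))
survey_changes = changes(survey_index)
print(direction_agreement(twitter_changes, survey_changes))
```

Because the two series are on different scales, a direction-based summary like this can show agreement in trend even when the magnitudes of change differ, which is the pattern the abstract reports.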
Dr Richard Silverwood (Centre for Longitudinal Studies, UCL Social Research Institute, University College London) - Presenting Author
Dr Nasir Rajah (Centre for Longitudinal Studies, UCL Social Research Institute, University College London)
Professor Lisa Calderwood (Centre for Longitudinal Studies, UCL Social Research Institute, University College London)
Professor Bianca De Stavola (Population, Policy & Practice Department, UCL Great Ormond Street Institute of Child Health, University College London)
Professor Katie Harron (Population, Policy & Practice Department, UCL Great Ormond Street Institute of Child Health, University College London)
Professor George Ploubidis (Centre for Longitudinal Studies, UCL Social Research Institute, University College London)
Recent years have seen an increase in linkages between survey and administrative data. It is important to evaluate the quality of such data linkages to discern the likely reliability of ensuing research. Evaluation of linkage quality and bias can be conducted using different approaches, but many of these are not possible when there is a separation of processes for linkage and analysis to help preserve privacy, as is typically the case in the UK (and elsewhere). In this paper we describe a suite of generalisable methods to evaluate linkage quality and target population representativeness of linked survey and administrative data which remain tractable when users of the linked data are not party to the linkage process itself. We emphasise issues particular to longitudinal survey data throughout. Our proposed approaches cover several areas: i) Linkage rates, ii) Selection into response, linkage consent and successful linkage, iii) Linkage quality, and iv) Linked data target population representativeness. We illustrate these methods using a recent linkage between the 1958 National Child Development Study (NCDS) and Hospital Episode Statistics (HES) databases. NCDS is a cohort following the lives of an initial 17,415 people born in Great Britain in a single week of 1958, while HES contains important information regarding admissions, accident and emergency attendances and outpatient appointments at NHS hospitals in England. Our findings suggest that the linkage quality of the NCDS-HES data is high and that the linked sample maintains an excellent level of target population representativeness with respect to the single dimension we were able to assess. Through this work we hope to encourage providers and users of linked data resources to undertake and publish thorough evaluations. We further hope that providing detailed illustrative analyses using linked NCDS-HES data will improve the quality and transparency of research using this resource.
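As a rough illustration of the first two areas listed above (linkage rates, and selection into response, consent and successful linkage), the following Python sketch computes stage-by-stage and cumulative rates for a linkage funnel. The counts and the function name `funnel_rates` are hypothetical and are not NCDS-HES figures; the paper's own methods go well beyond this simple tabulation.

```python
# Hypothetical stage counts for a linkage funnel; not NCDS-HES figures.
stages = [
    ("target sample", 1000),
    ("responded", 800),
    ("consented to linkage", 600),
    ("successfully linked", 540),
]


def funnel_rates(stages):
    """For each stage after the first, return a tuple of
    (stage name, rate relative to the previous stage, rate relative to the first stage)."""
    base = stages[0][1]
    rows = []
    for (_, prev_n), (name, n) in zip(stages, stages[1:]):
        rows.append((name, n / prev_n, n / base))
    return rows


for name, conditional, cumulative in funnel_rates(stages):
    print(f"{name:<22} conditional {conditional:.0%}  cumulative {cumulative:.0%}")
```

Separating conditional from cumulative rates shows at which stage most of the loss occurs, which is where selection is most likely to affect representativeness.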
Mr Mao Li (PhD student) - Presenting Author
From the start of data collection for the 2020 US Census, official and celebrity users tweeted about the importance of everyone being counted in the Census and urged followers to complete the questionnaire. At the same time, social media posts expressing skepticism about the Census became increasingly common. This study distinguishes between different prototypical Twitter user groups and investigates their possible impact on the online self-completion rate for the 2020 Census, according to Census Bureau data. Using a network analysis method, Community Detection, and a clustering algorithm, Latent Dirichlet Allocation (LDA), three prototypical user groups were identified: "Official Government Agency," "Census Advocate," and "Census Skeptic." The prototypical Census Skeptic user was motivated by events (e.g., "Republicans in Congress signal Census cannot take extra time to count") about which an influential person had tweeted. This group became the largest one over the study period. The prototypical Census Advocate was motivated more by official tweets and was more active than the prototypical Census Skeptic. The Official Government Agency user group was the smallest of the three, but their messages, primarily promoting completion of the Census, seemed to have been amplified by Census Advocates, especially celebrities and politicians. We found that the daily size of the Census Advocate user group, but not of the other two, predicted the 2020 Census online self-completion rate within five days after a tweet was posted. This finding suggests that the Census social media campaign was successful in promoting completion, apparently due to the help of Census Advocate users who encouraged people to fill out the Census and amplified official tweets. This finding demonstrates that a social media campaign can positively affect public behavior regarding an essential national project like the Decennial Census.
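The idea that a group's daily size predicts completion some days later can be pictured with a simple lagged correlation, sketched below in Python. The abstract does not specify the model used, so this is only an assumed stand-in: both series are invented (the outcome is constructed to follow the predictor with a two-day delay), and the function names `pearson` and `lagged_correlation` are hypothetical.

```python
from math import sqrt

# Hypothetical daily size of one user group; not the study's data.
group_size = [10, 12, 15, 11, 9, 14, 18, 16, 13, 12]
# Hypothetical daily completions, built to track group_size two days earlier.
completions = [20, 22, 21, 25, 31, 23, 19, 29, 37, 33]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def lagged_correlation(predictor, outcome, lag):
    """Correlate the predictor on day t with the outcome on day t + lag."""
    if lag == 0:
        return pearson(predictor, outcome)
    return pearson(predictor[:-lag], outcome[lag:])


# Scan lags of zero to five days and report the strongest association.
by_lag = {lag: lagged_correlation(group_size, completions, lag) for lag in range(6)}
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, round(by_lag[best_lag], 3))
```

A lagged correlation of this kind indicates timing of association only; it does not by itself establish that the tweets caused the change in completion.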