ESRA 2019 Draft Programme at a Glance
Linked administrative data and applications for evidence building 2
Session Organisers: Dr Asaph Young Chun (U.S. Census Bureau)
Dr Manfred Antoni (Institute for Employment Research (IAB), Germany)
Time: Thursday 18th July, 14:00 - 15:30
Survey data linked to multiple data sources, such as administrative records and big data, are increasingly at the heart of evidence-based policy making across continents. We call such multiply linked data "pandata" (Chun and Scheuren, 2011). This session presents papers that demonstrate how multiple sources of data linked together are instrumental to evidence-based policy making. In the vein of the papers published in the Wiley book "Administrative Records for Survey Methodology" (Chun, Larsen, Durant, and Reiter, forthcoming 2019), this session will discuss linked administrative data papers that address the following topics of substantive applications or methodological research:
- Papers demonstrating the use of administrative records linked to survey data in developing or evaluating public policy. For example, how have administrative data linked to survey data informed the policy-making process to bring about social and economic benefits that could not have been studied by relying on traditional survey data alone?
- Substantive census applications where administrative data are linked and transformed into information that is useful and relevant to policy making.
- Papers utilizing dynamic visualization of data linked from multiple sources to communicate public policy issues.
- Papers involving experimental design, such as randomized controlled trials, to advance evidence-based policy making with case studies in economics, education, and public health.
- Recent methodological advancements in linking administrative data with survey data (one-to-one) or with multiple sources of data (one-to-many).
- Papers applying Bayesian approaches to using linked administrative data in order to produce information that is useful and relevant to key sectors of health, economy and education.
Keywords: administrative records, evidence-based policymaking, linked data, multiple data sources
Privacy-preserving Linkage of Encrypted Survey and Administrative Numerical Data using Distance-preserving Bloom Filters
Professor Rainer Schnell (University of Duisburg-Essen) - Presenting Author
Mr Philip Höcker (University of Duisburg-Essen)
Mr Christian Borgs (University of Duisburg-Essen)
Linking data on natural persons under the jurisdiction of the European GDPR usually requires encrypting all identifying information.
An example is linking survey data to administrative records of different organizations. Encrypting identifiers while preserving error-tolerant data linkage requires using Privacy-preserving Record Linkage (PPRL) techniques.
A widely used similarity-preserving encryption in PPRL is the Bloom Filter approach.
Currently, numerical identifiers in Bloom Filters can only be compared by string similarity measures. However, these are neither distance- nor order-preserving. Recently, Christen et al. (2016) proposed an extension to the method. This approach allows for numerical sorting and calculating distances on encrypted data. The idea is to map a generated interval centered at the numerical value of interest into a Bloom Filter. The amount of overlap between two intervals, which serves as a proxy for the numerical distance, is estimated by the intersection of the Bloom Filters.
We test this technique using several data sets. Distance-preserving properties and linkage quality results will be reported. The new method allows the computation of distances and order but is highly dependent on correct parameter choices for the intervals. Some best-practice recommendations will be given.
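The interval idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the filter length `m`, the number of hash functions `k`, the interval half-width, the demo key, and the use of the Dice coefficient as the overlap measure are all assumptions chosen for illustration. Each filter is represented simply as the set of bit positions that are set.

```python
import hashlib
import hmac


def encode_numeric(value, half_width=5, step=1, m=4096, k=10, key=b"demo-key"):
    """Hash every value in the interval around `value` into one Bloom filter.

    The filter is returned as the set of bit positions set to 1.
    All parameter defaults are illustrative assumptions.
    """
    bits = set()
    for offset in range(-half_width, half_width + 1):
        token = str(value + offset * step).encode()
        for i in range(k):
            # Keyed hash; prefixing the index i gives k different hash functions.
            digest = hmac.new(key, bytes([i]) + b"|" + token, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of set bits."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Closer values share more interval members and therefore more set bits.
same = dice(encode_numeric(50), encode_numeric(50))
near = dice(encode_numeric(50), encode_numeric(52))
far = dice(encode_numeric(50), encode_numeric(58))
apart = dice(encode_numeric(50), encode_numeric(200))
print(same, near, far, apart)
```

In this sketch, values further apart than twice the half-width share no interval members, so their similarity reflects only chance hash collisions. This is one way to see why the choice of interval parameters matters so much.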
“What’s My Wage Again?” – Comparing Survey and Administrative Data to Validate Earning Measures
Mrs Britta Gauly (GESIS) - Presenting Author
Mrs Jessica Daikeler (GESIS)
Dr Tobias Gummer (GESIS)
Professor Beatrice Rammstedt (GESIS)
Survey data remain an influential source for researchers in the social sciences. Thus, the quality of these data has been an important topic for survey methodologists for several years. One issue is measurement error in recorded values, which is particularly prevalent in sensitive questions such as income. From a substantive perspective, too, income measurement error is a crucial topic because it obscures true economic relationships (Bound and Krueger, 1989). Recent results by Drechsl-Grau et al. (2015) show that different economic relationships are estimated depending on the data source (survey vs. administrative data).
The present study aims at advancing our knowledge of measurement error in wage questions by exploiting the potential of data linkage. We use data from the German sample of the Programme for the International Assessment of Adult Competencies (PIAAC), which was linked with administrative data provided by the German Federal Employment Agency.
We treat administrative data as “true wages” and define measurement error as the difference between administrative and survey data. In a first step, we test the correlation between measurement error and true earnings. In a second step, we run separate regressions using survey and administrative wages as dependent variables and test whether the regression coefficients of variables that are widely used as controls in wage equations differ significantly.
Our results show that measurement error is significantly related to true wages and thus “non-classical”. In addition, we find statistically significant differences in the coefficients of our control variables when using either administrative or survey wages as the dependent variable. This suggests that findings from survey data can lead to biased results and misguided political interventions. Learning more about size and type of measurement error might help to correct for existing biases or to improve the quality of surveys, e.g. by using tailored wage question designs.
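The first validation step in this abstract, correlating measurement error with true wages, can be sketched on synthetic data. Everything below is an assumption made for illustration: the wages are simulated rather than taken from PIAAC, the mean-reverting misreporting pattern is invented, and the sign convention (survey minus administrative) is one of two possible choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=1000, seed=0):
    """Synthetic wages: survey reports are pulled towards the mean (assumed)."""
    rng = random.Random(seed)
    admin = [rng.gauss(3000, 800) for _ in range(n)]
    # Hypothetical misreporting: high earners under-report, low earners over-report.
    survey = [a - 0.2 * (a - 3000) + rng.gauss(0, 200) for a in admin]
    return admin, survey


admin, survey = simulate()
error = [s - a for s, a in zip(survey, admin)]  # assumed sign convention
r = pearson(error, admin)
print(round(r, 2))
```

Under classical measurement error the correlation would be close to zero. A clearly non-zero value, as in this simulated case, is what "non-classical" refers to.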
Optimal Probabilistic Record Linkage: Best Practice for Linking Employers in Survey and Administrative Data
Dr John Abowd (US Census Bureau and Cornell University)
Dr Joelle Abramowitz (University of Michigan)
Professor Margaret Levenstein (University of Michigan) - Presenting Author
Dr Kristin McCue (US Census Bureau)
Mr Dhiren Patki (University of Michigan)
Professor Trivellore Raghunathan (University of Michigan)
Ms Ann Rodgers (University of Michigan)
Professor Matthew Shapiro (University of Michigan)
Dr Nada Wasi (Bank of Thailand)
This paper illustrates an application of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across firms is highly asymmetric. To address these difficulties, this paper uses a supervised machine learning model to probabilistically link survey respondents in the Health and Retirement Study (HRS) with employers and establishments in the Census Business Register (BR) to create a new data source which we call the CenHRS. Multiple imputation is used to propagate uncertainty from the linkage step into subsequent analyses of the linked data. The linked data reveal new evidence that survey respondents’ misreporting and selective nonresponse about employer characteristics are systematically correlated with wages.
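One way to picture how multiple imputation can carry linkage uncertainty into an analysis is the sketch below. It is not the CenHRS procedure: the match probabilities, firm names and firm sizes are hypothetical stand-ins for the output of a linkage model, and combining estimates with Rubin's rules is an assumption about how the imputations might be pooled.

```python
import random
import statistics


def draw_links(candidates, rng):
    """Sample one employer per respondent from its match probabilities."""
    links = {}
    for person, options in candidates.items():
        employers = [e for e, _ in options]
        probs = [p for _, p in options]
        links[person] = rng.choices(employers, weights=probs, k=1)[0]
    return links


def rubin_combine(estimates, variances):
    """Pool point estimates and variances across imputations (Rubin's rules)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return q_bar, within + (1 + 1 / m) * between


# Hypothetical candidate employers with match probabilities per respondent.
candidates = {
    "r1": [("firmA", 0.9), ("firmB", 0.1)],
    "r2": [("firmB", 0.6), ("firmC", 0.4)],
    "r3": [("firmC", 1.0)],
    "r4": [("firmA", 0.5), ("firmC", 0.5)],
}
firm_size = {"firmA": 12000, "firmB": 40, "firmC": 350}  # invented values

rng = random.Random(1)
estimates, variances = [], []
for _ in range(20):  # 20 imputed sets of links
    links = draw_links(candidates, rng)
    sizes = [firm_size[links[p]] for p in candidates]
    estimates.append(statistics.fmean(sizes))
    variances.append(statistics.variance(sizes) / len(sizes))

mean_size, total_var = rubin_combine(estimates, variances)
print(mean_size, total_var)
```

The between-imputation term grows when the sampled links disagree, so uncertain matches widen the pooled variance instead of being hidden behind a single "best" link.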
Incomplete data - an approach to compensate for nonresponse in income data with administrative data and imputation methods
Ms Nadine Bachbauer (research assistant) - Presenting Author
Individual income is regarded as a sensitive issue, and missing values as well as incorrect information in income data constitute a persistent problem of survey data. Mechanisms such as social desirability, memory errors and motivational factors can cause missing data. Incomplete income data can reduce data quality and thus bias estimation results. Imputation is a well-established method for closing gaps caused by missing values. However, imputation provides only an estimate and not necessarily the true value. Moreover, its quality depends on whether the assumptions of the imputation model hold.
NEPS-SC6-ADIAB, a linked data product of the Institute for Employment Research (IAB) and the Leibniz Institute for Educational Trajectories (LIfBi) that combines administrative data of the IAB with survey data of the National Educational Panel Study (NEPS), constitutes the database for this research project. I examine whether multiple imputation of NEPS earnings achieves results comparable to the administrative earnings, which are considered to contain the true value. Comparative analyses of the imputed and administrative earnings should show whether multiple imputation has improved the quality of the survey earnings.
First, I identify nonresponse in earnings in the NEPS data and generate a nonresponse dummy to analyse whether the earnings data are missing at random (MAR). To this end, I estimate a model with the nonresponse dummy as the dependent variable; the independent variables are those usually used to impute earnings. To identify the underlying missingness process, a further model additionally contains the administrative earnings.
Preliminary results show that earnings have no significant effect on nonresponse if all relevant survey variables are included. This can be interpreted as evidence that earnings nonresponse follows a MAR process. The final comparison between the administrative earnings and the imputed earnings will show whether their distributions differ significantly.
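The missingness check described in this abstract can be sketched as a logistic regression of a nonresponse dummy on a survey covariate and on administrative earnings. The sketch uses simulated data in which nonresponse depends only on the covariate, so the earnings coefficient should be near zero. The variable names, the data-generating values and the plain gradient-ascent fit are assumptions for illustration; an actual analysis would use a statistics package that also reports standard errors and significance tests.

```python
import math
import random


def fit_logit(rows, y, lr=0.5, iters=300):
    """Fit a logistic regression by gradient ascent.

    Returns [intercept, b1, b2, ...] for the columns of `rows`.
    """
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for x, yi in zip(rows, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            resid = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def simulate_mar(n=2000, seed=0):
    """Synthetic MAR data: nonresponse depends on education only (assumed)."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        edu = rng.gauss(0, 1)  # standardised survey covariate
        earn = 0.5 * edu + rng.gauss(0, math.sqrt(0.75))  # standardised admin earnings
        p_missing = 1.0 / (1.0 + math.exp(-(-1.0 + 1.0 * edu)))
        rows.append((edu, earn))
        y.append(1 if rng.random() < p_missing else 0)
    return rows, y


rows, y = simulate_mar()
intercept, b_edu, b_earn = fit_logit(rows, y)
print(intercept, b_edu, b_earn)
```

If nonresponse depended on earnings even after conditioning on the survey covariates, the earnings coefficient would move away from zero, which would speak against a MAR process.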
Selectivity in the record linkage process – A big issue?
Dr Dina Frommert (German Pension Insurance) - Presenting Author
Mrs Christin Czaplicki (German Pension Insurance)
Dr Thorsten Heien (Kantar Public)
Mr Marvin Krämer (Kantar Public)
The combination of data from different sources offers many benefits. From a methodological point of view, however, any data linkage introduces new sources of error into the process of data gathering. As far as the linkage process is concerned, this could be bias inherent to the new or alternative data source, bias introduced by the technical process of data linkage, or, if the active consent of the respondents is needed, another form of nonresponse error.
For the German study “Life courses and pension provisions (LeA)”, extensive survey data on life courses and pension provisions in different pension schemes are linked with administrative data from the respondents’ individual pension accounts. Because of the strict data protection rules in Germany, every respondent is required to consent to the data linkage; otherwise, the administrative data may not be extracted and linked to the survey data. In practice, this means the respondents need to sign a consent form during the interview, which is a lot to ask from respondents who have already consented to take part in a lengthy survey. For the first time, LeA provided the respondents with an electronic consent form, so that the effort of signing and sending off the form was considerably reduced.
The proposed paper will report the consent rates for specific subgroups of the population, e.g. Germans, migrants, employees and civil servants. We will also examine socio-demographic characteristics of consenters and non-consenters and estimate the amount of selectivity introduced to the sample by the record linkage. To this end, we estimate a logit model to test the inclination of different subgroups to consent.
Due to high consent rates in LeA, we do not expect noteworthy additional selectivity. Nevertheless, the results demonstrate that consent varies between different groups. Reducing the effort to consent can play an important role in reducing selectivity.