



Friday 17th July, 09:00 - 10:30 Room: HT-103


Technical Problems and Solutions for Record Linkage and Big Data 1

Convenor Dr Manfred Antoni (Institute for Employment Research (IAB))
Coordinator 1 Mr Stefan Bender (Institute for Employment Research (IAB))
Coordinator 2 Professor Rainer Schnell (University of Duisburg-Essen)

Session Details

The scope of the session includes technical issues of linkage, handling large administrative databases or big data (for example, blocking strategies) and problems caused by incomplete identifiers. Furthermore, techniques and problems of privacy preserving record linkage and secure access to linked datasets will be discussed. Finally, new algorithms and software for record-linkage applications for large datasets will be covered.

We invite presentations on:
• Handling missing and messy identifiers
• Blocking techniques
• Privacy Preserving Record Linkage
• Access to linked datasets
• Algorithms and Software

Paper Details

1. Privacy-Preserving Distance Comparable Geospatial Encoding
Dr James Farrow (Farrow Norris)
Professor Rainer Schnell (University of Duisburg-Essen)

A privacy-preserving method of encoding location is presented that does not use explicit coordinates. Encoded values can be compared to determine the distance between locations to a desired level of accuracy. A discussion of the tradeoff between encoding size, desired accuracy and maximum calculable distance is presented.

The approach is suitable for allowing calculations on geospatial information, e.g. address information, where individual locations must not be readily identifiable for privacy reasons but where records may need to be compared to obtain their distance from one another or from other features.


2. Recent advances in Privacy Preserving Record Linkage
Professor Rainer Schnell (University of Duisburg-Essen)
Mr Christian Borgs (University of Duisburg-Essen)

Privacy preserving record linkage (PPRL) is an academic field dedicated to the study of techniques for linking surveys and/or administrative databases without the use of unique personal identifiers. During the last decade, a number of different PPRL techniques have been suggested, and a few of them are in use for large-scale surveys. The presentation will explain the basic approaches and their advantages and disadvantages concerning performance and cryptographic properties. Based on recent research, recommendations for the practical implementation of PPRL for large data sets with millions of records and missing identifiers will be given.
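
The abstract does not say which PPRL techniques the presentation covers. As an illustration only, one family of techniques known from the PPRL literature encodes the character bigrams of an identifier into a keyed Bloom filter and compares the encodings with a Dice coefficient. The sketch below assumes that approach; the function names, the filter size, the number of hash functions and the secret key are all hypothetical choices, not values taken from the presentation.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a padded, lower-cased name into its set of character bigrams."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(name, secret, size=1000, num_hashes=10):
    """Return the set of bit positions set in a Bloom filter for `name`.

    `size` and `num_hashes` are illustrative assumptions; `secret` is a
    key shared by the data owners but withheld from the linkage unit.
    """
    positions = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a, b):
    """Dice coefficient of two sets of bit positions (1.0 means identical)."""
    return 2 * len(a & b) / (len(a) + len(b))


key = b"hypothetical-shared-secret"
smith = encode("Smith", key)
smyth = encode("Smyth", key)
jones = encode("Jones", key)
```

With these encodings, `dice(smith, smyth)` is much higher than `dice(smith, jones)`, so similar spellings stay comparable although the linkage unit never sees the names themselves. The cryptographic properties of such encodings are among the trade-offs the abstract refers to.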


3. Linking the SED to administrative data: technical challenges
Dr Joshua Tokle (American Institutes for Research)
Ms Christina Jones (American Institutes for Research)
Dr Michelle Yin (American Institutes for Research)

I will describe work conducted by the American Institutes for Research to link the NSF's Survey of Earned Doctorates (SED) to UMETRICS data, an administrative data set that contains payments to employees on federal grants for participating universities. Data were linked using a standard Fellegi-Sunter approach. I will describe our process for preparing and linking the data, including our success using Python, MySQL, and an in-house Java implementation of the Fellegi-Sunter algorithm.
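
For readers unfamiliar with the Fellegi-Sunter approach, the following is a minimal Python sketch of its core scoring step. It is not the authors' Java implementation. The field names, the m- and u-probabilities and the two thresholds are hypothetical values chosen for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical per-field parameters (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Assign a record pair to one of three classes using two thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


survey = {"last_name": "tokle", "first_name": "joshua", "birth_year": 1980}
admin_same = {"last_name": "tokle", "first_name": "joshua", "birth_year": 1980}
admin_other = {"last_name": "jones", "first_name": "ann", "birth_year": 1975}
```

Here `classify(match_weight(survey, admin_same))` yields "match" and `classify(match_weight(survey, admin_other))` yields "non-match". Pairs that fall between the two thresholds are typically sent to clerical review.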