All time references are in CEST
Safe research data sharing: disclosivity and sensitivity issues 2
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Deborah Wiltshire (GESIS – Leibniz Institute for the Social Sciences)
Wednesday 19 July, 09:00 - 10:30
A core activity of many surveys is the safe provision of well-documented and de-identified survey data, linked administrative data and geographical data to the research community. Data sharing is based on the consent given by the participants and is conditional on the assurance that confidentiality and GDPR rights will be protected. Breaking this assurance would constitute an ethical violation of consent and would threaten the trust that survey participants place in the team who collects their data and may affect their willingness to participate in further data collections.
Data sharing policies and applications are generally overseen by Data Access Committees. Data releases are either managed by the studies themselves, or by national repositories. The choice of data access routes usually depends on the disclosivity and sensitivity of the data. Data are considered disclosive if there are concerns over the re-identification of individuals, households, or organisations by a motivated intruder, and are considered sensitive if they fall under the GDPR definition of “special category data”, which require additional protection. Disclosive and sensitive data require a higher degree of security and are generally only available in secure sharing platforms, such as local secure servers or Trusted Research Environments (TREs).
The aim of this session is to create a space to share ideas and techniques on data access and how to address the risk of disclosivity and sensitivity. We invite colleagues to submit ideas relating to:
• Data sharing routes for survey and linked data
• Methods of disclosure control prior to data sharing
• Methods of risk assessment of disclosivity and sensitivity
• Data classification policies and sharing agreements
• Technical tools used to generate bespoke datasets
• Trusted Research Environments / Secure Labs: remote vs in-person access
• Syntax sharing and reproducibility
• International data sharing
Papers need not be restricted to these specific examples.
Keywords: sharing, disclosivity, sensitivity, safe access, disclosure control
Mr Daniel Krähmer (LMU Munich) - Presenting Author
Ms Laura Schächtele (LMU Munich)
Dr Andreas Schneck (LMU Munich)
Transparency and peer control are cornerstones of good scientific practice and entail the replication and reproduction of findings. Scientific accountability, however, rests on the premise that researchers make their data and research code publicly available. To this day, sharing research material seems to be the exception rather than the rule among social scientists.
To investigate which specific factors increase or inhibit researchers’ willingness to share code, we conducted a field experiment sending code requests to 1206 corresponding authors who have published research articles based on data from the European Social Survey between 2015 and 2020. In this experiment, we randomly varied three aspects of our email’s wording in a 4x2x2 factorial design: the appeal why researchers should share their code (FAIR principles, academic altruism, prospect of citation, no information), the perceived effort of code sharing (no code cleaning required, no information) and our request’s overall framing (enhancement of social science research, response to replication crisis). This experiment has been preregistered on OSF.
Contrary to our hypotheses, only framing affected researchers’ code sharing behavior: Scientists who received the negative wording alluding to the replication crisis were significantly more likely to share their research code. Overall, 385 researchers (37.5% of eligible cases) responded positively to our request, mirroring the degree of code availability in other disciplines.
Our study makes three key contributions. First, it provides a large-scale aggregate assessment of code sharing behaviour among social scientists using survey data. Second, it pins down the causal influence of specific factors on code sharing. Third, it unveils rather alarming anecdotal evidence on the challenges researchers face when sharing code (e.g., poor code management). Together, these results point towards the dire need for institutional solutions regarding code availability.
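As an illustration of the 4x2x2 factorial design described in this abstract, the following minimal sketch enumerates the 16 conditions and assigns 1206 recipients to them. The abstract does not say how randomisation was carried out, so the balanced (near-equal cell size) assignment, the function name `assign_conditions`, the seed, and the short level labels are all assumptions made for this example.

```python
import itertools
import random
from collections import Counter

# Factor levels as named in the abstract; the short labels are our own.
APPEAL = ["fair_principles", "academic_altruism", "citation_prospect", "no_information"]
EFFORT = ["no_cleaning_required", "no_information"]
FRAMING = ["enhance_research", "replication_crisis"]

# All 4 x 2 x 2 = 16 experimental conditions.
CONDITIONS = list(itertools.product(APPEAL, EFFORT, FRAMING))


def assign_conditions(n_recipients, seed=0):
    """Assign each recipient to one condition with near-equal cell sizes.

    Assumption: balanced randomisation. The study may have used a
    different procedure (e.g. independent draws per recipient).
    """
    rng = random.Random(seed)
    full, extra = divmod(n_recipients, len(CONDITIONS))
    # Every condition appears `full` times; `extra` randomly chosen
    # conditions receive one additional recipient.
    slots = CONDITIONS * full + rng.sample(CONDITIONS, extra)
    rng.shuffle(slots)  # random order decides who gets which condition
    return slots


assignments = assign_conditions(1206)
cell_sizes = Counter(assignments)  # recipients per condition
```

With 1206 recipients and 16 conditions, each cell holds 75 or 76 recipients under this scheme; fixing the seed makes the assignment reproducible.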
Ms Beate Lichtwardt (UK Data Service/UKDA, University of Essex)
Dr Deborah Wiltshire (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
Because of their sensitive nature, secure/controlled data are subject to highly restricted access: they are accessible only remotely via Trusted Research Environments (TREs) or on-site in dedicated Safe Rooms.
Internationally, researchers often face significant hurdles in terms of the time and cost of travelling to a Safe Room, a hurdle that not many can overcome. In recent years, work has intensified to enable data access across international borders (IDAN, SSHOC) via Safe Room Remote Access. One of the achievements of SSHOC WP 5.4 was to set up a bilateral agreement between the UK Data Service and GESIS, Germany, allowing a remote connection between the two services.
Remote connections offer a safe environment to access confidential data. Datasets remain on the secure servers of the data provider (in location A) and are accessed via a secure encrypted internet connection (from location B) where all analysis is done. No physical transfer of the data ever occurs. The Safe Room at location B provides additional physical controls (e.g. Safe Room access and monitoring).
This agreement allows researchers in Germany to access a range of secure datasets made available by the UKDS SecureLab from the GESIS Safe Rooms in Germany and vice versa. The agreement is available as a template to help facilitate the setting up of further remote connections. Once the legal agreements were in place, the technical set-up started. Thin Clients were configured, exchanged between partners and installed in the Safe Room of the other service. Technical tests were carried out, followed by researcher tests.
Our talk will showcase the connection set up between the UK Data Service and GESIS and highlight some of the key challenges, as well as the solutions we found along the way.
Mr Andy Boyd (University of Bristol) - Presenting Author
Ms Robin Flaig (University of Edinburgh)
Dr Jacqui Oakley (University of Bristol)
Mr Richard Thomas (University of Bristol)
Dr Katharine Evans (University of Bristol)
Mr Matthew Crane (University of Bristol)
Ms Kirsteen Campbell (University of Edinburgh)
Dr Stela McLachlan (University of Edinburgh)
Dr Emma Turner (University of Bristol)
The ‘Longitudinal Health and Wellbeing National Core Study’ uses Longitudinal Population Study (LPS) data to inform COVID-19 research. A centralised infrastructure for LPS was needed to systematically link participants’ routine records and provide secure researcher access.
The UK Longitudinal Linkage Collaboration (UKLLC) is a pan-UK and interdisciplinary Trusted Research Environment (TRE). It provides remote access to pooled LPS data using a ‘Secure eResearch Platform’. UKLLC’s design was informed by LPS data managers and public contributors. To access participants’ records it needs to demonstrate effective disclosure controls in order to gain regulatory approvals and maintain the study-participant trust relationship.
Over 20 LPS have joined UKLLC, with the self-reported data of more than 280,000 participants linked to NHS records (primary, secondary, COVID-19; civic registers; prescriptions; mental health) and environmental exposure data (pollution, green space, neighbourhood indicators). Permissions are in place to link socio-economic records (earnings, employment, social benefits, education). Our governance model is based on the ‘Five Safes’ and the UK Anonymisation Network’s ‘Anonymisation Decision Making Framework’ (ADF) to assess and control disclosure risk. Our system enables the complete separation of processing participants’ identifiers for linkage purposes from de-identified attribute data for analysis. Disclosure risk is assessed using the ADF, which informs protocols to de-identify data prior to entering the TRE. Following analysis, analysts submit anonymous aggregate statistical findings and reusable research outputs (syntax, derived data, documentation) for disclosure review assessments. Public/participant involvement guides our approach to understanding data sensitivity and risk tolerance.
UKLLC provides a unique strategic research-ready platform for longitudinal research. It offers a scalable solution for rapid access to sensitive data from multiple data owners whilst maintaining confidentiality through controlling disclosure risk. It is compliant with key UK regulations needed to access linked routine records.
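The separation of identifiers (used for linkage) from de-identified attribute data (used for analysis) mentioned in this abstract can be illustrated with a minimal sketch. This is not UKLLC's implementation: the function `split_records`, the `link_key` field, and the example record are hypothetical, and a real system would hold the two files with different parties and apply further disclosure controls, since removing direct identifiers alone does not make data anonymous.

```python
import secrets


def split_records(records, identifier_fields):
    """Split each record into an identifier part and an attribute part.

    The two parts share only a random linkage key, so the attribute file
    carries no direct identifiers. Hypothetical helper for illustration.
    """
    identifier_file, attribute_file = [], []
    for record in records:
        # Random pseudonym, not derived from any identifier value.
        key = secrets.token_hex(8)
        identifiers = {f: record[f] for f in identifier_fields}
        attributes = {f: v for f, v in record.items() if f not in identifier_fields}
        identifier_file.append({"link_key": key, **identifiers})
        attribute_file.append({"link_key": key, **attributes})
    return identifier_file, attribute_file


# Invented example record; field names and values are not from any study.
records = [
    {"name": "A. Example", "nhs_number": "0000000000", "birth_year": 1970, "wave": 3},
]
id_file, attr_file = split_records(records, identifier_fields={"name", "nhs_number"})
```

In this sketch only the holder of `id_file` can relate a `link_key` back to a person; analysts would see `attr_file` alone. Indirect identifiers such as `birth_year` remain in the attribute part and would still need a disclosure risk assessment.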
Dr Janete Saldanha Bach (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
The paper addresses privacy and personal data protection requirements within research data management (RDM) practices. It analyzes research data sharing and reuse approaches from a privacy and data protection perspective, considering how compliance is enforced within technological and policy guidance on sharing and reuse practices. Research infrastructures also require trustworthiness procedures and, most importantly, harmonized policies, since data are interoperable and exchangeable among countries. Because the interoperability of research data components is crucial for data reuse, harmonized privacy and data protection policies should enhance the approach for future activities regarding interoperability. To conduct this analysis, I rely on the FAIR principles, the guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Given a more regulated environment on the one hand and increasingly data-driven science on the other, how can we balance privacy enforcement, make scientists literate in privacy, personal data protection and digital skills, and foster open science at the same time? The outcome of this analysis is a harmonized policy framework on the sharing and reuse of personal data. The framework supports convergence on regulation, best practices, and skills to align the data protection procedures of research institutions or independent researchers. The results address the recognized need for guidelines on the ethical use of qualitative, sensitive, and confidential data and on the most satisfactory approaches for data licensing, policies, usage restrictions, cybersecurity, and digital ethics development within the research data landscape.
Ms Tatjana Mika (Researcher) - Presenting Author
Survey data are increasingly often linked with administrative data in order to enlarge data content and improve data quality. Record linkage is assumed to increase reliability, especially concerning past events such as short-term unemployment and labour market entry. Other fields of application are subjects that are difficult for respondents to report, such as gross income.
However, access to linked survey-record data is often restricted due to data security concerns. These concerns often restrict access to use of the data in safe rooms or to a limited number of persons directly employed on a specific research project. A convincing anonymisation concept can hence improve access for more researchers if the result is a Scientific Use File. Given the high costs of record linkage and survey data, this is a very desirable goal.
The ongoing project “SHARE-RV”, which started in 2008-2009, asks the German participants of the international Survey of Health, Ageing and Retirement in Europe (SHARE) to agree to the linkage of their SHARE interview with data from their pension insurance record. The project has continued since 2009 and has published several Scientific Use Files combining survey data with record data, which can be ordered for use in universities and scientific institutions.
A similar concept was applied to the record data linked to the German Socio-Economic Panel (GSOEP). The linked survey-record data are also published as a Scientific Use File for off-site use.
The presentation offers insights into the process of anonymisation for linked survey-record data published as Scientific Use Files.