All time references are in CEST
Safe research data sharing: disclosivity and sensitivity issues 1
|Session Organisers|| Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Deborah Wiltshire (GESIS – Leibniz Institute for the Social Sciences)
|Time||Tuesday 18 July, 11:00 - 12:30|
A core activity of many surveys is the safe provision of well-documented and de-identified survey data, linked administrative data and geographical data to the research community. Data sharing is based on the consent given by the participants and is conditional on the assurance that confidentiality and GDPR rights will be protected. Breaking this assurance would constitute an ethical violation of consent and would threaten the trust that survey participants place in the team who collects their data and may affect their willingness to participate in further data collections.
Data sharing policies and applications are generally overseen by Data Access Committees. Data releases are either managed by the studies themselves, or by national repositories. The choice of data access routes usually depends on the disclosivity and sensitivity of the data. Data are considered disclosive if there are concerns over the re-identification of individuals, households, or organisations by a motivated intruder, and are considered sensitive if they fall under the GDPR definition of “special category data”, which require additional protection. Disclosive and sensitive data require a higher degree of security and are generally only available in secure sharing platforms, such as local secure servers or Trusted Research Environments (TREs).
The aim of this session is to create a space to share ideas and techniques on data access and how to address the risk of disclosivity and sensitivity. We invite colleagues to submit ideas relating to:
• Data sharing routes for survey and linked data
• Methods of disclosure control prior to data sharing
• Methods of risk assessment of disclosivity and sensitivity
• Data classification policies and sharing agreements
• Technical tools used to generate bespoke datasets
• Trusted Research Environments / Secure Labs: remote vs in-person access
• Syntax sharing and reproducibility
• International data sharing
Papers need not be restricted to these specific examples.
Keywords: sharing, disclosivity, sensitivity, safe access, disclosure control
Dr Paulo Matos Serodio (University of Essex) - Presenting Author
Professor Tarek Al Baghal (University of Essex)
Dr Curtis Jessop (NatCen)
Dr Shujun Liu (Cardiff University)
Professor Luke Sloan (Cardiff University)
Professor Matthew Williams (Cardiff University)
Social media corpora are increasingly used in social science research, albeit rarely accompanied by survey data. One key challenge in linking survey and social media data, alongside getting consent to linkage from the respondents, is creating valuable metrics that summarise respondents’ social media activity without unmasking their identities. In this paper, we outline a framework for developing social media metrics that can be combined with survey data while also: (1) minimizing the risk of disclosure of respondents’ identity; (2) producing insightful metrics that both contrast and enhance the information obtained from the survey.
Leveraging data from the Innovation Panel of the UK Household Longitudinal Survey, which asked for respondents’ consent to link their survey responses to their Twitter data, we propose a systematic and transparent approach to generating summary information of Twitter activity across a number of dimensions that can be linked to survey data.
A second challenge we address is providing sufficient breadth in the features we extract from respondents’ Twitter data such that the metrics are useful across disciplinary boundaries, reflecting the multi-dimensional nature of the survey.
Overall, the paper proposes a framework and documents the process through which Twitter data are summarised using approaches including natural language processing tools to extract aggregate features at the user level that can enhance the survey data corpora. We expect this framework will be useful for researchers looking to publish linked survey and Twitter data, and that its principles can be applied to other forms of digital trace data.
Ms Tatjana Mika (Researcher) - Presenting Author
Survey data are increasingly linked with administrative data in order to extend data content and improve data quality. Record linkage is assumed to increase reliability, especially for past events such as short-term unemployment and labour market entry. It is also applied to subjects that are difficult for respondents to report, such as gross income.
However, access to linked survey-record data is often restricted due to data security concerns. These concerns often restrict access to the use of the data in safe rooms or to a limited number of persons directly employed on a specific research project. A convincing anonymisation concept can hence improve access for more researchers if the result is a Scientific Use File. Given the high costs of record linkage and survey data, this is a very desirable goal.
The ongoing project “SHARE-RV”, which started in 2008-2009, asks the German participants of the international Survey of Health, Ageing and Retirement in Europe (SHARE) to agree to the linkage of their SHARE interview with data from their pension insurance record. The project has continued since 2009 and has published several Scientific Use Files combining survey data with record data, which can be ordered for use in universities and scientific institutions.
A similar concept was applied to the record data linked to the German Socio-Economic Panel (GSOEP). The linked survey-record data are also published for off-site use as a Scientific Use File.
The presentation offers insights into the process of anonymisation for linked survey-record data published as Scientific Use Files.
Dr Janete Saldanha Bach (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
The paper addresses privacy and personal data protection requirements within research data management (RDM) practices. It analyzes research data sharing and reuse approaches from a privacy and data protection perspective, considering how compliance is enforced through technological and policy guidance on sharing and reuse practices. Research infrastructures also require trustworthiness procedures and, most importantly, harmonized policies, since data are interoperable and exchangeable among countries. Since the interoperability of research data components is crucial for data reuse, harmonized privacy and data protection policies should inform future interoperability activities. To conduct this analysis, I rely on the FAIR principles, the guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Given a more regulated environment on the one hand and increasingly data-driven science on the other, how can we balance privacy enforcement, train scientists in privacy, personal data protection and digital skills, and foster open science at the same time? The outcome of this analysis is a harmonized policy framework on the sharing and reuse of personal data. The framework supports convergence on regulation, best practices, and skills to align the data protection procedures of research institutions and independent researchers. The results address the recognized need for guidelines on the ethical use of qualitative, sensitive, and confidential data and on the most suitable approaches to data licensing, policies, usage restrictions, cybersecurity, and digital ethics development within the research data landscape.
Dr Deborah Wiltshire (GESIS – Leibniz Institute for the Social Sciences) - Presenting Author
Not all microdata can be anonymised without losing too much detail. For some data, once sufficient detail is removed to make it anonymous, much of its utility is lost. Therefore, pseudonymized data, data that is not fully anonymised, is increasingly made available. Under data protection legislation (GDPR), these data are considered ‘personal data’ and require appropriate safeguards.
There are many positive arguments for making pseudonymized data available: they expand the scope of the research possible, contributing to vital policy-related research and allowing data to be linked together. In the post-pandemic era, their role has been even more important.
Trusted Research Environments (TREs) play an integral role in enabling safe access to sensitive data. In the earlier years of secure access, Safe Rooms (secured, physical locations where researchers could access and analyse these data) were the predominant access route.
Safe Rooms have considerable advantages, not least because of the ability for secure data services to control almost all factors. However, Safe Rooms have one significant drawback: the burden on researchers to travel, sometimes long distances, to work at a specific location, a burden not all researchers are able to meet equally. This has led to exploring remote access options. The pandemic, which led to a lengthy shutdown of Safe Room data access, has further pushed this agenda forward.
The move towards easier, more flexible remote access options is a popular one with researchers but it comes with a dilemma for secure data access facilities – how to manage the differential risks of the different access routes. The 5 Safes Framework has been widely used to structure the decision-making processes in TREs.
This presentation explores how it can now be utilized in managing the differential risks of shifting to new access routes, using the example of the Secure Data Center at