All time references are in CEST
Safe research data sharing: disclosivity and sensitivity issues 1
Session Organisers: Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute)
Dr Deborah Wiltshire (GESIS – Leibniz Institute for the Social Sciences)
Time: Tuesday 18 July, 14:00 - 15:30
A core activity of many surveys is the safe provision of well-documented and de-identified survey data, linked administrative data and geographical data to the research community. Data sharing is based on the consent given by the participants and is conditional on the assurance that confidentiality and GDPR rights will be protected. Breaking this assurance would constitute an ethical violation of consent and would threaten the trust that survey participants place in the team who collects their data and may affect their willingness to participate in further data collections.
Data sharing policies and applications are generally overseen by Data Access Committees. Data releases are either managed by the studies themselves, or by national repositories. The choice of data access routes usually depends on the disclosivity and sensitivity of the data. Data are considered disclosive if there are concerns over the re-identification of individuals, households, or organisations by a motivated intruder, and are considered sensitive if they fall under the GDPR definition of “special category data”, which require additional protection. Disclosive and sensitive data require a higher degree of security and are generally only available in secure sharing platforms, such as local secure servers or Trusted Research Environments (TREs).
The aim of this session is to create a space to share ideas and techniques on data access and how to address the risk of disclosivity and sensitivity. We invite colleagues to submit ideas relating to:
• Data sharing routes for survey and linked data
• Methods of disclosure control prior to data sharing
• Methods of risk assessment of disclosivity and sensitivity
• Data classification policies and sharing agreements
• Technical tools used to generate bespoke datasets
• Trusted Research Environments / Secure Labs: remote vs in-person access
• Syntax sharing and reproducibility
• International data sharing
Papers need not be restricted to these specific examples.
Keywords: sharing, disclosivity, sensitivity, safe access, disclosure control
Dr Marieke Heers (FORS, Swiss Centre of Expertise in the Social Sciences) - Presenting Author
Dr Brian Kleiner (FORS, Swiss Centre of Expertise in the Social Sciences)
Dr Alexandra Stam (FORS, Swiss Centre of Expertise in the Social Sciences)
Scientific journals increasingly require the sharing of data and related materials. As such, reproducibility of research and scientific analyses is becoming increasingly important for survey researchers, who must make available the various materials used for their scientific articles. These materials are usually shared via repositories and include the data collection instruments, the data themselves, proper documentation, and the syntax used for the analyses.
The benefits to researchers of sharing data and related materials linked to scientific publications are considerable, including greater visibility of one’s own research and data, as well as reinforced trust and confidence in one’s conclusions. Further, some universities are moving towards including data citation as part of the assessment of research impact, and there is evidence that articles for which data are shared are more frequently cited.
However, researchers often struggle in practice to share their data and materials. Challenges include issues concerning proper informed consent, anonymisation, copyright, adequate documentation, and data security. In addition, data citation practice, which allows readers to link from the article to the data in a repository, is often unclear, with little guidance from journals.
We put forward that the data services of repositories or universities have a key role to play in this regard, since they often provide the support and tools to researchers so that they can properly share their data and related materials. Their activities often cover the full research cycle and, amongst others, relate to questions of anonymisation, data citation, and documentation. This contribution will aim at fostering a discussion of the current challenges facing survey researchers regarding the sharing of data and related materials, as well as the needed forms of support that could be brought to bear or developed further by data services.
Mr Urs Fichtner (Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg) - Presenting Author
Dr Lukas Horstmeier (Institute of Medical Biometry and Statistics, Section of Health Care Research and Rehabilitation Research, Faculty of Medicine and Medical Center – University of Freiburg)
Dr Boris Bruehmann (Institute of Medical Biometry and Statistics, Section of Health Care Research and Rehabilitation Research, Faculty of Medicine and Medical Center – University of Freiburg)
Mr Manuel Watter (Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg)
Professor Harald Binder (Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg)
Mr Jochen Knaus (Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg)
One of the currently debated changes in scientific practice is the implementation of data sharing requirements for peer-reviewed publication to increase transparency and intersubjective verifiability of results. Therefore, both funding agencies and scientific journals try to promote the publication of research data. However, it seems that data sharing is not yet a fully adopted behavior among researchers. The Theory of Planned Behavior has repeatedly been applied to explain drivers of data sharing from the perspective of data donors (researchers). Furthermore, data sharing can also be understood as disclosure of personal information, e.g. from the perspective of survey participants. This study aimed to answer the following questions:
1. Is participants' non-response affected by the information that the collected data will be shared?
2. Is participants' response behavior affected by the information that the collected data will be shared?
We applied a mixed methods approach, consisting of a qualitative pre-study and a quantitative survey including an experimental component. The latter was a two-group setup with an intervention group (A), which received the information that data will be shared publicly, and a control group (B). The survey included questions on views and experiences regarding data sharing. Members of the Medical Faculty of the University of Freiburg were recruited from a list over a period of 15 days. For exploratory data analysis of dropouts and non-response, we used Fisher’s exact tests and binary logistic regressions.
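As an illustration of the kind of group comparison described above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of group by dropout status. It is not the authors' analysis code. The function name and the dropout counts are hypothetical and chosen only so that the rows sum to the reported group sizes; the abstract does not report dropout counts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely
    # as the observed one (small tolerance for floating-point ties).
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# HYPOTHETICAL dropout counts, for illustration only (not the study's results).
# Only the group sizes (197 and 198) are taken from the abstract.
dropouts_a, completers_a = 30, 167  # Group A: informed about public sharing
dropouts_b, completers_b = 28, 170  # Group B: control
p = fisher_exact_two_sided(dropouts_a, completers_a, dropouts_b, completers_b)
print(f"two-sided p = {p:.3f}")
```

A large p-value in such a table would be consistent with no group difference in dropout; the binary logistic regressions mentioned above would additionally allow adjustment for covariates.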
In sum, we recorded 197 cases for Group A and 198 cases for Group B. We found no systematic group differences regarding response bias or dropout, indicating no major effect of the information that the collected data will be shared publicly. Furthermore, we gained insights into the experiences our sample had with data sharing: half of the sample had already requested data from other researchers or shared data at the request of other researchers. Data repositories, however, were used less frequently: 28% of our respondents had used data from repositories and 19% had stored data in a repository.
Keywords: survey response bias, data sharing, dropout rate, researcher behavior, data publication
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, University College London) - Presenting Author
Ms Claudia Yogeswaran (Centre for Longitudinal Studies, University College London)
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, University College London)
Sharing survey data for research purposes must be governed by principles and procedures that seek to be fair, open, and transparent. There is a balance to be drawn between maximising the use of the research data and minimising risks to the rights of participants.
The UCL Centre for Longitudinal Studies (CLS) is home to four national longitudinal cohort studies, which follow the lives of tens of thousands of people in the UK. The data collection, linkage, management and sharing is based on the consent given by the participants.
CLS has established a data sharing programme that aims to ensure that the CLS data are as widely available as possible to the research community (nationally and internationally), whilst guaranteeing that: i) sensitive and/or disclosive data are managed and shared in a secure manner; ii) the legal requirements, ethical guidelines, and moral responsibility to the study participants are maintained; and iii) the consent agreements given by the cohort members are complied with. Any attempt to re-identify individuals in the research data is always forbidden.
In this paper we will describe the CLS tiered data classification, which determines the most appropriate data sharing route and the licencing needed. The main criteria for data categorisation are sensitivity and disclosivity risk, which are assessed in depth by the CLS data management team. The four CLS data tiers are: 1) Tier 1a: safeguarded data of low sensitivity and a small residual disclosivity risk; 2) Tier 1b: special safeguarded data of slightly elevated sensitivity and a small residual disclosivity risk; 3) Tier 2: controlled access data of a highly sensitive nature or with a significant disclosivity risk; 4) Tier 3: controlled access data with a very high level of sensitivity and/or disclosivity risk. Data from tiers 2 and 3 must be accessed from Trusted Research Environments.
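The routing rule in this classification can be expressed as a small lookup. The sketch below is illustrative only and is not CLS code: the names `Tier`, `TRE_ONLY` and `requires_tre` are hypothetical, and the four tiers are taken to be 1a, 1b, 2 and 3. The only rule encoded is the one stated in the abstract, namely that data from tiers 2 and 3 must be accessed from Trusted Research Environments; the access routes for tiers 1a and 1b are not specified in the abstract and are therefore not modelled.

```python
from enum import Enum


class Tier(Enum):
    # Labels paraphrase the tier descriptions in the abstract.
    TIER_1A = "safeguarded"
    TIER_1B = "special safeguarded"
    TIER_2 = "controlled access"
    TIER_3 = "controlled access (very high sensitivity and/or disclosivity risk)"


# Tiers that, per the abstract, must be accessed from a TRE.
TRE_ONLY = frozenset({Tier.TIER_2, Tier.TIER_3})


def requires_tre(tier: Tier) -> bool:
    """Return True if data in this tier must be accessed from a TRE."""
    return tier in TRE_ONLY


for tier in Tier:
    route = "TRE required" if requires_tre(tier) else "TRE not mandated by this rule"
    print(f"{tier.name}: {tier.value} -> {route}")
```

Keeping the rule in one place makes it easy to audit which datasets are released through which route once each dataset has been assigned a tier.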
Dr Paulo Matos Serodio (University of Essex) - Presenting Author
Professor Tarek Al Baghal (University of Essex)
Dr Curtis Jessop (NatCen)
Dr Shujun Liu (Cardiff University)
Professor Luke Sloan (Cardiff University)
Professor Matthew Williams (Cardiff University)
Social media corpora are increasingly used in social science research, albeit rarely accompanied by survey data. One key challenge in linking survey and social media data, alongside getting consent to linkage from the respondents, is creating valuable metrics that summarise respondents’ social media activity without unmasking their identities. In this paper, we outline a framework for developing social media metrics that can be combined with survey data while also: (1) minimizing the risk of disclosure of respondents’ identity; (2) producing insightful metrics that both contrast and enhance the information obtained from the survey.
Leveraging data from the Innovation Panel of the UK Household Longitudinal Survey, which asked for respondents’ consent to link their survey responses to their Twitter data, we propose a systematic and transparent approach to generating summary information of Twitter activity across a number of dimensions that can be linked to survey data.
A second challenge we address is providing sufficient breadth in the features we extract from respondents’ Twitter data, such that the metrics are useful across disciplinary boundaries, reflecting the multi-dimensional nature of the survey.
Overall, the paper proposes a framework and documents the process through which Twitter data are summarised, using approaches including natural language processing tools, to extract aggregate features at the user level that can enhance the survey data corpora. We expect that this framework will be useful for researchers looking to publish linked survey and Twitter data, and that its principles can be applied to other forms of digital trace data.
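To make the idea of user-level aggregation concrete, here is a minimal sketch, not the authors' pipeline, of how a respondent's tweets might be reduced to summary metrics that contain no tweet text, handles or hashtags. The function name `summarise_user`, the feature names and the example tweets are all hypothetical; the features actually produced by the framework are not listed in the abstract.

```python
import re

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def summarise_user(tweets: list[str]) -> dict[str, float]:
    """Reduce one respondent's tweets to aggregate, non-textual metrics."""
    n = len(tweets)
    if n == 0:
        return {
            "n_tweets": 0,
            "mean_chars": 0.0,
            "share_with_hashtag": 0.0,
            "share_with_mention": 0.0,
        }
    return {
        "n_tweets": n,
        # Average tweet length in characters
        "mean_chars": sum(len(t) for t in tweets) / n,
        # Proportion of tweets containing at least one hashtag / mention;
        # the hashtags and mentioned accounts themselves are discarded.
        "share_with_hashtag": sum(bool(HASHTAG.search(t)) for t in tweets) / n,
        "share_with_mention": sum(bool(MENTION.search(t)) for t in tweets) / n,
    }


# Made-up example tweets for one respondent.
example = ["hello #world", "hi @friend", "plain"]
metrics = summarise_user(example)
print(metrics)
```

Only the returned dictionary would be linked to the survey record. Whether such aggregates are safe to release still depends on a disclosure risk assessment, since rare combinations of values could in principle single out an account.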
Dr Deborah Wiltshire (GESIS-Leibniz Institute for the Social Sciences) - Presenting Author
Not all microdata can be anonymised without losing too much detail. For some data, once sufficient detail is removed to make it anonymous, much of its utility is lost. Therefore, pseudonymized data, data that is not fully anonymised, is increasingly made available. Under data protection legislation (GDPR), these data are considered ‘personal data’ and require appropriate safeguards.
There are many positive arguments for making pseudonymized data available: they expand the scope of the research possible, contributing to vital policy-related research and allowing data to be linked together. In the post-pandemic era, their role has been even more important.
Trusted Research Environments (TREs) play an integral role in enabling safe access to sensitive data. In the earlier years of secure access, Safe Rooms (secured, physical locations where researchers could access and analyse these data) were the predominant access route.
Safe Rooms have considerable advantages, not least because they allow secure data services to control almost all factors. However, Safe Rooms have one significant drawback: the burden on researchers to travel, sometimes long distances, to work at a specific location, a burden not all researchers are able to meet equally. This has led to the exploration of remote access options. The pandemic, which led to a lengthy shutdown of Safe Room data access, has further pushed this agenda forward.
The move towards easier, more flexible remote access options is a popular one with researchers but it comes with a dilemma for secure data access facilities – how to manage the differential risks of the different access routes. The 5 Safes Framework has been widely used to structure the decision-making processes in TREs.
This presentation explores how it can now be utilized in managing the differential risks of shifting to new access routes, using the example of the Secure Data Center at