All time references are in CEST
Post-survey data: curation and technical data management
Session Organisers: Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, UCL Social Research Institute) and Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, UCL Social Research Institute)
Time: Wednesday 19 July, 11:00 - 12:30
Post-survey data management processes are a crucial part of the survey lifecycle. These processes ensure the curation, de-identification, documentation, maintenance, and production of high-quality data for research purposes and safe dissemination. Close collaboration with data scientists and other users is needed to understand the requirements of the research community. The data from some surveys may include not just responses to questions but biomedical data and cognitive test results. Data may be collected via multiple modes and in the case of longitudinal surveys will be collected at multiple time points.
Data management setups should ideally be based on robust data management policies, secure data handling standards and well-defined technical protocols. This enables a coherent approach to data quality, efficient workflows, generation of metadata and rapid adaptation to new projects (e.g. Covid-19).
The tools for survey data management are usually written scripts tailored to the survey design. This creates opportunities for programming in statistical software (e.g. SPSS, Stata, SAS) or scripting languages (e.g. R, Python), as well as long-term storage in databases (e.g. MS SQL Server, PostgreSQL, MySQL, MongoDB).
The aim of this session is to create a space to share information, ideas and techniques on post-survey data curation and management. We invite colleagues with data management responsibilities to submit ideas relating to:
• Technical protocols or bespoke scripts for in-house processing of survey data, paradata, and metadata
• Automated survey data processing and its reproducibility on other survey data
• Data quality assurance and validation techniques and syntax for the identification of data inconsistencies or errors
• Implementation of ETL (extract, transform and load) data workflows
• Techniques to set up surveys from the outset (e.g. CAI) to facilitate smooth post-processing
• Use of databases to store and manage data and metadata
• Sharing processing syntax
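As a concrete illustration of the ETL workflows listed above, the following is a minimal sketch in Python. The file name, column names, and the recoding rule are all hypothetical; SQLite stands in for whichever database a team actually uses.

```python
import csv
import sqlite3

# Hypothetical minimal ETL sketch: extract survey responses from a CSV
# export, transform them (here, recoding negative codes to missing), and
# load them into a database table. All names and rules are illustrative.

def run_etl(csv_path: str, db_path: str) -> int:
    """Run the extract-transform-load steps; return number of rows loaded."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS responses (serial TEXT, age INTEGER)"
    )
    n = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):          # extract
            age = int(row["age"])
            if age < 0:                        # transform: negative = missing
                age = None
            con.execute(
                "INSERT INTO responses VALUES (?, ?)", (row["serial"], age)
            )
            n += 1
    con.commit()
    con.close()
    return n
```

In practice each stage would be parameterised by survey-specific configuration so the same pipeline can be rerun on other survey data.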
Keywords: data quality, processing, survey, workflow, database, syntax, metadata
Mrs Catherine Yuen (ISER, University of Essex) - Presenting Author
Understanding Society is the UK Household Longitudinal Study. The Study is based at the Institute for Social and Economic Research at the University of Essex. We follow our participants over a long period of time, giving us a long-term perspective on people’s lives. As a longitudinal study, Understanding Society helps us explore how life in the UK is changing and what stays the same over many years.
The sample size for the Study is large, allowing researchers to investigate the experiences of different sub-groups and ethnic minorities over time.
Understanding Society builds on the successful British Household Panel Survey (BHPS), which ran from 1991 to 2009 and included around 10,000 households. Understanding Society started in 2009 and interviewed around 40,000 households, including around 8,000 of the original BHPS households. The inclusion of the BHPS households allows researchers and policy makers to track the lives of these households from 1991.
Understanding Society is part of CLOSER, a group which brings together world-leading longitudinal studies to maximise their use, value and impact and improve the understanding of key social and biomedical challenges.
Questionnaires and datasets from the CLOSER studies can be searched and browsed using the CLOSER Discovery search engine. This is an online platform that allows users to explore the content of multiple UK longitudinal studies. CLOSER Discovery is regularly updated to include content from each longitudinal study.
We interview participants every year, and each wave runs over two years. Data are delivered from our fieldwork agency once every quarter (eight quarters per wave), and every quarter we process two different waves of data. This presentation will focus on how we automate our processes and what we do to ensure data consistency and longitudinal consistency. It will also cover how we code strings and convert IDs for de-identification before exporting data to researchers.
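An ID conversion step of the kind mentioned above is often implemented as keyed hashing of internal serials. The sketch below is a generic illustration, not the Study's actual procedure; the key and serial format are invented for the example.

```python
import hashlib
import hmac

# Hypothetical sketch of ID conversion for de-identification: internal
# serial numbers are replaced with a keyed (HMAC-SHA256) hash before
# export, so the mapping cannot be reversed without the secret key held
# in-house. The key and field names are illustrative only.

SECRET_KEY = b"replace-with-key-held-securely-in-house"

def pseudonymise(serial: str) -> str:
    digest = hmac.new(SECRET_KEY, serial.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]   # truncated for readability

def pseudonymise_records(records: list[dict]) -> list[dict]:
    """Return export-ready records with the internal serial replaced."""
    return [{**r, "serial": pseudonymise(r["serial"])} for r in records]
```

Because the hash is deterministic for a fixed key, the same participant receives the same pseudonymous ID in every exported wave, which preserves longitudinal linkage for researchers.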
Mr Chandra Shekhar ROY (Bangladesh Bureau of Statistics) - Presenting Author
Mr Mohammad Harun Or RASHID (Labcom Technology)
Abstract: Bangladesh has embarked on an ICT-focused journey to realise the vision of 'Smart Bangladesh 2041'. In the long run, census and survey data can be used exhaustively in the planning process for this transformation. In Bangladesh, several types of official data are released under the Statistics Act. The Bangladesh Bureau of Statistics (BBS) has played a significant role in the preservation of historical statistical data. After the independence of Bangladesh in 1971, a rich repository of statistical microdata was held on IBM 360 to ES/9000 model mainframe tapes dating from the late 1970s to the early 2000s. Almost 8,600 nine-track ½-inch spool tapes were used to preserve those data. Recently, BBS converted all those data from EBCDIC format to ASCII format; around 165 datasets were recovered and have been declared digital assets. The overarching objective is to strengthen the prevailing national statistical archiving system. BBS will make this large volume of converted data available to the citizens of Bangladesh and to the global community, so that academic and scholarly debate can take place taking cognizance of the historical data. Since independence, 2,391 BBS survey and census publications have been digitized and converted to the e-book system. By revisiting time series data, it is hoped that well-informed and meticulous policies can be designed and formulated in the future. Most fundamentally, the availability and easy accessibility of such a large volume of data will inspire a reassessment of economic theories and indicators of development informing Bangladesh's position in global rankings such as the Sustainable Development Goals (SDGs). An alternate backup for data preservation has been established at a distance of 200 km from the NSO headquarters.
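The core of an EBCDIC-to-ASCII conversion like the one described above can be sketched in a few lines of Python. This assumes fixed-width records and the cp037 (EBCDIC US) code page; the actual tapes may use a different EBCDIC variant and record layout, so treat this purely as an illustration.

```python
# Sketch of an EBCDIC-to-ASCII record conversion, assuming fixed-width
# records and the cp037 (EBCDIC US) code page. Real mainframe tapes may
# use a different EBCDIC variant and blocking, so this is illustrative.

def ebcdic_records_to_ascii(raw: bytes, record_length: int) -> list[str]:
    """Split raw tape bytes into fixed-width records and decode each one."""
    records = []
    for offset in range(0, len(raw), record_length):
        chunk = raw[offset:offset + record_length]
        # Decode EBCDIC, then keep only ASCII-representable characters
        text = chunk.decode("cp037").encode("ascii", "replace").decode("ascii")
        records.append(text.rstrip())
    return records
```

A production conversion would also need to handle variable-length records, packed-decimal fields, and tape block headers, none of which are shown here.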
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, University College London) - Presenting Author
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, University College London)
Post-survey data management processes are a key part of the survey lifecycle and involve the curation, de-identification, documentation, maintenance, and production of high-quality data for research purposes and safe dissemination.
The Centre for Longitudinal Studies (CLS) is home to four national longitudinal cohort studies, which follow the lives of tens of thousands of people in the UK. CLS manages the survey data collection, linkage with data from external administrative organisations, data management and data sharing.
Data collected by the CLS longitudinal cohort studies include responses to multiple surveys, biomedical data and cognitive test results. Data are collected via multiple modes at multiple time points.
We will describe the setup of the CLS Research Data Management (RDM) team and the technical work required at the different stages of the CLS survey lifecycle. We will also provide an overview of the technical data management tools (PostgreSQL/Python) and the pipelines followed to share data safely with the international research community.
Dr Vilma Agalioti-Sgompou (Centre for Longitudinal Studies, University College London) - Presenting Author
Ms Maggie Hancock (Centre for Longitudinal Studies, University College London)
Dr Aida Sanchez-Galvez (Centre for Longitudinal Studies, University College London)
Survey datasets include the data (the survey responses) and the metadata (e.g. question text, response options, and value labels). In social research, survey data are usually managed using statistical packages such as SPSS, SAS, Stata or, more recently, R. However, research data management has another dimension that needs to be taken into account: the systematic and structured management of how these datasets relate to each other. Managing relationships between data is crucial for effective day-to-day technical data management tasks and for the provision of data deliverables for research and operational purposes.
This talk will present in-house PostgreSQL/Python software tools developed by the team of data managers/engineers working on longitudinal survey data at the UCL Centre for Longitudinal Studies (CLS). CLS is home to four national longitudinal cohort studies, which follow the lives of tens of thousands of people in the UK. The cohort data have been collected at multiple time points and include surveys conducted through multiple modes, biomedical data, cognitive tests and ongoing data linkages. These rich and highly complex longitudinal data are stored and managed in a large relational database maintained by CLS.
We will present a variety of data pipelines produced to handle the data (ingest, export, recoding, deriving variables, checking data consistency).
The presentation will also cover the PostgreSQL relational database design and case studies of specific data management tasks conducted through this software. We will include code snippets from the syntax written in Python (SQLAlchemy and Pandas) and PostgreSQL to steer discussion on survey data management.
We will also present the lessons learnt (pros and cons) from managing survey data in this technical setting.
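The SQLAlchemy/Pandas pattern described in this abstract might look like the following sketch: read a survey table from a relational database into a DataFrame, derive a variable, and write the result back. An in-memory SQLite engine stands in for the PostgreSQL database, and the table, column names and age bands are invented for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sketch of a derive-variables pipeline step using
# SQLAlchemy and Pandas. The table name ("responses"), columns and age
# bands are illustrative; SQLite stands in for PostgreSQL.

def derive_age_band(engine) -> pd.DataFrame:
    """Read raw responses, derive a coarse age band, write the result back."""
    df = pd.read_sql("SELECT serial, age FROM responses", engine)
    df["age_band"] = pd.cut(
        df["age"],
        bins=[0, 18, 65, 120],
        labels=["0-17", "18-64", "65+"],
        right=False,                     # bins are [0,18), [18,65), [65,120)
    )
    df.to_sql("responses_derived", engine, if_exists="replace", index=False)
    return df

# Usage (with SQLite standing in for PostgreSQL):
# engine = create_engine("postgresql://...")  # or "sqlite://" for testing
# derived = derive_age_band(engine)
```

Keeping derivations as small, named functions like this makes each step individually testable and easy to rerun when a wave of data is redelivered.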