ESRA 2023 Glance Program

All time references are in CEST

Tools and program developments for data analysis
Session Organiser	Ms Xiaoyao Han (DIW Berlin)
Time	Friday 21 July, 09:00 - 10:30
Room	U6-07

The session “Tools and Program Developments for Data Analysis” will cover the latest tools and programs for effective data analysis. Having the right tools and programs is critical to the success of data analysis. By streamlining the analysis process, ensuring accuracy, and making collaboration easier, data analysis tools can help organizations and individual users make better use of their data and gain valuable insights. The session will discuss various tools for data analysis, highlighting their benefits in terms of efficiency, accuracy, scalability, visualization, and adherence to the FAIR principles. Several tools will be introduced, including: a) the Open Data Format, which includes enriched metadata and can be used across various software programs; b) the Amnesia data anonymization tool, which facilitates the trusted sharing of research data while protecting privacy; c) methodological approaches and data visualization tools for producing and disseminating model-based early estimates of key health outcomes; d) SurveyHarmonies, a tool that enables the creation of ex-ante harmonized, multi-language surveys using reproducible research tools compliant with the DDI standards in the R statistical environment; and e) hammock plots for visualizing survey data. These tools are developed for various domains and aim to analyse datasets, automate tasks, and visualize results more efficiently. The session will provide a valuable opportunity to communicate the latest developments in data analysis tools and programs.

Keywords: FAIR data, data visualization, data analysis tools

Papers

Visualizing survey data with hammock plots

Professor Matthias Schonlau (University of Waterloo) - Presenting Author

Visualizing survey data is important for exploratory data analysis. Many survey variables are categorical rather than continuous, and tools other than the scatter plot matrix are needed. I will review hammock plots and related plots (e.g., Sankey, alluvial). I will then use hammock plots to visualize longitudinal changes in variables, missing values, and multiple imputations, and to find coding errors. Hammock plots are implemented in Stata (hammock), Python (hammock), and R (ggparallel).
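
The counting step behind a hammock plot can be sketched briefly. The following is a minimal illustration in plain Python, not the API of the Stata, Python, or R implementations named in the abstract: for each pair of adjacent variables it tallies how many respondents fall into each combination of categories, and a hammock plot then draws one box per combination with a width proportional to that count. The function name `hammock_links`, the `(missing)` label, and the example data are assumptions made for illustration.

```python
from collections import Counter

MISSING = "(missing)"  # hypothetical label so missing answers form their own category


def hammock_links(rows, variables):
    """Count respondents for every category pair of adjacent variables.

    rows      -- list of dicts, one per respondent
    variables -- variable names in the order they appear along the plot axis
    Returns a Counter keyed by (left_var, left_value, right_var, right_value).
    """
    links = Counter()
    for row in rows:
        for left, right in zip(variables, variables[1:]):
            left_value = MISSING if row.get(left) is None else row[left]
            right_value = MISSING if row.get(right) is None else row[right]
            links[(left, left_value, right, right_value)] += 1
    return links


# Made-up example: the same question asked in two survey waves.
survey = [
    {"wave1": "agree", "wave2": "agree"},
    {"wave1": "agree", "wave2": "disagree"},
    {"wave1": "disagree", "wave2": "disagree"},
    {"wave1": "agree", "wave2": None},
]

links = hammock_links(survey, ["wave1", "wave2"])
for (left, left_value, right, right_value), count in sorted(links.items()):
    print(f"{left}={left_value} -> {right}={right_value}: {count}")
```

Treating a missing answer as its own category is what lets item nonresponse show up as a visible band between two waves rather than silently dropping the respondent.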

Identifying and Developing Methodological Approaches and Data Visualization Tools to Produce and Disseminate Model Based Early Estimates of Key Health Outcomes

Dr Lauren Rossen (US National Center for Health Statistics)
Dr Morgan Earp (US National Center for Health Statistics) - Presenting Author
Mr Priyam Patel (US National Center for Health Statistics)
Dr Chris Moriarity (US National Center for Health Statistics)
Dr Diba Khan (US National Center for Health Statistics)

Timely and granular estimates of important health outcomes are essential for public health surveillance, research, and decision-making. The US National Center for Health Statistics (NCHS) is developing and centralizing a set of tools and processes for disseminating early provisional small area estimates of key health outcomes, using data from the National Vital Statistics System (NVSS) and NCHS surveys. Small area estimation (SAE) and related methodologies are widely used to produce estimates for small geographic units (e.g., counties) and small populations (e.g., Hispanic subgroups, American Indian or Alaska Native populations). Nowcasting methods and other model-based approaches can be used to produce timelier estimates of key health outcomes when data are incomplete. Using a combination of SAE and nowcasting methods, NCHS is building a Model Based Early Estimation program to produce timelier and more granular estimates of key health outcomes using a variety of NCHS data systems.

A metadata enriched Open Data Format across Statistical Programs

Ms Claudia Saalbach (DIW Berlin)
Ms Xiaoyao Han (DIW Berlin) - Presenting Author
Mr Knut Wenzig (DIW Berlin)

In the social sciences, a diversity of statistical software is used for data processing and analysis. These software programs have specific data formats and handle metadata in different ways. Proprietary data and a variety of only partially compatible data formats present obstacles to data reuse by researchers. In particular, proprietary data formats undermine the requirement for interoperability embodied in the FAIR principles. To address this problem, this paper proposes the concept of an open data format, which is intended to facilitate data dissemination in the social sciences and to support data analysis across software. The open data format features multilingual metadata and external links to data portals that allow direct access to online documentation from within the statistical software itself. In addition, with technical support, the data format can be loaded and manipulated in popular statistical software, while the metadata are fully transferred and remain usable. This paper begins by describing the specification of the open data format, including data and metadata, which lays the foundation for the subsequent use of the metadata and of programs based on it. In addition, we explore several technical implementations in statistical software that allow the data format to be imported and manipulated together with its metadata.
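
To make the idea of multilingual metadata travelling with the data more concrete, here is a minimal Python sketch. It is not the specification proposed in the paper: the class `VariableMetadata`, its fields, the fallback rule, and the placeholder URL are all assumptions made for illustration. It shows one variable carrying labels in several languages and a link to online documentation, with a lookup that falls back to a default language when a translation is missing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VariableMetadata:
    """Hypothetical metadata record for one variable (not the actual format)."""

    name: str
    # language code -> variable label
    labels: Dict[str, str] = field(default_factory=dict)
    # language code -> (numeric code -> value label)
    value_labels: Dict[str, Dict[int, str]] = field(default_factory=dict)
    # link to online documentation for this variable
    url: Optional[str] = None

    def label(self, language: str, fallback: str = "en") -> str:
        """Return the label in `language`, else in `fallback`, else the name."""
        return self.labels.get(language) or self.labels.get(fallback) or self.name

    def decode(self, code: int, language: str, fallback: str = "en") -> str:
        """Translate a numeric code into its value label, if one is defined."""
        labels = (
            self.value_labels.get(language)
            or self.value_labels.get(fallback)
            or {}
        )
        return labels.get(code, str(code))


sex = VariableMetadata(
    name="sex",
    labels={"en": "Sex of respondent", "de": "Geschlecht der befragten Person"},
    value_labels={
        "en": {1: "male", 2: "female"},
        "de": {1: "männlich", 2: "weiblich"},
    },
    url="https://example.org/variables/sex",  # placeholder, not a real portal link
)

print(sex.label("de"))       # German label is available
print(sex.decode(2, "fr"))   # no French labels, so the English one is used
```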

SurveyHarmonies: Creating ex ante harmonised, multi-language surveys using DDI-compliant reproducible research tools in the R statistical environment

Professor Adrian Dusa (University of Bucharest)
Mr Daniel Antal (Reprex BV)
Dr James Edwards (SINUS Markt- und Sozialforschung GmbH) - Presenting Author

Survey recycling and harmonisation can help mitigate the challenges of data collection during societal crises: reusing validated questions can expedite questionnaire development and reduce cognitive burdens on respondents, while harmonising with open data can allow for shorter questionnaires. However, harmonisation remains both an uncommon competence and a laborious process. This paper introduces an open-source toolkit that automates large parts of the survey harmonisation workflow. Its innovative functions respond to key questions in reproducible research, for instance: how can unit tests for millions of records be created and documented by computer, in ways that can be easily reviewed by humans? And how can such procedures be made user-friendly not just for large surveying institutions, but also for stakeholders like small universities, CSOs, and polling and market research SMEs? Our toolkit is based on the R packages retroharmonize, which enables standardised and documented recoding of variables, labels, and metadata with the help of S3 classes, and DDIwR, which converts data files to and from R, DDI, SPSS, Stata, SAS, and Excel. It also integrates two new packages: declared, which identifies empty and declared missing values, and dataset, which translates DDI Codebook and Lifecycle properties into R objects and adds them to data files. In January 2023, this toolkit will be used to run a multinational survey of music professionals within the EU-funded SurveyHarmonies project (MusicAIRE Grant No. d9ce8c1e-18ef-4ec1-b732-f59e0d818df1). After a technical introduction to the toolkit, this paper presents the SurveyHarmonies findings, harmonising and comparing original data on music professionals’ perceptions of gender inequality and corruption with open data on such perceptions within the general population.
The paper concludes by discussing the transfer potential of harmonised surveying tools to the broader field of cultural sociology, as demonstrated within the upcoming OpenMusE project (Horizon Europe Grant No. 101095295).
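
The distinction between empty and declared missing values mentioned in this abstract can be illustrated with a short Python sketch. It does not reproduce the API of the R packages named above; the function `classify_missing`, the codes -9 and -8, and their reasons are hypothetical. The point is that a declared missing value keeps the reason for its absence (for example, a refusal), whereas an empty value carries no such information.

```python
from collections import Counter


def classify_missing(values, declared_codes):
    """Tally valid, declared-missing, and empty values.

    values         -- raw responses; None stands for an empty cell
    declared_codes -- dict mapping a missing code to its documented reason
    """
    summary = Counter()
    for value in values:
        if value is None:
            summary["empty"] += 1
        elif value in declared_codes:
            summary[f"declared: {declared_codes[value]}"] += 1
        else:
            summary["valid"] += 1
    return summary


# Made-up responses on a 1-5 scale with two hypothetical missing codes.
responses = [1, 4, -9, None, 2, -8, -9]
declared = {-9: "refused", -8: "don't know"}

summary = classify_missing(responses, declared)
for category, count in summary.items():
    print(f"{category}: {count}")
```

Keeping the reasons apart matters for harmonisation: two surveys may use different codes for the same reason, and recoding them consistently requires knowing which is which.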
