ESRA 2023 Glance Program

All time references are in CEST

Survey Data Harmonisation 2

Session Organisers: Dr Ruxandra Comanaru (European Social Survey, City University, London, UK)
Ms Daniela Negoita (European Values Study, Tilburg University, The Netherlands)
Time: Tuesday 18 July, 14:00 - 15:30
Room: U6-01c

The harmonisation of survey data has been a burgeoning research strand within the social sciences over the last few years. At its core, harmonisation is the attempt to make data more comparable, allowing for data linkage and analysis of distinct datasets that were not initially meant to be assessed together. Sound methodology in data harmonisation makes it possible to compare and harmonise instruments ahead of data collection, as well as to evaluate data from various sources that were not initially meant to be compared. As such, survey data can be harmonised ex-ante (during their design, or after fieldwork but before the collected data are issued) or ex-post (combining surveys not meant to be compared at the point of data collection). Some countries have made concerted efforts to harmonise instruments at the point of data collection (ex-ante) so that, for example, the wellbeing of the nation can be tracked across all national statistics surveys, with the same measures used regularly to assess the same concepts. The ex-post approach, by contrast, has been used to tease out insights from sources not designed to be compared a priori. The SUSTAIN 2 (WP2, Task 1) project, for example, tried to bridge data from two long-standing surveys in the European social science context: the European Values Study (EVS) and the European Social Survey (ESS). It aimed to harmonise their data over several decades in order to allow cross-survey and cross-national comparisons, and thus to link measures that have conceptual, and potentially statistical, overlap.
This session, which aims to offer practical prompts for future avenues of cooperation between surveys, invites contributions on all aspects and challenges of data harmonisation: data collection mode, sampling design, translation method, and measurement instruments.

Keywords: data harmonisation, linear stretching, European Social Survey, European Values Study


Connecting and Harmonizing Empirical Social Science Research in Societal Crises

Mr Andrés Saravia (WZB) - Presenting Author
Professor Stefan Liebig (FU Berlin)
Professor Cordula Artelt (LIfBi)
Professor Thorsten Faas (FU Berlin)
Professor Monika Jungbauer-Gans (DZHW)
Professor Anja Strobel (TU Chemnitz)
Professor Mark Trappmann (IAB)

The spread of the Covid-19 pandemic triggered a multitude of research projects, heterogeneous in methodology and data quality, collecting empirical data on the social impact of the pandemic. This development led the German Data Forum (RatSWD), an advisory council to the German government, composed of ten elected representatives from the empirical social, behavioural and economic sciences and ten representatives from data production, to establish a working group that aimed at connecting and harmonizing this research. In response to the Russian attack on Ukraine in February 2022 and the associated emergence of another crisis with enormous political, economic and societal impacts, the working group has broadened its focus to include the topic of the Ukraine war and refugee movements to Germany.

The working group has since initiated networking to foster interoperability of social, behavioural, educational and economic research projects across methodological boundaries. It has collected information on 303 empirical studies in Germany, indexed them by keywords and made this information available to the research community.

In a second step, the working group has compiled a standard questionnaire (to be published in the RatSWD Output Series) to enable interoperability between studies. This questionnaire consists of socio-demographics, general crisis-related items and crisis-specific items on the Covid-19 pandemic, the Ukraine war and climate change.

In a third step, we want to broaden the focus and develop an international exchange of researchers who collect data on the societal impact of crises.

In our presentation, we will give an overview of the results of the working group and discuss starting points for an international exchange in the survey research community.

Revising the Statistic of German Adult Education Centres: How to assess and communicate issues of time-series comparability in a long-term panel study?

Dr Kerstin Hoenig (German Institute for Adult Education - Leibniz Centre for Lifelong Learning) - Presenting Author
Dr Verena Ortmanns (German Institute for Adult Education - Leibniz Centre for Lifelong Learning)

One of the largest quantitative data sources on adult education in Germany is the Statistic of German Adult Education Centres (German Volkshochschulen; AEC). AEC are publicly funded institutions that offer a broad range of courses and educational activities for the general public. The panel study of all German AEC provides annual data at the institutional level, capturing information on staff, finances and expenditures, course offerings, participation rates, cooperation with other organisations, and other activities such as exhibitions or symposia. These data form a long time series, available since 1987, and the annual response rate of the online survey exceeds 98%. The data are used by researchers as well as policy makers and AEC staff. The latter two user groups mainly rely on published reports about the data and on-demand data analyses conducted by AEC Statistic staff.
To update and improve the AEC Statistic, a large revision of the questionnaire took place in 2017/2018. These changes restrict the comparability of many items and the related variables before and after the revision and thus interrupt the time series. The problem is exacerbated by the onset of the COVID-19 pandemic shortly after the revision. This results in two major tasks: Firstly, we have to decide on data harmonization strategies. Secondly, we have to communicate these to different groups of data users, from researchers who are familiar with more advanced techniques and want to manage and analyse the data themselves to policy makers and AEC staff who rely on our data management, analysis and interpretation of results. This presentation shows some examples from the questionnaire and discusses possible harmonization and communication strategies.

Harmony: Development and use of a Natural Language Processing tool to facilitate measurement harmonisation across studies

Dr Bettina Moltrecht (University College London) - Presenting Author
Dr Eoin McElroy (University of Ulster)
Dr Mauricio Hoffmann (Federal University of Santa Maria)
Mr Thomas Wood (fastdatascience)

Integrative epidemiological and intervention research has been hindered by inconsistent approaches to the measurement of common mental health problems. For instance, reviews have estimated that over 280 questionnaires have been used to measure depression. One crucial approach to addressing this is the harmonisation of questionnaires, i.e. identifying similar question items from different scales that tap into the same symptom, and testing their measurement properties and equivalence empirically, thus enabling researchers to compare and combine findings across existing studies, even when different measures have been administered. Successful harmonisation, and thus pooling of data, allows not only for greater statistical power and more refined subgroup analysis, but also enhances the generalizability of findings and the capacity to compare and cross-validate data and findings from different countries. We first present our ongoing work of harmonising mental health measures across five different UK cohort studies. Secondly, we demonstrate the development and use of a new AI-driven tool, “Harmony”, that allows researchers to quickly and efficiently compare and match survey items across multiple studies. Thirdly, we show a research case-example using the Millennium Cohort Study and the Brazil High Risk Cohort study to harmonise items and investigate how factors of social connection impact anxiety and depression symptoms in young people across the two countries. Lastly, we want to invite the audience to discuss challenges around data harmonisation and how the Harmony tool can support them with this.
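
To make the idea of matching question items across scales concrete, the following is a minimal sketch under stated assumptions. It is not the method used by the Harmony tool, which the abstract describes only as AI-driven; it substitutes a plain bag-of-words cosine similarity as a stand-in. The function names and the example items are hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_items(items_a, items_b):
    """For each item in items_a, return (item, best match in items_b, score)."""
    vectors_b = [Counter(tokenize(text)) for text in items_b]
    matches = []
    for item in items_a:
        vector_a = Counter(tokenize(item))
        scores = [cosine(vector_a, vector_b) for vector_b in vectors_b]
        best = max(range(len(items_b)), key=scores.__getitem__)
        matches.append((item, items_b[best], scores[best]))
    return matches


# Hypothetical questionnaire items, invented for this sketch.
study_a = ["I felt nervous or anxious", "I had trouble falling asleep"]
study_b = [
    "Trouble falling or staying asleep",
    "Feeling nervous, anxious or on edge",
    "Poor appetite or overeating",
]

if __name__ == "__main__":
    for item, match, score in match_items(study_a, study_b):
        print(f"{item!r} -> {match!r} (similarity {score:.2f})")
```

Word overlap of this kind misses paraphrases that share no vocabulary, which is one reason a tool might rely on richer language models instead. Any match proposed this way would still need the empirical equivalence testing described above.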

Biased Bivariate Correlations in Insufficiently Harmonized Survey Data

Dr Ranjit K. Singh (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author

Many projects in the social sciences make it necessary to combine data from different sources. That may mean data gathered in different survey modes, different survey programs, or using different survey instruments. Often, we need to perform ex-post harmonization to improve comparability of the source data before combining them to form a homogeneous integrated data product.
In this talk, I will focus on one such comparability issue and demonstrate a consequence of insufficient harmonization. Specifically, I look at the case where two instruments (or modes) lead to different item difficulties: This means that if we applied the two instruments (or modes) to the same population, we would get different mean responses. If such mean differences are not mitigated before combining data, we introduce a mean bias into our composite data. Such mean bias has direct consequences for analyses based on the combined data. In data drawn from the same population, mean bias introduces error variance. In data drawn from different populations, it can bias or even invert true population differences. However, in this paper I demonstrate that mean bias can also bias bivariate correlations which involve the affected variables. If differences in item difficulty are not mitigated before combining data, we introduce a variant of Simpson’s paradox into our data: The bivariate correlation in each source survey might differ substantially from the correlation in the composite dataset. In a set of simulations, I demonstrate this correlation bias effect and show how it changes depending on the mean biases in each source variable and the strength of the underlying true correlation.
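
The correlation bias described above can be illustrated with a small simulation. The sketch below is not the author's simulation design. It assumes two equally sized sources drawn from the same standard bivariate normal population, with the second source's instruments adding a constant shift to each variable. All function names, parameters and default values are assumptions made for illustration. Under these assumptions the pooled correlation is expected to be close to (rho + dx*dy/4) / sqrt((1 + dx^2/4) * (1 + dy^2/4)), where dx and dy are the two mean shifts, while the correlation within each source stays close to rho.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_source(n, rho, shift_x, shift_y, rng):
    """Draw n (x, y) pairs with true correlation rho, then add mean shifts."""
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        xs.append(z1 + shift_x)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2 + shift_y)
    return xs, ys


def simulate(rho=0.3, shift_x=1.0, shift_y=1.0, n=20000, seed=1):
    """Compare within-source correlations with the pooled correlation."""
    rng = random.Random(seed)
    ax, ay = draw_source(n, rho, 0.0, 0.0, rng)          # unbiased source
    bx, by = draw_source(n, rho, shift_x, shift_y, rng)  # mean-shifted source
    return {
        "source_a": pearson(ax, ay),
        "source_b": pearson(bx, by),
        "pooled": pearson(ax + bx, ay + by),  # list concatenation pools cases
    }


if __name__ == "__main__":
    # Shifts in the same direction inflate the pooled correlation;
    # shifts in opposite directions deflate it.
    for dx, dy in [(1.0, 1.0), (1.0, -1.0)]:
        result = simulate(shift_x=dx, shift_y=dy)
        print(dx, dy, {k: round(v, 3) for k, v in result.items()})
```

Removing each source's mean shift before pooling (for example by centring within source, which is appropriate only when the sources represent the same population) would bring the pooled correlation back in line with the within-source values in this setup.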