ESRA 2023 Program


All time references are in CEST

Metadata uplift, machine learning and sustainable methods for metadata curation

Session Organisers Mr Jon Johnson (CLOSER, UCL Social Research Institute)
Dr Suparna De (University of Surrey, Department of Computer Science)
Time Thursday 20 July, 16:00 - 17:30
Room U6-02

The establishment of cross-European infrastructures (European Question Bank, CESSDA Data Catalogue, SSHOC) and standards (DDI, European Language Social Science Thesaurus, Triple) to support FAIR data in the social sciences and humanities will have a significant impact on the level, quality and interoperability requirements of the study metadata needed to support discovery and reuse of both legacy data and future data collections.

Whilst there has been significant progress in the development of technical architectures and the establishment of standards, generating high-quality content remains a challenge, particularly in the early capture of lifecycle metadata and the development of suitable training datasets.

Machine learning and allied technologies offer the possibility of helping both studies and infrastructures to uplift existing metadata, and of providing new automated methods for curating future metadata, so as to sustain FAIR data infrastructures.

In this session we will explore the latest developments in automated and semi-automated metadata curation, to support FAIR data, reuse and interoperability.

Keywords: Metadata, Machine Learning, Computational Social Science


Initial findings from the automation of extraction of metadata from questionnaires and its classification

Mr Jon Johnson (CLOSER, Social Research Institute, UCL)
Dr Suparna De (Dept. Computer Science, University of Surrey) - Presenting Author

Social science archives have a long history of producing well-documented datasets, which include the provenance (questionnaires), data description and methodological annotation. Alongside that, recent efforts to create thesauri such as ELSST, which can be used systematically across the social sciences, offer the possibility of enriching these valuable assets created over the last 50 years. However, this information is currently available mostly as PDFs alongside deposited datasets.

The presentation will show preliminary findings from a project between CLOSER and the University of Surrey, which has used the metadata held in CLOSER Discovery to explore the automation of extraction of provenance data from PDFs of questionnaires, and the classification of the questions and associated data to a subset of ELSST.

The project has used four supervised model architectures (multinomial naive Bayes, LSTM, ULMFiT and BERT), and their enhancements, to explore the strengths of the models for metadata extraction and its utility for classification in a number of different social science and health domains. This has provided valuable insights into both the most suitable methods and the composition of the training data that would be needed to reliably extract metadata from questionnaires and classify the questions and associated data to a suitable ontology.

Thematic Exploration of Interview Materials using NLP tools

Dr Judit Gárdos (Research Documentation Centre, Centre for Social Sciences) - Presenting Author
Dr András Micsik (Department of Distributed Systems, Institute for Computer Science and Control)
Dr Julia Egyed-Gergely (Research Documentation Centre, Centre for Social Sciences)
Ms Enikő Meiszterics (Research Documentation Centre, Centre for Social Sciences)
Mr Balázs Pataki (Department of Distributed Systems, Institute for Computer Science and Control)
Ms Róza Vajda (Research Documentation Centre, Centre for Social Sciences)
Ms Anna Horváth (Research Documentation Centre, Centre for Social Sciences)

Our paper analyzes how thematic exploration and sharing of qualitative research data can be achieved in a multi-language setting. European social science research is a highly interconnected field owing to numerous international funding schemes, yet it is, at the same time, also a fragmented one, due to limitations posed by language barriers. Most prominently, non-English qualitative research data, such as interviews, cannot be used in a way suggested by the FAIR principles of data sharing. FAIR principles describe and prescribe technical characteristics of research archives, while the actual sharing of data produced in smaller languages - an issue recently tackled in some projects - is still unresolved. Our presentation is about the use of artificial intelligence in generating, managing and processing qualitative social science data. It describes the phases and analyzes the results of a project realized by social scientists and IT professionals to facilitate the scrutiny and research of collected interview texts at the Research Documentation Centre of the Centre for Social Sciences in Budapest, Hungary.

Our project aimed to obtain a thematic overview of lengthy interviews and interview collections. To this end, we identified and tested the most promising NLP tools supporting the Hungarian language. As a first step, a suitable domain-oriented taxonomy was created to classify semi-structured texts covering a wide range of topics; we describe the considerations and processes involved in executing this task. By adapting and customizing the European Language Social Science Thesaurus (ELSST), we produced a concise, hierarchical structure of topics relevant to the social sciences in Hungary. This served as the basis for building a manually annotated gold standard for the automated indexing of topics. We compare the efficiency of various automated indexing methods, and describe the implementation of a researcher tool supporting custom visualizations and faceted search.