ESRA 2023 Preliminary Glance Program

All time references are in CEST

Analyzing Open-Ended Questions in Survey Research 2

Session Organiser Dr Alice Barth (Department of Sociology, University of Bonn)
Time Wednesday 19 July, 11:00 - 12:30
Room U6-20

Open-ended questions in surveys provide information on respondents’ personal perspectives and on their interpretation and understanding of concepts. In addition, respondents get an opportunity to give individualized feedback. Open-ended questions are almost indispensable for collecting data on issues that are too diverse for standardized questions, e.g. job designations. While many researchers used to refrain from open-ended questions because of the arduous process of transcribing and coding responses, the rise of computer-assisted data collection and of software solutions for analyzing large amounts of text data can now accelerate the work process. Nevertheless, in dealing with open-ended questions, researchers need a specific methodological toolbox adapted to analyzing unstructured text data.

This session aims at discussing methods for processing and analyzing responses to open-ended questions. Topics of particular interest are, for example,
- coding responses to open-ended questions (manual, semi-automated or automated coding)
- text mining / natural language processing approaches
- qualitative content analysis of responses
- data quality and mode effects in open-ended questions
- open-ended probes as a means of evaluating comparability, validity and comprehension of questions
- analyzing respondent feedback on the survey process
- using information from open-ended questions to complement or contradict results from standardized questions
We look forward to contributions that highlight the methodological and/or substantive potential of open-ended questions in survey research.

Keywords: open-ended questions; data quality; text mining; text data; content analysis

A Semi-Automated Nonresponse Detector (SANDS) model for open-response data

Dr Kristen Hibben (US National Center for Health Statistics) - Presenting Author

Open-ended survey questions or web probes can be valuable because they allow respondents to provide additional information without the constraints of predetermined closed-ended options. However, open-text responses are more prone to item nonresponse, as well as inadequate and irrelevant responses. Further, time and cost factors associated with processing large sets of qualitative data often hinder the use of open-text data.

To address these challenges, we developed the Semi-Automated Nonresponse Detector (SANDS) model, which draws on recent technological advancements in combination with targeted human coding. The model is based on a Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned using Simple Contrastive Sentence Embedding (SimCSE). This powerful approach uses state-of-the-art natural language processing, in contrast to previous nonresponse detection approaches that relied exclusively on rules, regular expressions, or bag-of-words models, which tend to perform less well on short pieces of text, typos, or uncommon words.
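The rule-based baselines contrasted above can be illustrated in a few lines. The patterns below are invented for illustration, not the rules used by any prior system; the sketch shows why regular expressions are brittle in the face of typos and slang, which motivates the embedding-based approach:

```python
import re

# Hypothetical patterns for common nonresponse phrases; an embedding-based
# model replaces this brittle, rule-based approach.
NONRESPONSE_PATTERNS = [
    re.compile(r"^\s*$"),                          # blank answer
    re.compile(r"\b(n/?a|none|nothing)\b", re.I),  # "N/A", "none", ...
    re.compile(r"\b(i\s+)?don'?t\s+know\b", re.I),
    re.compile(r"\bno\s+comment\b", re.I),
]

def is_nonresponse(text: str) -> bool:
    """Flag an open-text answer as item nonresponse if any rule matches."""
    return any(p.search(text) for p in NONRESPONSE_PATTERNS)

print(is_nonresponse("don't know"))   # True
print(is_nonresponse("dunno tbh"))    # False: typo/slang evades the rules
print(is_nonresponse("I worry about rising rent prices"))  # False
```

The last two calls illustrate the failure mode named in the abstract: a rule list cannot anticipate every misspelling or uncommon phrasing, whereas sentence embeddings place "dunno tbh" near "don't know" in vector space.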

In this presentation, we describe our process of training and refining the model. We summarize the results of an extensive evaluation that used open-text responses from a series of web probes as case studies, compared model results against a human-coded source of truth or hand-reviewed random samples, and calculated sensitivity and specificity to quantify model performance. Open-text web probe data come from the Research and Development Survey During COVID-19 surveys created by the National Center for Health Statistics. NORC collected the data in multiple rounds of the survey in 2020 and 2021 using a probability-based panel representative of the US adult English-speaking non-institutionalized population. We also present results of analyses to detect potential bias in the model and its possible implications, information about how others can access the model, and best-practice tips and guidelines.
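The sensitivity and specificity calculations mentioned above follow directly from a confusion matrix of model flags against human-coded labels. A minimal sketch, with made-up labels purely for illustration:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative
    rate) for a binary nonresponse flag: 1 = nonresponse, 0 = adequate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical human codes vs. model flags
human = [1, 1, 1, 0, 0, 0, 0, 1]
model = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(human, model)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity answers "what share of true nonresponses did the model catch?", specificity "what share of adequate responses did it leave alone?"; reporting both guards against a model that trivially flags everything.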

Qualitative data collected online - a trend with consequences? Analysis of the answers to open-ended questions.

Dr Daniela Wetzelhütter (University of Applied Sciences) - Presenting Author

The possibility of generating large numbers of data sets quickly and inexpensively by means of online surveys is increasingly being used to inquire about supposedly "simple" qualitative aspects as well. Respondents are thus increasingly confronted with the task of formulating and writing down answers themselves. Experience shows that this "task" often produces a large number of paraphrased statements, i.e. short, abbreviated answers ("SMS style"), while fully formulated sentences or paragraphs occur comparatively rarely.
The characteristics of these answers are, following the logic of the survey process (previous steps influence subsequent ones), decisive for the quality of the coding. Assuming during coding that "more detailed" answers are more consistently manageable, while classifying shorter answers as rather difficult, can lead to problems in the training of coders. This is because short statements can be concise and thus unambiguous, while, conversely, wordy statements can be imprecise and thus difficult to code. The number of letters or words therefore does not automatically indicate high or low quality of answers, and thus of coding.

Based on 667 responses to an open-ended question (resulting in 14,344 codings by 13 coders), the paper addresses the challenge of identifying responses that (may) lead to consistent or inconsistent coding. One main finding is that coding quality decreases with text length. Coder training must therefore be specifically targeted according to text length.
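The relationship between text length and coding consistency can be checked with a simple intercoder-agreement calculation. The responses and codes below are invented, and plain pairwise percent agreement is used as a stand-in measure, not necessarily the one used in the paper:

```python
from itertools import combinations

def pairwise_agreement(codes):
    """Share of coder pairs that assigned the same code to one response."""
    pairs = list(combinations(codes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Invented responses: (answer text, codes assigned by three coders).
# Short, concise answers can yield perfect agreement, while a long,
# rambling answer can scatter coders across categories.
responses = [
    ("rent", ["housing", "housing", "housing"]),
    ("jobs", ["economy", "economy", "economy"]),
    ("everything is getting worse lately, politics, prices, you name it",
     ["politics", "economy", "other"]),
]

for text, codes in responses:
    print(f"{len(text):3d} chars -> agreement {pairwise_agreement(codes):.2f}")
```

Binning real responses by length and comparing agreement across bins would make the length-quality relationship reported above directly visible.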

Using machine learning to classify open-ended answers to ‘the most important problem’ question in a longitudinal manner

Mr Jan Marquardt (GESIS Leibniz Institut für Sozialwissenschaften, Mannheim) - Presenting Author
Dr Julia Weiß (GESIS Leibniz Institut für Sozialwissenschaften, Mannheim)

Large survey programs often shy away from asking open-ended questions because, when they do, they face the challenge of preparing a large amount of data with limited resources. At the same time, there is great demand among researchers for responses to open-ended questions in categorized form. A typical example of such a question is the “most important problem” question, which is asked in many survey programs, especially in the context of election research. Responses vary widely, ranging from all-encompassing issues, such as climate change, to specific events, such as a political scandal.
Since hand-coding such responses is time-consuming and expensive for survey institutes, some are beginning to use machine learning approaches for coding. This study follows this approach and addresses two questions. First, which models are best suited to code the “most important problem” question? Second, what is a long-term solution that allows for the continuous coding of new surveys while allowing for longitudinal use of the resulting data by researchers?
With the goal of coding approximately 400,000 mentions (collected in the context of the 2021 federal election in Germany), and based on a training dataset of approximately 300,000 already coded mentions (collected for the 2017 federal election), the study compares the results of classical machine learning models, e.g. a support vector machine, with models from the field of deep learning such as BERT. Initial results show that a transformer model like BERT significantly outperforms classical approaches while relying less on time-consuming decisions about data preparation. Furthermore, the use of pre-trained information about word semantics allows the model to remain applicable over time. However, successful continuous categorization requires a well-designed dynamic approach to both the coding and the creation of the underlying codebook.
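The classical baseline in such a comparison, a linear support vector machine over bag-of-words features, can be sketched in a few lines. The "most important problem" mentions and topic codes below are invented for illustration, and scikit-learn is assumed to be available:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented "most important problem" mentions with invented topic codes
texts = [
    "climate change and global warming", "rising temperatures worldwide",
    "the recent political corruption scandal", "bribery affair in parliament",
    "unemployment and job insecurity", "not enough jobs for young people",
]
codes = ["environment", "environment", "scandal", "scandal",
         "economy", "economy"]

# TF-IDF bag-of-words features feeding a linear SVM: the kind of classical
# model such studies compare against pre-trained transformers like BERT
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, codes)

print(clf.predict(["global warming is the biggest issue"]))
```

Unlike BERT, this pipeline learns word semantics only from the training mentions themselves, which is one reason pre-trained models generalize better to new phrasings in later survey waves.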

Automated classification of open-ended questions with BERT

Professor Hyukjun Gweon (University of Western Ontario)
Professor Matthias Schonlau (University of Waterloo) - Presenting Author

Answers to open-ended questions are often manually coded into different categories, which is time-consuming. Automated coding uses statistical/machine learning trained on a small subset of manually coded text answers. The state of the art in natural language processing (NLP) has shifted: a general language model is first pre-trained on vast amounts of unrelated data, and this model is then adapted to a specific application data set. We empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than statistical learning approaches without pre-training.

Automated coding of the open-ended questions in the Comparative Candidates’ Survey: Evidence from Greece and Switzerland

Dr Evangelia Kartsounidou (Aristotle University of Thessaloniki & CERTH) - Presenting Author
Professor Ioannis Andreadis (Aristotle University of Thessaloniki)
Mrs Nursel Alkoç (University of Lausanne)
Professor Anke Tresch (FORS & University of Lausanne)

This study focuses on candidate MPs’ views on the most important problem facing their country, as expressed in their answers to open-ended questions. Analyzing text data from open-ended questions is time-consuming and often limited by the language of study. This limitation frequently deters researchers from using data collected with open-ended questions and makes comparative study of these answers almost impossible. Drawing on new technologies and automatic content analysis techniques, the main objective of this study is to develop a common methodological approach for coding the answers to the open-ended questions in the candidate surveys. Taking the Greek candidate surveys of the last decade (2009-2019) as a starting point, this paper tests a supervised machine learning (SML) approach, using the issue categories from the Comparative Agendas Project, to automatically code the answers of the candidate MPs in Greek and English. Manual and automated coding are compared to measure classification accuracy. We then apply the same approach to datasets from the Swiss candidate survey and compare the results. The ultimate goal of this study is to strengthen the Comparative Candidates Survey (CCS) by developing a common core coding for the answers to the open-ended question about the country’s most important problem across all participating countries and languages, and by creating a harmonized database.