All time references are in CEST
Analyzing Open-Ended Questions in Survey Research 2
|Session Organiser||Dr Alice Barth (Department of Sociology, University of Bonn)|
|Time||Wednesday 19 July, 11:00 - 12:30|
Open-ended questions in surveys provide information on respondents’ personal perspectives and on their interpretation and understanding of concepts; they also give respondents an opportunity for individualized feedback. Open-ended questions are almost indispensable for collecting data on issues that are too diverse for standardized questions, e.g. job designations. While many researchers used to refrain from open-ended questions because of the arduous process of transcribing and coding responses, the rise of computer-assisted data collection and of software for analyzing large amounts of text data can now accelerate the work process. Nevertheless, in dealing with open-ended questions, researchers need a specific methodological toolbox adapted to analyzing unstructured text data.
This session aims at discussing methods for processing and analyzing responses to open-ended questions. Topics of particular interest are, for example,
- coding responses to open-ended questions (manual, semi-automated or automated coding)
- text mining / natural language processing approaches
- qualitative content analysis of responses
- data quality and mode effects in open-ended questions
- open-ended probes as a means of evaluating comparability, validity and comprehension of questions
- analyzing respondent feedback on the survey process
- using information from open-ended questions to complement or contradict results from standardized questions
We look forward to contributions that highlight the methodological and/or substantive potential of open-ended questions in survey research.
Keywords: open-ended questions; data quality; text mining; text data; content analysis
Dr Kristen Hibben (US National Center for Health Statistics) - Presenting Author
Open-ended survey questions or web probes can be valuable because they allow respondents to provide additional information without the constraints of predetermined closed-ended options. However, open-text responses are more prone to item nonresponse, as well as inadequate and irrelevant responses. Further, time and cost factors associated with processing large sets of qualitative data often hinder the use of open-text data.
To address these challenges, we developed the Semi-Automated Nonresponse Detector (SANDS), a model that draws on recent technological advancements in combination with targeted human coding. The model is based on a Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned using Simple Contrastive Sentence Embedding. This approach uses state-of-the-art natural language processing, in contrast to previous nonresponse detection approaches that relied exclusively on rules, regular expressions, or bag-of-words representations, which tend to perform less well on short pieces of text, typos, or uncommon words.
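The intuition behind embedding-based nonresponse detection can be sketched as follows. This is a minimal illustration, not the SANDS implementation: the character-trigram "embedding" below is only a stand-in for the fine-tuned BERT sentence embeddings the model actually uses, and the anchor phrases are invented examples rather than trained parameters.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy character-trigram embedding; a stand-in for a trained
    BERT sentence embedding, used here only to make the sketch runnable."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(x * b[k] for k, x in a.items() if k in b)
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical anchor sets; a real model is trained on human-coded examples.
NONRESPONSE = ["i don't know", "n/a", "none", "no comment", "idk"]
SUBSTANTIVE = ["worried about losing my job", "my children's schooling suffered"]

def is_nonresponse(text):
    """Flag a response whose embedding sits closer to the nonresponse
    anchors than to the substantive anchors."""
    e = embed(text)
    near_nr = max(cosine(e, embed(a)) for a in NONRESPONSE)
    near_sub = max(cosine(e, embed(a)) for a in SUBSTANTIVE)
    return near_nr >= near_sub
```

The appeal of dense embeddings over rules or bag-of-words, as the abstract notes, is robustness to typos and short texts: "dont know" lands near "i don't know" in embedding space even though no rule matches it exactly.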
In this presentation, we describe our process of training and refining the model. We summarize the results of an extensive evaluation, including the use of open-text responses from a series of web probes as case studies, comparison of model results against a human-coded source of truth or hand-reviewed random samples, and sensitivity and specificity calculations to quantify model performance. The open-text web probe data come from the Research and Development Survey During COVID-19 surveys created by the National Center for Health Statistics. NORC collected the data in multiple rounds of the survey in 2020 and 2021 using a probability-based panel representative of the US adult English-speaking non-institutionalized population. We also present the results of analyses to detect potential bias in the model and its possible implications, information on how others can access the model, and best-practice tips and guidelines.
Dr Daniela Wetzelhütter (University of Applied Sciences) - Presenting Author
The possibility of generating a large number of data sets quickly and inexpensively by means of online surveys is increasingly used to collect supposedly "simple" qualitative information as well. Respondents are thus increasingly confronted with the task of formulating and writing down answers themselves. Experience shows that this "task" often yields a large number of paraphrased statements, i.e. short, abbreviated answers ("SMS style"), while fully formulated sentences or paragraphs occur comparatively rarely.
Following the logic of the survey process (previous steps influence subsequent ones), the characteristics of these answers are decisive for the quality of the coding. Assuming that "more detailed" answers can be coded more consistently, while classifying shorter answers as rather difficult, can lead to problems in coder training. This is because short statements can be concise and thus unambiguous and, conversely, wordy statements can be imprecise and thus difficult to code. The number of letters or words therefore does not automatically indicate high or low quality of answers, and thus of coding. Based on 667 responses to an open-ended question (resulting in 14,344 codings by 13 coders), the paper addresses the challenge of identifying responses that (may) lead to consistent or inconsistent coding. One main finding is that coding quality decreases with text length. Coder training must therefore be specifically targeted to different text lengths.
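The kind of length-bucketed consistency check described above can be sketched as follows. The word-count cut-offs and the agreement measure (share of coders assigning the modal code) are illustrative assumptions, not the study's actual operationalization:

```python
from collections import Counter, defaultdict

def modal_agreement(codes):
    """Share of coders who assigned the most frequent code to one response."""
    counts = Counter(codes)
    return counts.most_common(1)[0][1] / len(codes)

def agreement_by_length(responses):
    """responses: list of (text, [one code per coder]) pairs.
    Returns mean modal agreement per word-count bucket.
    The bucket boundaries here are hypothetical."""
    buckets = defaultdict(list)
    for text, codes in responses:
        n = len(text.split())
        bucket = ("short (<=3 words)" if n <= 3
                  else "medium (4-15)" if n <= 15
                  else "long (>15)")
        buckets[bucket].append(modal_agreement(codes))
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```

Comparing the per-bucket means makes the paper's point measurable: if agreement drops in the long bucket, coder training can be targeted at wordy, imprecise answers rather than at short ones.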
Mr Jan Marquardt (GESIS Leibniz Institut für Sozialwissenschaften, Mannheim) - Presenting Author
Dr Julia Weiß (GESIS Leibniz Institut für Sozialwissenschaften, Mannheim)
Large survey programs often shy away from open-ended questions because, when they do ask them, they face the challenge of preparing a large amount of data with limited resources. At the same time, there is great demand among researchers for responses to open-ended questions in categorized form. A typical example of such a question is the “most important problem,” which is asked in many survey programs, especially in the context of election research. Responses vary widely, ranging from all-encompassing issues, such as climate change, to specific occasions, such as a political scandal.
Since hand-coding such responses is time-consuming and expensive for survey institutes, some are beginning to use machine learning approaches for coding. This study follows this approach and addresses two questions. First, which models are best suited to code the “most important problem” question? Second, what is a long-term solution that allows for the continuous coding of new surveys while allowing for longitudinal use of the resulting data by researchers?
With the goal of coding approximately 400,000 mentions (collected in the context of the 2021 federal election in Germany), and based on a training dataset of approximately 300,000 already-coded mentions (collected for the 2017 federal election), the study compares the results of classical machine learning models, e.g. a support vector machine, with models from the field of deep learning, such as BERT. Initial results show that a transformer model like BERT significantly outperforms classical approaches while relying less on time-consuming decisions about data preparation. Furthermore, the use of pre-trained information about word semantics allows the model to remain applicable over time. However, successful continuous categorization requires a well-designed dynamic approach to both the coding and the creation of the underlying codebook.
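The classical end of this comparison can be illustrated with a minimal bag-of-words classifier in pure Python. Nearest-centroid over TF-IDF vectors stands in here for the SVM baseline (and the BERT side is omitted entirely); all texts, labels, and category names are invented examples, not study data:

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(texts):
    """Tokenize by whitespace and weight term counts by inverse document
    frequency; returns the document vectors and the idf table."""
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(w for d in docs for w in d)       # document frequency
    n = len(docs)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in d.items()} for d in docs], idf

def cosine(a, b):
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(texts, labels):
    """One centroid per category, averaged over that category's vectors."""
    vecs, idf = tfidf_vectors(texts)
    by_label = defaultdict(list)
    for v, y in zip(vecs, labels):
        by_label[y].append(v)
    centroids = {}
    for y, vs in by_label.items():
        c = defaultdict(float)
        for v in vs:
            for w, x in v.items():
                c[w] += x / len(vs)
        centroids[y] = c
    return centroids, idf

def predict(text, centroids, idf):
    """Assign the category whose centroid is most similar to the mention."""
    v = {w: c * idf.get(w, 1.0) for w, c in Counter(text.lower().split()).items()}
    return max(centroids, key=lambda y: cosine(v, centroids[y]))
```

A model like this depends entirely on surface word overlap, which is exactly the weakness the abstract attributes to classical approaches: a pre-trained transformer can match "global warming" to a climate category without the literal tokens appearing in the training data.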
Professor Hyukjun Gweon (University of Western Ontario)
Professor Matthias Schonlau (University of Waterloo) - Presenting Author
Answers to open-ended questions are often manually coded into different categories, which is time-consuming. Automated coding uses statistical/machine learning to train on a small subset of manually coded text answers. The state of the art in natural language processing (NLP) has shifted: a general language model is first pre-trained on vast amounts of unrelated data, and this model is then adapted to a specific application data set. We empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than non-pre-trained statistical learning approaches.
Ms Kübra Annac (Witten/Herdecke University) - Presenting Author
Dr Yüce Yilmaz-Aslan (Witten/Herdecke University)
Professor Patrick Brzoska (Witten/Herdecke University)
During the COVID-19 pandemic, ensuring patient-centered rehabilitation while minimizing the risk of infection has been challenging for rehabilitation facilities. The aim of this cross-sectional study was to investigate which problems directors of rehabilitation facilities have encountered in addressing the pandemic and where there is still a need for support. Methodologically, different types of survey questions and modes were explored for this purpose.
By means of an online survey with an additional postal survey, all medical rehabilitation facilities in Germany (n=1,629) were surveyed between June and September 2021. The questionnaire was developed in collaboration with experts, based on previous research results as well as on the literature. Closed-ended questions with predefined choices and in-depth open-ended questions were used. Responses were available from 535 facilities (response rate=32.8%) and were analyzed descriptively. Responses to open-ended questions were analyzed using qualitative content analysis.
50.8% of the facilities had problems implementing measures concerning processes and structures. 36.7% of respondents considered the implementation of some types of therapy to be problematic. The open-ended questions about the reasons for the challenges encountered generated up to 298 extensive responses each. The responses revealed that primarily financial factors, a deficient infrastructure, and a lack of human resources were obstacles to implementation. Factors promoting implementation were financial support from health care organizations, increased digitalization of services, and uniform regulations.
The study illustrates the methodological and substantive potential of open-ended questions in quantitative surveys – in the present case, identifying deficits and showing how systematic change in the health care system can be initiated through adequate multi-stakeholder communication. To exploit the full potential of open-ended questions in quantitative surveys, strategies need to be explored for effectively processing and analyzing large amounts of qualitative data.