ESRA 2019 Draft Programme at a Glance

Evaluating Survey Response Scales 4

Session Organisers Dr Morgan Earp (US Bureau of Labor Statistics)
Dr Robin Kaplan (US Bureau of Labor Statistics)
Dr Jean Fox (US Bureau of Labor Statistics)
Time: Friday 19th July, 09:00 - 10:30
Room D22

The accurate measurement of constructs in surveys depends on the use of valid and reliable item scales. Response scales come in many shapes and sizes and can vary in their use of modifiers, such as “very” versus “extremely.” They can also vary in features such as the number of response options, inclusion of numeric and/or semantic labels, scale direction, unipolar versus bipolar response options, and scale orientation. Item scales can vary, too, in their ability to distinguish between latent trait levels; some response options may provide more item characteristic information than others. Furthermore, with the variety of modes now available (such as web, mobile, and SMS text, as well as paper), there are additional considerations regarding how response scales can be presented (for example, single-item vs. matrix scales). With so many factors to consider, it can be difficult to know how to develop the optimal response scale for a particular construct or mode. This panel focuses on how item response scales affect survey response and data quality, using a variety of scale evaluation techniques including, but not limited to, psychometric techniques. We invite submissions that explore all aspects of scale development and assessment, including:
(1) The impact of various question design features such as scale direction, scale length, horizontal vs. vertical scale orientation, use of modifiers, numeric labels, number of response options, etc. on survey response and data quality.
(2) The development and assessment of response scales across different data collection modes.
(3) The use of psychometric and statistical measures for evaluating response scales, for example, item characteristic curves, differential item functioning, item invariance, different measures of reliability and validity, etc.
(4) Approaches for determining scale measurement invariance across different modes and devices (e.g., mobile).
(5) Comparisons of item-by-item versus matrix questions.
(6) Research showing the impact of different modifiers (for example, “a little” vs. “somewhat”).
(7) Exploration of differential item functioning and item invariance for varying item response scales.
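As a toy illustration of the item characteristic curves mentioned in point (3), the sketch below evaluates a two-parameter logistic (2PL) IRT item response function; the item parameters are hypothetical and chosen only to show how discrimination and difficulty shape the curve:

```python
import math

def icc_2pl(theta, a, b):
    """Item characteristic curve under a 2PL IRT model:
    P(endorse | theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is discrimination and b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical items: (discrimination a, difficulty b) are illustrative only.
items = {"easy_item": (1.2, -1.0), "hard_item": (1.2, 1.0)}

for name, (a, b) in items.items():
    # Endorsement probabilities at low, average, and high trait levels.
    probs = [round(icc_2pl(theta, a, b), 2) for theta in (-2.0, 0.0, 2.0)]
    print(name, probs)
```

Items providing more "item characteristic information" in the session's sense are those whose curves rise steeply (high a) near the trait levels of interest.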

Keywords: response scales, scale development, psychometrics, item response theory, confirmatory factor analysis

The effects of differences in response scale in cross-national surveys

Professor Noriko Iwai (JGSS Research Center, Osaka University of Commerce) - Presenting Author
Dr Satomi Yoshino (JGSS Research Center, Osaka University of Commerce)

Since 2003, the four East Asian Social Survey (EASS) project teams have been developing common modules (about 60 questions), incorporating them into social surveys regularly conducted in each country and region, and producing integrated data for cross-national/societal comparison: the EASS 2006 Family module, EASS 2008 Culture and Globalization module, EASS 2010 Health module, EASS 2012 Social Network Capital module, EASS 2014/15 Work Life module, and EASS 2016 Family module. In designing questions, the four teams have been very careful about response scales, the wording of categories, and their translation from the source language, English, into each target language. Based on preliminary surveys, EASS decided to use a balanced seven-point scale anchored by "strongly agree" and "strongly disagree" for attitudinal questions. For a question on one's subjective health condition, EASS decided to use a balanced five-point scale anchored by "very good" and "very bad." However, when developing the EASS 2010 Health module, the four teams decided to incorporate the SF-12 into the module, which resulted in confusing response distributions for subjective health. The confusion arose because the response categories of the SF-12 had been translated slightly differently in Japan, Korea, and China by the researchers who had first introduced the SF-12 in those countries. We will discuss this issue and also present results from our split-ballot survey (JGSS-2017) on the response scale of the Grit-S.

"How healthy are you actually?" - The influence of response scales on the assessment of subjective health

Mrs Regina Jutz (GESIS Leibniz Institute for the Social Sciences) - Presenting Author
Professor Christof Wolf (GESIS Leibniz Institute for the Social Sciences)

Measurements of subjective health, an acknowledged indicator of general health status, are often used in social science research, and the questions seem similar across various comparative studies. Although a 5-point rating scale is usually used, there are differences in the categories that can influence the assessment of respondents’ health status. There are two main versions: the so-called European version uses a balanced 5-point scale with a middle category (1 very good, 2 good, 3 fair, 4 poor, 5 very poor), while the American or international version uses a 5-point scale with predominantly positive categories (1 excellent, 2 very good, 3 good, 4 fair, 5 poor).
In Germany, through the joint implementation of the German General Social Survey ALLBUS and the International Social Survey Programme ISSP, we can test how the different response categories affect the assessment of subjective health. Since the two survey programmes use different response categories, respondents are asked to answer both versions. Using data from several years, we study the distributions of the two versions and their relationships to socio-demographic variables. Higher educational groups tend to rate their health better than lower educational groups, so their responses should be more differentiated when more categories on the positive end are offered (excellent, very good, good). The opposite should apply to lower educational groups, which would benefit from more nuance among the less-than-good categories (fair, poor, very poor). Neither version is superior per se; rather, the advantages and drawbacks of each depend on the research question. For example, when the goal is to identify groups with poor health, the European version may be better. We will also present recommendations for dichotomizing into good and poor health, a common practice in public health research.
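As a minimal sketch of the dichotomization practice mentioned above, the two 5-point versions could be recoded into a good/poor indicator as follows; the cut points are illustrative assumptions, not the authors' recommendation:

```python
# Illustrative recoding of the two 5-point subjective-health scales
# into a good/poor dichotomy. Cut points are hypothetical assumptions,
# not the recommendation the authors will present.
EUROPEAN = {1: "very good", 2: "good", 3: "fair", 4: "poor", 5: "very poor"}
AMERICAN = {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"}

def dichotomize(value, version):
    """Collapse a 5-point health rating into 'good' vs 'poor'.
    Assumed cuts: European 1-2 -> good; American 1-3 -> good."""
    cut = 2 if version == "european" else 3
    return "good" if value <= cut else "poor"

print(dichotomize(2, "european"))  # "good"  (category: good)
print(dichotomize(4, "american"))  # "poor"  (category: fair)
```

Note how the same verbal category ("fair") lands on different sides of the cut depending on the version, which is exactly why the choice of cut point matters for comparability.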

Asking about Ideology: Experiments in Western Europe

Mr Jonathan Evans (Pew Research Center) - Presenting Author
Ms Martha McRoy (Pew Research Center)
Mr Scott Gardner (Pew Research Center)
Ms Stacy Pancratz (Pew Research Center)
Dr Neha Sahgal (Pew Research Center)
Ms Ariana Salazar (Pew Research Center)
Ms Kelsey Starr (Pew Research Center)
Dr Patrick Moynihan (Pew Research Center)

The increasingly polarized political landscape in Europe has renewed interest in the quality of survey measures of ideology. Of course, determining valid and reliable indicators of multidimensional concepts is always challenging—even more so within a multinational context given the cultural variety involved. Using nationally representative data from Pew Research Center telephone surveys conducted in 2017 and 2018 across France, Germany, Spain and the UK, we use two question-wording experiments to illustrate the sensitivity of a standard "left-right" political ideology measure to scale label modifications. The standard question for both experiments uses a seven-point scale with only endpoints labeled: "0 indicating extreme left" and "6 indicating extreme right."

In the first experiment, we use a split-ballot format and modify the standard question with different numeric endpoint labels—that is, moving from a 0-6 scale to 1-7 to avoid any negative association of "extreme left" with "0" (which could be interpreted as irrelevance).

The second experiment asks all respondents, in addition to the standard question, to describe themselves using a five-point, fully labeled scale that omits numeric values altogether: left, leaning left, center, leaning right, or right. The literature suggests that a fully labeled scale without numeric values is less cognitively burdensome for respondents, yielding more valid and reliable data.

For each experiment, we compare the standard question to the alternatives in terms of the distribution of political ideology (including item nonresponse), the demographic profiles and party affiliations of "left" versus "right" respondents, and the correlation between ideology and other attitudinal and behavioral variables. For the second experiment, as both ideology questions are posed to all respondents, we analyze respondents whose answers are inconsistent across the two ideology measures in terms of demographics and attitudes. We conclude with recommendations on the value of specific elements of political ideology questions for each country surveyed.

On the Utility of Genetically Modified Salmon to Differentiate between Attitudes and Opinions

Dr Ana Muñoz van den Eynde (Research Unit on Scientific Culture - CIEMAT) - Presenting Author

From the perspective of survey methodology, there is great overlap between attitudes and opinions, to the point that one is usually identified with the other. Psychological models of attitudes, on the other hand, tend to differentiate them from opinions: while the distinctive feature of attitudes is the evaluation of an entity in terms of approval or disapproval, opinions associate an object (in the broadest sense of the term) with an attribute. The literature has repeatedly documented attitudinal inconsistency in the population and a lack of correspondence between attitudes and behavior. We consider that this finding is related to two facts. First, attitude questions in surveys usually use an Agreement/Disagreement (AD) response format, which has been found to generate significant cognitive burden because it does not make clear to respondents which dimension is of interest to the researcher. The satisficing hypothesis is based on the presumption that answering survey questions usually entails hard cognitive work, and these cognitive demands may exceed respondents’ motivation or ability. Respondents will therefore look for cues in the question suggesting how to offer a seemingly defensible answer without devoting much cognitive effort; in these circumstances, satisficing may lead them to employ a variety of response strategies instead of providing their true judgment. Second, AD questions actually measure opinions instead of attitudes. In this contribution we present evidence that, to measure attitudes, questions must confront respondents with a choice that implies a value judgment, as in the decision to buy a genetically modified salmon. Our results indicate that if we really aim to measure attitudes, we must design questions that identify what respondents feel about the attitude object, not what they think about it.

Validating a Measure of Numeracy Skill Use in the Workplace for Incarcerated and Household Adults

Miss Emily Buehler (University of Manchester) - Presenting Author

The aim of this research is to construct a measure of numeracy skill use in the workplace that can illuminate how jobs held by incarcerated adults compare to those in the general population with regard to the opportunities they offer for important cognitive development. The 2012/2014 Survey of Adult Skills, conducted by the Programme for the International Assessment of Adult Competencies (PIAAC), asked nationally representative samples of incarcerated and household adults about the type and frequency of numeracy tasks performed as part of their jobs. This research takes these items from the survey's background questionnaire and creates a measure of numeracy skill use in the workplace using the principles of the Rasch rating scale model (RSM). In the interest of exploring options for strengthened validity, response categories were collapsed to produce an optimal categorization structure. Collapsing the central three response categories produced a measure with higher person separation and reliability coefficients. However, this also resulted in findings of slight to moderate differential item functioning (DIF) for three items, indicating that incarcerated and household adults with the same latent level of numeracy skill use had different probabilities of endorsing those items. A qualitative evaluation of the items displaying DIF, together with correlations between the person estimates produced by the original and collapsed-category measures, determined that the variation between groups is likely due to real differences between prison and free labor market work environments and does not substantially alter estimates of individual numeracy skill use. Findings from this research suggest that a measure of numeracy skill use in the workplace for incarcerated and household adults could potentially be improved with fewer response categories and more items asking about a broader range of numeracy tasks.
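The category-collapsing step described in the abstract can be sketched as a simple recode; the 0-4 coding of a 5-category frequency scale is an assumption for illustration, and fitting the Rasch RSM itself would require a dedicated IRT package:

```python
def collapse_central(response):
    """Collapse the central three categories of a 5-category rating
    (coded 0..4) into one, yielding a 3-category scale coded 0, 1, 2.
    The 0..4 coding is a hypothetical assumption for illustration."""
    if response == 0:
        return 0          # lowest category kept as-is
    if response in (1, 2, 3):
        return 1          # central three categories merged
    return 2              # highest category kept as-is

# Recode a hypothetical respondent's item responses before refitting the RSM.
original = [0, 1, 2, 3, 4, 2, 0]
collapsed = [collapse_central(r) for r in original]
print(collapsed)  # [0, 1, 1, 1, 2, 1, 0]
```

After such a recode, the model would be refit and person separation, reliability, and DIF statistics compared against the original categorization, as the abstract describes.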