ESRA 2019 Draft Programme at a Glance


Evaluating Survey Response Scales 1

Session Organisers Dr Morgan Earp (US Bureau of Labor Statistics)
Dr Robin Kaplan (US Bureau of Labor Statistics)
Dr Jean Fox (US Bureau of Labor Statistics)
Time Thursday 18th July, 09:00 - 10:30
Room D22

The accurate measurement of constructs in surveys depends on the use of valid and reliable item scales. Response scales come in all shapes and sizes and can vary in their use of modifiers, such as “very” versus “extremely.” They can vary in features such as the number of response options, inclusion of numeric and/or semantic labels, scale direction, unipolar versus bipolar response options, and scale orientation. Item scales can also vary in their ability to distinguish between latent trait levels; some response options may provide more item characteristic information than others. Furthermore, with the variety of modes now available (such as web, mobile, and SMS text, as well as paper), there are additional considerations regarding how response scales can be presented (for example, single-item vs. matrix scales). With so many factors to consider, it can be difficult to know how to develop the optimal response scale for a particular construct or mode. This panel focuses on how item response scales affect survey response and data quality, using a variety of scale evaluation techniques including, but not limited to, psychometric techniques. We invite submissions that explore all aspects of scale development and assessment, including:
(1) The impact of various question design features, such as scale direction, scale length, horizontal vs. vertical scale orientation, use of modifiers, numeric labels, and number of response options, on survey response and data quality.
(2) The development and assessment of response scales across different data collection modes.
(3) The use of psychometric and statistical measures for evaluating response scales, for example, item characteristics curves, differential item functioning, item invariance, different measures of reliability and validity, etc.
(4) Approaches for determining scale measurement invariance across different modes and devices (e.g., mobile).
(5) Comparisons of item-by-item versus matrix questions.
(6) Research showing the impact of different modifiers (for example, “a little” vs. “somewhat”).
(7) Exploration of differential item functioning and item invariance for varying item response scales.

Keywords: response scales, scale development, psychometrics, item response theory, confirmatory factor analysis

Response Scales and the Measurement of Racial Attitudes: Agree-Disagree versus Construct Specific Formats

Professor David Wilson (University of Delaware) - Presenting Author
Professor Darren Davis (University of Notre Dame)
Dr Jennifer Dykema (University of Wisconsin)
Professor Nora Cate Schaeffer (University of Wisconsin)

Over the past four decades, the literature on racial attitudes suggests that negative sentiments are founded less on notions of biological inferiority and more on contemporary concerns related to values, also known as the “new racism.” Most of the measures supporting this argument, including the Modern Racism Scale, Symbolic Racism Scale, and Racial Resentment Scale, are based on response scales using a standard agree-disagree (AD) question format, which makes it easier for individuals to pay less attention to question content, respond to social desirability norms, and strive for cognitive consistency. While popular, AD scales have been shown to produce acquiescence bias and inferior quality data, leading some to suggest that Construct Specific (CS) response options may be more effective. Yet questions remain regarding the extent to which measures intended to capture the newer racism concepts contain systematic measurement error related to the AD format. Using a national random sample of panel members recruited through the Cooperative Congressional Election Study (CCES), we examine response patterns to two similar racial resentment scales, one using AD-formatted items and the other using CS-formatted items. We find slightly better scale measurement properties for the AD scale than for the CS scale, but roughly equal concurrent validity. We also find a pervasive pattern of acquiescence bias among the AD items, and a few CS items, but only after controlling for education levels, perhaps because of education’s impact on racial attitudes rather than the AD format per se. The results suggest that AD items are at least on par with CS-worded items, and that racial attitudes may be less susceptible than other topics to the problems associated with AD formatting. Yet we also conclude that CS items might soon find their way into questionnaires measuring racial attitudes.


New Mobile-friendly Labels: Using Numbers as Labels

Professor Randall Thomas (Ipsos) - Presenting Author
Dr Frances Barlas (Ipsos)

Fully-anchored, semantically labeled scales have been found to have somewhat higher levels of validity than end-anchored scales (Krosnick, 1999). However, since most smartphone respondents take online surveys in portrait orientation, even the horizontal presentation of five categories with full semantic labels can extend off the screen. We developed an alternative for smartphone screens: replacing all semantic labels with numeric labels as clickable buttons to anchor the responses, with plus or minus indicators (e.g., How much do you like doing X? -2 -1 0 +1 +2). While researchers often provide end labels for respondents, either in the response field or in a respondent instruction in the item stem (e.g., On a scale of 0 to 10, where ‘0’ means ‘Do not like’ and ’10’ means ‘Strongly like’…), we eliminated all semantic labels, allowing us to save screen space. Our interest was whether respondents, when left to their own interpretations of the meaning and spacing of the responses, would use the numbers equivalently to semantically labeled responses. In this paper, we summarize four studies in which we randomly assigned respondents to receive either a semantic or a numeric response format, employing a wide variety of scale types, including affective, evaluative, intensity, and behavioral measures. We found that numeric formats were completed more quickly and showed response distributions and validity equivalent to those of semantic formats. In one experiment, we compared the use of a respondent instruction that defined the end points (Use a scale of 1 to 5, where 1=Not at all important and 5=Very important.) versus no respondent instruction. When presented with the respondent instruction, respondents treated the scale as an end-anchored scale, while without the instruction the response distribution was more similar to that of a semantic format. We discuss future applications and some limitations we’ve encountered in our exploration of the numeric format.


How Good is “Good”? Experimental Studies of Individual Interpretations of Response Options in Likert-Type Scales Using VAS as an Evaluation Tool

Mr Elias Markstedt (University of Gothenburg) - Presenting Author
Dr Elina Lindgren (University of Gothenburg)
Dr Johan Martinsson (University of Gothenburg)

Likert-type scales are widely used to measure attitudes in surveys. Inferences from Likert scales rest on two assumptions: 1) an equal distance between each consecutive scale point, and 2) an equal interpretation of the response options across individuals and questions. This paper investigates the two assumptions through two web-based survey experiments with members of the Swedish online Citizen Panel, using visual analog scales (VAS). In experiment 1 (N=12,905), respondents were asked to position all response options of a 4-point and a 5-point Likert item on a 0–100 point VAS. This experiment aimed to test how respondents evaluate labels in a familiar context (they see all response options simultaneously and respond to a typical survey question). In experiment 2 (N=12,090), respondents were randomly assigned to evaluate the strength of a hypothetical other respondent’s answer on three attitudinal or behavioral questions. This experiment aimed to test how respondents evaluate response options when there is no scale context, thereby showing which wording choices come closest to equidistance. Experiment 1 yielded support for assumption 1: the distances between consecutive response options were perceived as equidistant. However, assumption 2 was violated in that interpretations of the total distance between the designated endpoints varied by educational level and age (the endpoint options were perceived as more extreme by some). Experiment 2 also demonstrated violations of the second assumption, in that interpretations varied by question subject. The findings add to contemporary research by showing that the assumptions for Likert scales may not always be fulfilled, and they specify factors that can condition the violations: respondents’ demographic background and question subject. These findings are important because violations of these assumptions, which are often taken for granted without further testing, call into question the reliability of inferences based on Likert scales.


Evaluating Qualifiers in Rating Scales

Dr Morgan Earp (US Bureau of Labor Statistics) - Presenting Author
Dr Jean Fox (US Bureau of Labor Statistics)
Dr Robin Kaplan (US Bureau of Labor Statistics)

Researchers in many domains use surveys to standardize data collection across respondents. These surveys often ask respondents to rate some construct, such as satisfaction or importance, typically using a 5- or 7-point scale. These scales use a variety of qualifiers, such as “Very,” “Quite a bit,” “Somewhat,” “A little,” or “Not at all.” However, these qualifiers are subjective, and respondents may interpret them differently. Additionally, it is often unclear whether respondents are considering the full range of the responses, or whether they can distinguish differences between adjacent response categories. To investigate these questions, we conducted two phases of research. First, we used Item Response Theory (IRT) to evaluate whether a variety of scales effectively captured the full range of the construct. The surveys we used were mostly customer satisfaction surveys, which included constructs such as satisfaction, effort, and ease of use. They spanned three categories of qualifiers: Strength/Intensity (e.g., Not at all, Somewhat, Very), Frequency (e.g., Never, Sometimes, Occasionally), and Evaluation (e.g., Good, Neutral, Bad). We present and compare IRT item step thresholds across latent constructs and qualifier categories. In the second phase of this research, we asked participants (N=600) to complete an online survey in which they assigned values on a scale from 0 to 100 to different qualifiers. The values indicated “how much” the respondent felt each qualifier represented; for example, “Extremely” might receive a high rating while “None at all” might receive a low rating. Reviewing the values from the study should help researchers generate scales that better measure the full range of a construct and determine when qualifiers are not noticeably different from one another. We discuss the implications of this research for the design of valid and reliable survey response scales.