ESRA 2019


Evaluating Survey Response Scales 2

Session Organisers: Dr Morgan Earp (US Bureau of Labor Statistics), Dr Robin Kaplan (US Bureau of Labor Statistics), Dr Jean Fox (US Bureau of Labor Statistics)
Time: Thursday 18th July, 16:00 - 17:30
Room: D22

The accurate measurement of constructs in surveys depends on the use of valid and reliable item scales. Response scales come in all shapes and sizes and can vary in their use of modifiers, such as “very” versus “extremely.” They can vary in features such as the number of response options, inclusion of numeric and/or semantic labels, scale direction, unipolar versus bipolar response options, and scale orientation. Item scales can also vary in their ability to distinguish between latent trait levels; some response options may provide more item characteristic information than others. Furthermore, with the variety of modes now available (such as web, mobile, and SMS text, as well as paper), there are additional considerations regarding how response scales can be presented (for example, single-item vs. matrix scales). With so many factors to consider, it can be difficult to know how to develop the optimal response scale for a particular construct or mode. This panel focuses on how item response scales affect survey response and data quality, using a variety of scale evaluation techniques including, but not limited to, psychometric techniques. We invite submissions that explore all aspects of scale development and assessment, including:
(1) The impact of question design features, such as scale direction, scale length, horizontal vs. vertical scale orientation, use of modifiers, numeric labels, and number of response options, on survey response and data quality.
(2) The development and assessment of response scales across different data collection modes.
(3) The use of psychometric and statistical measures for evaluating response scales, for example, item characteristic curves, differential item functioning, item invariance, and different measures of reliability and validity (see the sketch following this list).
(4) Approaches for determining scale measurement invariance across different modes and devices (e.g., mobile).
(5) Comparisons of item-by-item versus matrix questions.
(6) Research showing the impact of different modifiers (for example, “a little” vs. “somewhat”).
(7) Exploration of differential item functioning and item invariance for varying item response scales.
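
To make the psychometric evaluation named in point (3) concrete, the following is a minimal sketch, not drawn from any paper in this session, of how item characteristic curves under a two-parameter logistic (2PL) model can be compared across two groups to illustrate differential item functioning. All parameter values are hypothetical.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Item characteristic curve under a two-parameter logistic (2PL) model:
    the probability of endorsing an item given latent trait theta,
    discrimination a, and location (difficulty) b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical parameter estimates for the same item in two groups.
theta = np.linspace(-3, 3, 61)                 # latent trait grid
p_group_a = icc_2pl(theta, a=1.2, b=-0.3)
p_group_b = icc_2pl(theta, a=1.2, b=0.4)       # shifted location -> uniform DIF

# A crude DIF summary: the largest gap between the two curves over the grid.
print(f"Max ICC difference: {np.max(np.abs(p_group_a - p_group_b)):.3f}")
```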

Keywords: response scales, scale development, psychometrics, item response theory, confirmatory factor analysis

Grids versus Item-by-Item Designs on Item Batteries for Self-Administered Mixed-Mode, Mixed-Device Surveys

Dr Kristen Olson (University of Nebraska-Lincoln) - Presenting Author
Dr Jolene Smyth (University of Nebraska-Lincoln)
Ms Angelica Phillips (University of Nebraska-Lincoln)

With surveys increasingly completed on mobile devices, how to ask battery questions on these devices is an important design decision. One open question is whether battery questions, which usually contain items that constitute a scale, should be displayed in a grid or item-by-item, with each item displayed individually, and whether this display should differ by mode and device. Previous research focuses primarily on web panel members, ignoring those who answer by mail in mixed-mode studies. There is surprisingly little research comparing how respondents answer grid items in web versus mail modes (Kim et al. 2018). Within the web mode, grid formats on a smartphone sometimes yield higher nondifferentiation rates than an item-by-item design or than grids on personal computers (Stern et al. 2016). In other studies, grids on computers (Lugtig and Toepoel 2016) or item-by-item formats displayed on a computer (Keusch and Yan 2016) yield more nondifferentiated answers. In this paper, we compare data quality across four different batteries from a general population web-push mixed-mode survey (AAPOR RR2=28.1%, n=2705). Sample members were randomly assigned to receive batteries either in a grid or as individual items. Respondents could respond by mail or by web, using either a computer or a mobile device, allowing all formats to be measured in all modes/devices. We examine data quality across formats and modes/devices on four outcomes: item nonresponse, nondifferentiation, inter-item correlations, and scale reliability. Preliminary analyses indicate that the grid format, compared to the item-by-item format, produces less nondifferentiation on mobile devices, more on computers, and no difference on mail. This holds when accounting for respondent characteristics. We explore the sensitivity of our conclusions to different measures of nondifferentiation. We conclude with recommendations for practice and future research.
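
As an illustration of two of the data-quality outcomes examined in this paper, here is a minimal sketch, assuming a pandas DataFrame `battery` whose columns are the items of one grid battery (the DataFrame, its column names, and the toy values are hypothetical, not the authors' data): it computes a simple nondifferentiation measure (the per-respondent standard deviation across items plus a straightlining flag) and Cronbach's alpha for scale reliability.

```python
import pandas as pd

def nondifferentiation(battery: pd.DataFrame) -> pd.DataFrame:
    """Per-respondent nondifferentiation indicators for one battery:
    the standard deviation across items (lower = less differentiation)
    and a flag for straightlining (identical answers to every item)."""
    complete = battery.dropna()
    return pd.DataFrame({
        "sd_across_items": complete.std(axis=1, ddof=1),
        "straightlined": complete.nunique(axis=1) == 1,
    })

def cronbach_alpha(battery: pd.DataFrame) -> float:
    """Cronbach's alpha for the battery items (listwise deletion)."""
    complete = battery.dropna()
    k = complete.shape[1]
    item_variances = complete.var(axis=0, ddof=1).sum()
    total_variance = complete.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical usage: five items rated 1-5 from one grid battery.
# (Item nonresponse per item would simply be battery.isna().mean().)
battery = pd.DataFrame({
    "item1": [1, 3, 5, 4], "item2": [1, 3, 4, 4],
    "item3": [1, 2, 5, 4], "item4": [1, 3, 4, 4], "item5": [1, 3, 5, 4],
})
print(nondifferentiation(battery))
print(f"Cronbach's alpha: {cronbach_alpha(battery):.2f}")
```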


The Effects of Response Format on Data Quality in Personality Tests: Matrix vs. Fill-in

Mrs Ragnhildur Lilja Asgeirsdottir (Faculty of Psychology, University of Iceland; Methodological Research Center, University of Iceland) - Presenting Author
Dr Vaka Vésteinsdóttir (Methodological Research Center, University of Iceland; Research Methods, Assessment, & iScience, Department of Psychology, University of Konstanz)
Professor Ulf-Dietrich Reips (Research Methods, Assessment, & iScience, Department of Psychology, University of Konstanz)
Dr Fanney Thorsdottir (Faculty of Psychology, University of Iceland; Methodological Research Center, University of Iceland)

Questionnaires are frequently presented in different ways on paper and on the web. While authors of questionnaires often use a fill-in response format on paper, where respondents fill in a number representing their response, the format is often changed to a matrix or grid in web surveys. This applies to personality tests such as the Big Five Inventory (BFI). Previous studies have indicated that matrix formats, as opposed to item-by-item formats, are associated with, for example, higher missing data rates, higher inter-item correlations, and higher levels of straightlining. However, other studies have not found this effect. Furthermore, there is currently a lack of research comparing the matrix format to the fill-in response format. The purpose of this study was to examine the effects of response format (matrix vs. fill-in) on data quality using the BFI in a probability-based panel of the general population (N=272). Data quality was examined in terms of item nonresponse, acquiescence, and reliability. The results indicate that although response format does seem to have an effect on data quality, the effects appear to be weak. The implications of the results will be discussed further.
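
For the acquiescence outcome mentioned above, one common approach with a balanced inventory such as the BFI, which mixes positively and negatively keyed items, is to count agreement regardless of item keying. The sketch below is a minimal illustration under that assumption; the DataFrame, column names, and values are hypothetical and not taken from this study.

```python
import pandas as pd

def acquiescence_index(responses: pd.DataFrame, agree_codes=(4, 5)) -> pd.Series:
    """Per-respondent acquiescence: the proportion of answered items falling in
    an 'agree' category (here 4 or 5 on a 1-5 scale), ignoring item keying.
    On a balanced scale, high values suggest agreement regardless of content."""
    is_agreement = responses.isin(agree_codes)
    return is_agreement.sum(axis=1) / responses.notna().sum(axis=1)

# Hypothetical BFI-style responses on a 1-5 agreement scale.
responses = pd.DataFrame({
    "bfi_01": [5, 2, 4], "bfi_02r": [5, 2, 1],   # 'r' marks a reverse-keyed item
    "bfi_03": [4, 3, 5], "bfi_04r": [4, 1, 2],
})
print(acquiescence_index(responses))
```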


Agree or Disagree: What Came First?

Ms Carmen María León (University of Castilla-La Mancha) - Presenting Author
Dr Eva Aizpurua (Trinity College Dublin)
Ms Sophie van der Valk (Trinity College Dublin)

Response order effects refer to the impact on survey responses that arises from varying the order of the response options. Previous research has documented two types of effects, known as primacy and recency effects (Krosnick & Alwin, 1987). Primacy effects occur when response options presented earlier are selected more often than those presented later. Recency effects, on the contrary, occur when response options presented later are more likely to be selected. These effects have been widely studied with unordered categorical response options. However, few studies have examined response order effects with ordinal scales, despite their extensive use in survey research. We contribute to filling this gap by analysing the effects of varying the direction of fully-labeled rating scales on survey responses. To do so, a split-ballot experiment was embedded in an online survey conducted in Spain (N = 1,000). Respondents were randomly assigned to one of two groups, which received the questions either in the original order (from "strongly disagree" to "strongly agree") or in the reversed one (from "strongly agree" to "strongly disagree"). The results of the study are presented, and the implications and recommendations for future research are discussed.
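
The comparison implied by such a split-ballot design can be illustrated with a minimal sketch: responses to one fully labeled agree-disagree item are cross-tabulated by the randomly assigned scale direction and the two distributions are compared with a chi-square test. The counts below are invented for illustration and are not the authors' results.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts for one 5-point item, recoded so that in both conditions
# category 1 = "strongly disagree" ... category 5 = "strongly agree".
#                  cat1  cat2  cat3  cat4  cat5
counts = np.array([
    [40,  90, 150, 160, 60],    # ascending version (disagree -> agree)
    [55, 110, 140, 140, 55],    # descending version (agree -> disagree)
])

# A significant test statistic would indicate that the response distribution
# depends on the direction in which the scale was presented.
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_value:.3f}")
```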


Evaluating Qualifiers in Rating Scales

Dr Morgan Earp (US Bureau of Labor Statistics) - Presenting Author
Dr Jean Fox (US Bureau of Labor Statistics)
Dr Robin Kaplan (US Bureau of Labor Statistics)

Researchers in many domains use surveys to standardize data collection across respondents. These surveys often ask respondents to rate some construct, such as satisfaction or importance, typically using a 5- or 7-point scale. These scales use a variety of qualifiers, such as “Very,” “Quite a bit,” “Somewhat,” “A little,” or “Not at all.” However, these qualifiers are subjective, and respondents may interpret them differently. Additionally, it is often unclear whether respondents are considering the full range of the response options, or whether they can distinguish between adjacent response categories. To investigate these questions, we conducted two phases of research. First, we used Item Response Theory (IRT) to evaluate whether a variety of scales effectively captured the full range of the construct. The surveys we used were mostly customer satisfaction surveys, which included constructs such as satisfaction, effort, and ease of use. They spanned three categories of qualifiers: Strength/Intensity (e.g., Not at all, Somewhat, Very), Frequency (e.g., Never, Sometimes, Occasionally), and Evaluation (e.g., Good, Neutral, Bad). We present and compare IRT item step thresholds across latent constructs and qualifier categories. In the second phase of this research, we asked participants (N=600) to complete an online survey in which they assigned values on a scale from 0 to 100 to different qualifiers, indicating “how much” they felt each qualifier represented. For example, “Extremely” might receive a high rating while “none at all” might receive a low rating. Reviewing these values should help researchers generate scales that better measure the full range of a construct and determine when qualifiers are not noticeably different from one another. We discuss the implications of this research for the design of valid and reliable survey response scales.
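
To show what the IRT item step thresholds reported in the first phase refer to, here is a minimal sketch of category response probabilities under a graded response model; the discrimination and threshold values are hypothetical, not the authors' estimates.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under Samejima's graded response model.
    `thresholds` are the ordered step thresholds b_1 < ... < b_{K-1}:
    the latent trait levels at which P(response >= category k) = 0.5."""
    theta = np.atleast_1d(theta)
    # Cumulative probabilities P(X >= k) for k = 1..K-1, padded with 1 and 0.
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(thresholds))))
    cum = np.hstack([np.ones((len(theta), 1)), cum, np.zeros((len(theta), 1))])
    return cum[:, :-1] - cum[:, 1:]      # P(X = k) for each category

# Hypothetical 5-point satisfaction item ("Not at all" ... "Very").
thresholds = [-1.8, -0.6, 0.4, 1.5]
probs = grm_category_probs(theta=[-2.0, 0.0, 2.0], a=1.4, thresholds=thresholds)
print(np.round(probs, 2))                # one row per theta, one column per category
```

Closely spaced thresholds would suggest that adjacent qualifiers are not well distinguished, which is the kind of pattern the comparison across qualifier categories is meant to reveal.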


New Mobile-Friendly Labels: Using Numbers as Labels

Professor Randall Thomas (Ipsos) - Presenting Author
Dr Frances Barlas (Ipsos)

Fully-anchored, semantically labeled scales have been found to have somewhat higher levels of validity than end-anchored scales (Krosnick, 1999). However, since most smartphone respondents take online surveys in portrait orientation, even the horizontal presentation of five categories with full semantic labels can extend off the screen. We developed an alternative for smartphone screens: replacing all semantic labels with numeric labels as clickable buttons to anchor the responses, with plus or minus indicators (e.g., How much do you like doing X? -2 -1 0 +1 +2). While researchers often provide end labels for respondents, either in the response field or in a respondent instruction in the item stem (e.g., On a scale of 0 to 10, where ‘0’ means ‘Do not like’ and ’10’ means ‘Strongly like’…), we eliminated all semantic labels, allowing us to save screen space. Our interest was in whether respondents, left to their own interpretations of the meaning and spacing of the responses, would use the numbers equivalently to semantically labeled responses. In this paper, we summarize four studies in which we randomly assigned respondents to receive either a semantic or a numeric response format, employing a wide variety of scale types, including affective, evaluative, intensity, and behavioral measures. We found that numeric formats were completed more quickly and showed response distributions and validity equivalent to those of the semantic formats. In one experiment, we compared the use of a respondent instruction that defined the end points (Use a scale of 1 to 5, where 1=Not at all important and 5=Very important.) versus no respondent instruction. With the instruction, respondents treated the scale as an end-anchored scale, whereas without it the response distribution was more similar to that of the semantic format. We discuss future applications and some limitations we have encountered in our exploration of the numeric format.
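
A minimal sketch of the kind of equivalence check described here: responses from the numeric (-2 to +2) condition are mapped onto the same 1-5 coding as the semantic condition and the two distributions are compared. The data and the mapping are assumptions for illustration, not the authors' materials or results.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical responses to the same item under the two randomized formats.
numeric_raw = np.array([-2, -1, -1, 0, 0, 0, 1, 1, 2, 2, 1, 0])   # -2..+2 buttons
semantic    = np.array([ 1,  2,  3, 3, 3, 4, 4, 4, 5, 2, 3, 4])   # 1..5 labels

numeric = numeric_raw + 3            # map -2..+2 onto the 1..5 coding

# Compare the two response distributions; a large p-value is consistent with
# the formats producing equivalent distributions (toy data only).
stat, p_value = ks_2samp(numeric, semantic)
print(f"KS statistic = {stat:.2f}, p = {p_value:.3f}")
print(f"means: numeric = {numeric.mean():.2f}, semantic = {semantic.mean():.2f}")
```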