Wednesday 19th July, 16:00 - 17:30 Room: Q4 ANF3


Measuring and Coding Complex Items: (Semi-) Automated Solutions 2

Chair: Dr Eric Harrison (City University London)

Session Details

All social surveys rely heavily on socio-demographic information about respondents and design their instruments to collect the fullest, most accurate measurements of it possible. Data about education, ethnicity, occupation, and labour market situation, for both respondents and their families, form the essential backdrop to the attitudes and behaviours measured elsewhere.

But these variables are problematic. They are complex to code, either because there is a huge range of possible answers (occupation), because the context varies enormously between countries (education), or because respondents don't routinely undertake the task of self-categorisation (ethnicity/ancestry). Interviewers have to be trained to explore and probe in order to retrieve the fullest information from the field. This makes socio-demographic data time-consuming and expensive to collect. Often the initial 'first pass' field data must then be recoded by experts into smaller, more sociologically informed schemas.

As interest grows in using self-administered web surveys, we invite papers that report on the development or use of technological solutions to address some of these problems. Contributions addressing any complex categorical variables are welcome. These might include, but not be restricted to, measures of education, occupation, ethnicity, labour market experience, social class, social status, social distance, social networks, or reporting of medical conditions.

While papers will need to define and introduce the problem, the emphasis of this session is on solutions, so submissions will be required to have a practical component and include, where relevant, some demonstration material.

Paper Details

1. The variety of requirements in job advertisements: From semi-automatic to fully automatic detection and classification (coding) into a hierarchical taxonomy.
Mr Manuel Schandock (Research Associate at the Federal Institute for Vocational Education and Training)

In labour market research we face a serious lack of empirical information about ongoing trends in employers' demand for competences, skills, and experience with technical equipment. In Germany a wide range of labour-market-related surveys and process-induced data is available, collected and provided by federal institutions such as the Federal Statistical Office (DESTATIS) and the Institute for Employment Research (IAB). But these data either suffer from a long lag between data collection and data access (surveys) or lack deep information about job requirements (process-induced data). Job advertisements, in contrast, are a rich source of information: they are very up to date, and they are convenient and quite cheap to collect when online sources are used. But there is also a challenge: the vast amount of information in job advertisements is completely unstructured, which means we have to deal with natural language texts and dig for the information before we can analyse the data (data mining).
We will present a complex workflow for mining the tools (technical equipment) with which employers require experience, as stated in job ads (information extraction). We developed a hierarchical taxonomy with nearly 60 items and present a framework for classifying the detected tools into this taxonomy. We evaluated different machine learning algorithms and will introduce our results to the panel. The dataset is based on a large number of job ads hosted by the German Federal Employment Agency from 2011 to 2016, more than 2,000,000 advertisements in total.
We aim to
1. develop an empirically based classification of tools,
2. enrich that classification with a large number of exactly named tools, and
3. put it to productive use in CATI or CAPI questionnaires (automated coding).
The presentation will cover
1. a short introduction to the data we use,
2. a look into the different phrases used to state individual employer requirements,
3. the algorithm for detecting all these different phrases in thousands of job ads,
4. the algorithm for collapsing different phrases into canonical phrases,
5. the machine learning approach for classifying these canonical phrases into a taxonomy of tools, and
6. the quantitative structure of our data.
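
The abstract leaves the implementation open, but the three-step logic (phrase detection, canonicalisation, supervised classification) can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the authors' actual method: the regular expressions, the canonical-phrase mapping, the taxonomy labels, and the choice of a TF-IDF/linear-SVM classifier are all placeholders.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 1: detect candidate tool phrases in raw ad text. These patterns are
# illustrative stand-ins for the paper's detection algorithm.
TOOL_PATTERNS = [
    re.compile(r"(?:experience|skills?) (?:with|in) ((?:[A-Z][\w+#-]*)(?: [A-Z][\w+#-]*)*)"),
    re.compile(r"knowledge of ((?:[A-Z][\w+#-]*)(?: [A-Z][\w+#-]*)*)"),
]

def extract_tool_phrases(ad_text):
    """Return every capitalised phrase that follows a requirement cue."""
    return [m.group(1) for p in TOOL_PATTERNS for m in p.finditer(ad_text)]

# Step 2: collapse wording variants into one canonical phrase (toy mapping).
CANONICAL = {"ms excel": "Microsoft Excel", "excel": "Microsoft Excel"}

def canonicalise(phrase):
    return CANONICAL.get(phrase.lower(), phrase)

# Step 3: classify canonical phrases into a (here: toy) tool taxonomy with a
# supervised text classifier.
train_phrases = ["Microsoft Excel", "SAP ERP", "AutoCAD", "forklift truck"]
train_labels = ["office software", "business software", "CAD software", "vehicles"]
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
classifier.fit(train_phrases, train_labels)

ad = "We expect experience with MS Excel and knowledge of SAP ERP."
tools = [canonicalise(p) for p in extract_tool_phrases(ad)]
print(list(zip(tools, classifier.predict(tools))))
```

Character n-grams are used here because tool and product names rarely share whole words with other phrases in their taxonomy class; whether the paper's best-performing algorithm resembles this is exactly what the evaluation results will show.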


2. Occupation and industry coding in the Next Steps Age 25 Survey
Ms Darina Peycheva (Centre for Longitudinal Studies, UCL Institute of Education)
Mr Matthew Brown (Centre for Longitudinal Studies, UCL Institute of Education)
Ms Anni Oskala (NatCen Social Research)

Questions on occupation and industry are a core feature of many social surveys. The most common approach is to ask respondents a series of open-ended questions in which they describe their main activities in their job and the main activities of the organisation they work for. These responses are subsequently coded by specialist coders into occupational and industry classifications. In the UK the most commonly used coding schemes are the Standard Occupational Classification (SOC2010) and the Standard Industrial Classification (SIC2007).

Accurate coding, however, requires the recorded information to be as detailed as possible. In face-to-face and telephone surveys, interviewers are trained to probe respondents to ensure that sufficient detail is recorded to enable coding at a very low level of aggregation (i.e. into more specific groups of occupations and industries). In web surveys, where no interviewer is present, respondents may provide insufficient information to allow accurate classification.

This paper will describe a trial conducted in the pilot stage of the Next Steps Age 25 Survey. Next Steps (formerly known as LSYPE 1) is a longitudinal study of young people in England born in 1989/1990. The most recent sweep of data collection took place in 2015-2016, when respondents were aged 25/26, and used a sequential mixed-mode design (web, followed by telephone, followed by face-to-face interview). In the study pilot, respondents searched a look-up database using keywords to identify their occupational and industrial codes directly. Telephone and face-to-face interviewers also used the look-up: firstly, it was hoped that this would increase accuracy by allowing respondents to confirm that they agreed with the allocated code; secondly, it would save costs as there would be no need for post-interview coding.

The look-up approach was used alongside the 'standard' open text question approach so that the quality of data collected via the two methods could be compared. All pilot respondents were first asked to describe their job title, the type of work they do, and the type of organisation they work for. Web respondents were then asked to use the look-ups to select a code, whilst in telephone and face-to-face interviews this was done by the interviewer.
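
To make the look-up approach concrete, here is a minimal sketch of a keyword search over a coding index. The codes and titles below are a tiny illustrative excerpt in the style of SOC2010; the actual Next Steps look-up database and its matching rules are not described in the abstract.

```python
# Toy coding index in the style of SOC2010; not the real Next Steps database.
SOC_INDEX = {
    "2136": "programmers and software development professionals",
    "6121": "nursery nurses and assistants",
    "9272": "kitchen and catering assistants",
}

def lookup(query):
    """Return (code, title) pairs whose title contains every query keyword."""
    keywords = query.lower().split()
    return [(code, title) for code, title in SOC_INDEX.items()
            if all(kw in title for kw in keywords)]

print(lookup("catering assistant"))  # [('9272', 'kitchen and catering assistants')]
```

In a live instrument, the respondent (or interviewer) would pick one of the returned entries, which fixes the code directly and removes the need for post-interview coding.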

Separately from the pilot, the SIC/SOC look-up questions underwent usability testing (including eye-tracking) for their overall design and functionality.

Evaluation of the pilot data and the cognitive interviews found that the SOC coding look-up had the potential to generate better quality data and was generally found to be user-friendly. The approach was therefore retained for the main stage of data collection, with respondents and interviewers being routed to the open-question variant if they could not find a suitable code. The SIC coding look-up was found to be less successful and so was not retained for the main stage. The paper will also evaluate the quality of the occupational data collected in the main stage.


3. Self-identification of occupation in web surveys: respondents' choice between autosuggest and search tree
Dr Kea Tijdens (University of Amsterdam / AIAS)

Most surveys use an open-ended question to measure occupation, followed by office coding. This is expensive and time-consuming, and some texts can be coded only at a highly aggregated level or not at all. Alternatively, in web surveys or during the interview, respondents can self-identify their occupation from a large database of coded occupational titles. The size of the database is important for coding quality, given that a national labour market easily has tens of thousands of job titles. The paper details the database.

For many years the worldwide WageIndicator websites on work and wages have applied this self-identification method. In its Salary Check, web visitors can identify their occupation and view the related salaries. In its web survey, respondents are asked to self-identify their occupation. Both applications use the same multilingual database of approximately 1,600 occupational titles, all coded to ISCO-08 at the five-digit level. Users can navigate the database by means of a 3-level search tree or by autosuggest (text string matching), using an API (Application Programming Interface).
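
As an illustration of the two navigation modes, a minimal sketch follows. The titles, codes, and grouping are placeholders, not the real WageIndicator database or its API.

```python
# Illustrative occupation records (ISCO-08-style code, title); placeholders only.
OCCUPATIONS = [
    ("25120", "software developer"),
    ("25130", "web developer"),
    ("22210", "nursing professional"),
]

def autosuggest(typed, limit=10):
    """Text string matching: suggest coded titles containing the typed string."""
    typed = typed.lower()
    return [(code, title) for code, title in OCCUPATIONS if typed in title][:limit]

# The 3-level search tree can be modelled as nested dicts: major group ->
# sub-group -> title, with the occupation code as the leaf value.
SEARCH_TREE = {
    "ICT professionals": {
        "Software and applications developers": {
            "software developer": "25120",
            "web developer": "25130",
        },
    },
}

print(autosuggest("devel"))  # both developer titles match
```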

To explore the use of the autosuggest versus the search tree on desktop and mobile devices, we studied the metadata of the occupation API in the Dutch WageIndicator web survey from 26 June to 3 November 2016. In this datafile each click and each character typed into the autosuggest is registered, resulting in 18,448 records from 2,994 respondents. One in four respondents uses a mobile device (25.6%), and mobile users use the search tree less often (χ² = 57.37, p < .001). Two in three respondents start with the search tree (66.9%).
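
The reported device-by-mode association is a standard chi-square test of independence on a 2x2 table. The sketch below uses invented cell counts chosen only to be consistent with the abstract's marginals (25.6% mobile, 66.9% search tree, 2,994 respondents); the real counts are not given, so the printed statistic will differ from the reported 57.37.

```python
from scipy.stats import chi2_contingency

# Rows: non-mobile, mobile; columns: search tree, autosuggest.
# Cell counts are hypothetical; only the row/column totals match the abstract.
table = [
    [1560, 668],  # non-mobile respondents
    [444, 322],   # mobile respondents
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```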

Do respondents go back and forth in the search tree? Of the 2,004 respondents who started with the search tree, 3% dropped out after one or two clicks, 54% found their occupation in three clicks, 15% went back and forth once, and 28% went back and forth more than once.

Does response time differ across the groups? After excluding drop-outs and outliers (retaining times of 0.1 to 360 seconds), the response times of 2,817 respondents were analysed. Mean response time is equal for autosuggest and search tree (39 sec.), but median response time is higher for the search tree than for the autosuggest (24 versus 16 sec.).

Which respondents drop out? 5% of respondents drop out while self-identifying their occupation. No significant difference was found between mobile and non-mobile users, nor between search tree and autosuggest users. Search tree users are more likely to drop out after five clicks; autosuggest users are more likely to drop out after typing 20 characters.