ESRA 2017 Programme
Thursday 20th July, 14:00 - 15:30 Room: Q4 ANF3
Occupation coding 2
Chair: Professor Matthias Schonlau (University of Waterloo)
Coordinator 1: Mr Malte Schierholz (IAB)
Session Details
Occupation coding refers to coding a respondent’s text answer (or the interviewer’s transcription of the text answer) about the respondent’s job into one of many hundreds of occupation codes. We welcome any papers on this topic, including, but not limited to:
- measurement of occupations (e.g., mode, question design, …)
- handling of different occupational classifications (e.g., ISCO and national classifications)
- problems of coding (e.g., costs, data quality, …)
- techniques for coding (e.g., automatic coding, computer-assisted coding, manual coding, interview coding)
- computer algorithms for coding (e.g., machine learning, rule-based, …)
- cross-national and longitudinal issues
- measurement of derived variables (e.g., ISEI, ESeC, SIOPS, job-exposure matrices, …)
- other methodological aspects related to occupation coding
Paper Details
1. Changing from manual to automatic coding for Economic Activity and Occupation using previous experience in manual coding
Mr Rui Alves (Statistics Portugal)
At Statistics Portugal, Economic Activity and Occupation are two of the most commonly collected variables in social surveys. They are classified according to the Classificação Portuguesa das Atividades Económicas Rev.3 (CAE) and the Classificação Portuguesa das Profissões 2010 (CPP).
Both classifications have a large number of categories, which in itself poses some challenges to the coding process. Each year, approximately 100,000 responses need coding for both of these variables.
Nonetheless, the biggest challenge remains the data to be classified. In CAPI and CATI interviews, these questions are collected in as much detail as possible, based on descriptions of the occupation/main tasks and main activity/what is done in your work place. This means that data is collected with an open-ended question, without any pre-coded aid or any kind of input restriction, which results in a high diversity of textual descriptions. The same word can be written in a multitude of variations due to spelling errors, (mis)use of abbreviations, capitalisation, accentuation or hyphenation, to name just a few. This is quite understandable, since interviewers input this data “on-the-fly”.
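As an illustration of how some of this surface variation could be reduced before coding, the following is a minimal Python sketch. It is not taken from the paper: the function name `normalize_answer` and the example answers are hypothetical, and the sketch only addresses capitalisation, accentuation and hyphenation, not spelling errors or abbreviations.

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Reduce surface variation in a free-text answer (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case, then turn hyphens and other punctuation into spaces.
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    # Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


# Made-up variants of the same answer, as an interviewer might type them.
variants = [
    "Técnico de Informática",
    "TECNICO DE INFORMATICA",
    "tecnico-de-informatica",
]
normalized = {normalize_answer(v) for v in variants}
print(normalized)
```

All three variants collapse to a single string, so only genuinely different wordings remain to be matched against a dictionary.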
Until now, coding has been done exclusively by a team of coding experts, some with more than 10 years' experience. Since manual coding is nowadays considered both time-consuming and error-prone, Statistics Portugal started exploring automatic coding in order to identify the best solution to implement in social surveys.
This paper / presentation will address three topics: (1) creating and expanding existing dictionaries, (2) making automatic coding accessible, and (3) monitoring performance and providing useful data to validate results and improve performance.
Taking advantage of a database of more than 500,000 manually coded answers, collected over a 5-year period, it was possible to compute distance metrics between strings from the dictionaries and strings written by interviewers to describe both Economic Activity and Occupation. For this purpose, the stringdist R package by M.P.J. van der Loo (2014) was used. This package provides, among other algorithms, the optimal string alignment distance (an extension of the Levenshtein distance that allows for transpositions of adjacent characters). The algorithm performed very well, and it was possible to expand the original dictionaries by 30% to 40% with strings written by interviewers.
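The optimal string alignment distance can be sketched in a few lines. The Python version below is an illustrative re-implementation, not the stringdist package and not the code used at Statistics Portugal. The dictionary entries, the placeholder codes (`X01`, ...) and the `max_distance` threshold are assumptions made for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_entry(answer, dictionary, max_distance=2):
    """Return (entry, code, distance) for the nearest entry, or None."""
    best = None
    for entry, code in dictionary.items():
        dist = osa_distance(answer, entry)
        if dist <= max_distance and (best is None or dist < best[2]):
            best = (entry, code, dist)
    return best


# Hypothetical dictionary: the codes are placeholders, not real CPP codes.
dictionary = {"motorista": "X01", "professor": "X02", "enfermeiro": "X03"}
match = closest_entry("motorsita", dictionary)
print(match)
```

In this sketch, an interviewer string that falls within the threshold of an existing entry is a candidate for being added to the dictionary under that entry's code. A transposed pair of letters, as in the example, counts as a single edit.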
In a subsequent step, dictionaries were expanded with data from those previously coded answers.
In order to make automatic coding accessible, an R package, INEautoclass, was created to classify both Economic Activity and Occupation at a 2- or 3-digit level of detail. The dictionaries themselves are part of the package, as is useful documentation. The package is able to code 57% of all answers that come from the Labour Force Survey.
Before entering production mode, it is vital to monitor performance and provide useful data to validate results and improve performance. Hence, an RMarkdown report continuously provides information on the coding percentage, error rates and detailed information on cases where automatic coding does not match human coding.
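The quantities such a report tracks can be sketched as follows. This Python example is an assumption-laden illustration, not the RMarkdown report itself. The function `coding_report`, the record layout and the placeholder codes are hypothetical. The error rate is taken here to mean disagreement with the human code among automatically coded answers.

```python
def coding_report(records):
    """Summarise automatic coding against human coding (hypothetical helper).

    records: iterable of (answer, auto_code, human_code) tuples, where
    auto_code is None when the automatic coder left the answer uncoded.
    """
    records = list(records)
    coded = [r for r in records if r[1] is not None]
    mismatches = [r for r in coded if r[1] != r[2]]
    return {
        "n": len(records),
        # Share of answers that received an automatic code.
        "coding_rate": len(coded) / len(records) if records else 0.0,
        # Share of automatically coded answers that disagree with the human code.
        "error_rate": len(mismatches) / len(coded) if coded else 0.0,
        # Cases to inspect by hand.
        "mismatches": mismatches,
    }


# Made-up records with placeholder codes.
records = [
    ("motorista de pesados", "X01", "X01"),
    ("professor", "X02", "X02"),
    ("enfermeiro chefe", "X03", "X04"),
    ("trabalha num escritorio", None, "X05"),
]
report = coding_report(records)
print(report["coding_rate"], report["error_rate"])
```

Keeping the mismatching cases alongside the rates makes it possible to see whether the disagreements come from the dictionary, the matching step or the human coding.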
2. Three Methods for Occupation Coding based on Statistical Learning
Mr Hyukjun Gweon (University of Waterloo)
Professor Matthias Schonlau (University of Waterloo)
Dr Lars Kaczmirek (GESIS)
Mr Michael Blohm (GESIS)
Professor Stefan Steiner (University of Waterloo)
Occupation coding refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually at great expense. We propose three methods for automatic coding: combining separate models for the detailed/aggregate occupation codes, a hybrid method combining a duplicate-based approach with statistical learning, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist.
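The abstract does not spell out the methods, so the following Python sketch shows only the general idea of a hybrid coder: reuse the codes of exact duplicates where they exist, and fall back to a learned model otherwise. It is not the authors' method. The class `HybridCoder`, the training answers and the codes are made up, and the fallback is a simple token-overlap nearest neighbour standing in for a real statistical learning algorithm.

```python
from collections import Counter, defaultdict


class HybridCoder:
    """Duplicate lookup with a fallback classifier (illustrative only)."""

    def __init__(self, training):
        # training: list of (answer_text, code) pairs coded by humans.
        self.training = [(text.lower().strip(), code) for text, code in training]
        self.by_text = defaultdict(Counter)
        for text, code in self.training:
            self.by_text[text][code] += 1

    def _fallback(self, text):
        # Stand-in learner: the code of the training answer with the
        # highest Jaccard overlap of tokens, or None if nothing overlaps.
        tokens = set(text.split())
        best_code, best_score = None, 0.0
        for train_text, code in self.training:
            train_tokens = set(train_text.split())
            union = tokens | train_tokens
            score = len(tokens & train_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_code, best_score = code, score
        return best_code

    def predict(self, answer):
        text = answer.lower().strip()
        if text in self.by_text:
            # Exact duplicate: take the most frequent code among duplicates.
            return self.by_text[text].most_common(1)[0][0], "duplicate"
        return self._fallback(text), "fallback"


# Made-up training data with placeholder codes.
coder = HybridCoder([
    ("nurse", "A"), ("nurse", "A"), ("nurse", "B"),
    ("primary school teacher", "C"),
    ("truck driver", "D"),
])
print(coder.predict("Nurse"))
print(coder.predict("school teacher"))
```

The example also shows why duplicates alone are not error-free: the same answer text can carry different human codes, which is the situation a combined approach tries to improve on.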