
ESRA 2023 Glance Program

All time references are in CEST

Occupation Coding

Session Organisers Dr Malte Schierholz (LMU Munich)
Mr Jan Simson (LMU Munich)
Ms Olga Kononykhina (LMU Munich)
Time Thursday 20 July, 16:00 - 17:30
Room U6-01a

Occupation coding refers to coding a respondent’s text answer (or the interviewer’s transcription of the text answer) about the respondent’s job into one of many hundreds of occupation codes. We welcome any papers on this topic, including, but not limited to:
- measurement of occupations (e.g., mode, question design, …)
- handling of different occupational classifications (e.g., ISCO, ESCO and national classifications)
- problems of coding (e.g., costs, data quality, …)
- techniques for coding (e.g., automatic coding, computer-assisted coding, manual coding, interview coding)
- computer algorithms for coding (e.g., machine learning, rule-based, …)
- cross-national and longitudinal issues
- measurement of derived variables (e.g., ISEI, ESeC, SIOPS, job-exposure matrices, …)
- other methodological aspects related to occupation coding

Keywords: coding, measurement, occupation


Do national coding indexes code the same occupations similarly?

Dr Kea Tijdens (WageIndicator Foundation, University of Amsterdam) - Presenting Author

Current coding practices regarding the occupation question in surveys are typically performed by a national survey agency that derives the codes from national coding indexes, which classify detailed occupational titles into 4-digit ISCO-08. These indexes are typically prepared by national statistical offices (NSOs). For a multi-country comparison of coding activities, the question arises whether the same 5-digit occupational titles are coded into the same 4-digit ISCO-08 units. Note that ISCO-08 is available in English only. Occupational coding in multi-country surveys is basically a black box: are the same occupations coded similarly across countries? Our study aimed at a cross-country validation of occupational indexes. We build on a previous study for which we collected ISCO-08 coding indexes beyond the 4-digit level from 20 NSOs (Tijdens and Kaandorp, DOI:10.13094/SMIF-2018-00007, SERISS-D8.4). Two research objectives were central: To what extent were titles in the indexes coded similarly, when comparing their English translations? What percentage of similar occupational titles was coded similarly across countries? We merged 20 coding indexes (18 non-English), resulting in 70,489 records, of which 10.3% had non-existent codes. The remaining database of 60,559 records was translated into English, choosing the better translation from DeepL or Google Translate (4.2% could not be translated). 19,044 records (32%) had at least one duplicate title, which could be aggregated into 5,350 occupational titles. Only 64% of these titles had the same ISCO-08 4-digit code, 70% agreed at 3 digits, 74% at 2 digits, and 80% at 1 digit. We added coding indexes from South Africa, Pakistan and India, in which the unskilled and semi-skilled titles show a larger division of labour and the (highly) skilled titles are far less detailed.
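The digit-level agreement figures above can be reproduced with a simple truncation check: two ISCO-08 codes agree at n digits if their first n characters match. The following is a minimal sketch under invented codes; the titles and code pairs are illustrative only, not data from the study.

```python
# Hypothetical sketch of the digit-level agreement check: given occupational
# titles coded by two national indexes, compare their ISCO-08 codes truncated
# to 4, 3, 2 and 1 digits. All code pairs below are invented for illustration.

def agreement_at_digits(pairs, digits):
    """Share of title pairs whose ISCO-08 codes match on the first n digits."""
    matches = sum(1 for a, b in pairs if a[:digits] == b[:digits])
    return matches / len(pairs)

# (country_A_code, country_B_code) for the "same" translated title
pairs = [
    ("2351", "2351"),  # full 4-digit agreement
    ("2352", "2351"),  # agrees only down to 3 digits
    ("2421", "2431"),  # agrees only down to 2 digits
    ("5131", "9412"),  # no agreement at any level
]

for d in (4, 3, 2, 1):
    print(f"{d}-digit agreement: {agreement_at_digits(pairs, d):.0%}")
```

By construction, agreement can only rise (or stay flat) as codes are truncated further, matching the 64%/70%/74%/80% pattern reported above.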

Software-Assisted Occupational Coding - Evaluation of Coding and Validation Processes with CODI

Mr Malte Schwedes (Leibniz Institute for Educational Trajectories (LIfBi)) - Presenting Author
Mr Gregor Lampel (Leibniz Institute for Educational Trajectories (LIfBi))
Dr Markus Nester (Leibniz Institute for Educational Trajectories (LIfBi))

Coding of open information on occupations, industries or tasks is time-consuming and hence expensive when done manually, which is still common. To make coding more efficient, a tool ("CODI") was developed, which utilizes and enriches the potential of the metadata base already established at the LIfBi. The tool provides the coders with easy but controlled access to the open material via a server-client application. Additional information (e.g., occupational status) can be presented to the coder, and powerful suggestion features have been implemented to facilitate the coding procedures. Furthermore, the tool and the database allow the administration and supervision of the coding staff and coding jobs, and provide gateways for reporting on progress and potential problems. The software is currently being reworked and is set to be made publicly available to the scientific community by the end of 2023.
The aim is to analyze the benefits software-assisted coding can offer, but also to show the potential problems with regard to data quality and how coding and validation processes can help with this issue.
We test the coding and validation processes the software offers with regard to coding efficiency and data quality. For this we use data coded for different projects at the LIfBi. Our evaluations contain metrics regarding the speed of coding, the amount of automatically coded data in a typical dataset, and the quality of the software-assisted coding processes. For the latter, we evaluate the codes the software proposes in comparison to manually coded data. Manual coding was performed once with and once without the suggestion algorithm. By coding the same batch of data with these three methods we can not only compare the raw datasets but also assess the impact the differences make in actual use of the data (e.g., prestige scores and socioeconomic indexes).
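The downstream check sketched above — whether coding differences matter in actual use — amounts to mapping each method's codes onto a derived score and comparing the results. The snippet below illustrates the idea with entirely invented ISCO codes and score values (the `FAKE_ISEI` lookup is a made-up stand-in for a real prestige or socioeconomic index):

```python
# Sketch of a downstream-impact check: map each coding method's ISCO codes
# to a derived score and compare the resulting batch means. All codes and
# score values here are invented for illustration.

FAKE_ISEI = {"2512": 71, "5131": 27, "2359": 65, "2351": 70, "8344": 32}

def mean_score(codes):
    """Average derived score over a batch of coded records."""
    return sum(FAKE_ISEI[c] for c in codes) / len(codes)

# The same four records, coded manually and with algorithmic suggestions;
# the two methods disagree on the third record.
manual       = ["2512", "5131", "2351", "8344"]
with_suggest = ["2512", "5131", "2359", "8344"]

print(f"manual:           {mean_score(manual):.2f}")
print(f"with suggestions: {mean_score(with_suggest):.2f}")
```

A small code-level disagreement may shift the derived score only slightly, or substantially, depending on where in the classification the disagreement falls — which is exactly what the three-way comparison is designed to quantify.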

A Holistic Toolbox for Interactive Occupation Coding in Surveys

Mr Jan Simson (LMU Munich) - Presenting Author
Mrs Olga Kononykhina (LMU Munich)
Dr Malte Schierholz (LMU Munich)

There are a multitude of different ways to earn a living, which is why people’s occupations are almost as diverse as people themselves. One of the classic issues encountered when working with occupational data is therefore the vast heterogeneity of occupations people have. To address this problem, a variety of classifications have been developed, such as the International Standard Classification of Occupations (ISCO) or the German Klassifikation der Berufe (KldB), narrowing the number of occupation categories down to a more manageable size.

This leads to a different problem: the coding of occupations into standardized categories is a time-intensive process plagued by reliability issues. To date, the standard approach to coding occupations is to collect free-text responses and have specifically trained personnel sit down with the classification manual, possibly assisted by computer software, to assign each category by hand post hoc. Since coding typically occurs after data collection and with limited information, the assignment of categories is often ambiguous and unreliable.

Here we present a new instrument which implements a faster, more convenient and interactive occupation coding workflow in which interviewees are included in the coding process. Using the best-performing algorithm from a previous comparison of machine learning models, a list of suggested occupation categories from the Auxiliary Classification of Occupations (AuxCO) is generated, one of which is to be selected by the interviewee. Ambiguity within occupational categories is addressed with clarifying follow-up questions. The instrument is implemented as part of a flexible toolbox covering the whole process from data collection and questionnaire design to the final coding of occupations into KldB and ISCO. Anonymized training data and pre-trained machine-learning models for Germany are provided as part of the toolbox.
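The suggest-and-select workflow described above can be sketched in a few lines. Everything in this snippet is hypothetical — the AuxCO identifiers, the ISCO mapping, and the `scores` dictionary standing in for a trained model's predicted probabilities are invented for illustration and do not reflect the toolbox's actual API:

```python
# Minimal sketch of the interactive coding workflow: rank ML-suggested AuxCO
# categories, let the interviewee pick one, then map the choice to an official
# code. All identifiers, scores and mappings below are invented.

AUXCO_TO_ISCO = {  # hypothetical AuxCO id -> ISCO-08 code crosswalk
    "aux-0101": "2512",
    "aux-0102": "2513",
    "aux-0103": "3512",
}

def suggest_categories(scores, top_n=3):
    """Rank candidate AuxCO categories for a free-text answer.
    `scores` stands in for the ML model's predicted probabilities."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [aux_id for aux_id, _ in ranked[:top_n]]

def code_selection(aux_id):
    """Map the interviewee's chosen AuxCO description to an ISCO-08 code."""
    return AUXCO_TO_ISCO[aux_id]

# The interviewee hears the suggested descriptions and picks the first one:
suggestions = suggest_categories({"aux-0101": 0.7, "aux-0103": 0.2, "aux-0102": 0.1})
print(suggestions[0], "->", code_selection(suggestions[0]))
```

The key design point is that the final KldB/ISCO code follows deterministically from the interviewee's AuxCO selection, so the ambiguity is resolved during the interview rather than by a coder afterwards.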

The open-source toolbox is available at

Evaluating Interviewer-Respondent Dynamics in Interactive Occupation Coding

Ms Olga Kononykhina (Ludwig-Maximilians-Universität München) - Presenting Author
Mr Jan Simson (Ludwig-Maximilians-Universität München)
Mr Malte Schierholz (Ludwig-Maximilians-Universität München)

Various approaches to automating occupation coding have been developed in recent years in response to the limitations of manual coding. Interactive coding methods are particularly attractive; their advantages include speed, flexibility regarding the input data, and reduced costs for post-survey coding.

Results of such applications are promising. However, there are methodological challenges to address. Introducing an ML component into a survey means that interviewers will not know in advance what answer options the tool will offer, which might put an additional burden on them. For the respondent, ML-generated answer options might mean having to select an answer that is only partially appropriate, or choosing among several answers that all seem applicable. Occupation is also multidimensional, combining skills, training and duties, which translates into lengthy job descriptions respondents must choose from. A comparison with manual coding showed low levels of agreement with interactive coding, possibly due to such challenges. This calls for a better understanding of the interactive processes that happen during the interview.

As part of the testing of our new interactive occupation coding tool we conducted a CATI survey with 1,455 responses. Audio recordings are available for 669 of them. Two coders used a behavioural coding technique to analyse primary and secondary interactions between interviewers and respondents.

The results show that the user experience of the tool is rather positive: most interviewers followed the script, and respondents understood the answer options and usually found a suitable one. We still observed instances of miscommunication, e.g., respondents answering before hearing all answer options. The effectiveness of the tool, however, should be examined further, as interviewers skipped some generated answer options, and a sizable fraction of respondents indicated that more than one option suited their occupation or that none of the options were suitable.

Analyzing professional coders’ views regarding the Auxiliary Classification

Dr Malte Schierholz (LMU Munich) - Presenting Author
Mr Jan Simson (LMU Munich)
Ms Olga Kononykhina (LMU Munich)

There is often a middle step in occupation coding: coders or respondents do not directly select categories from the official classifications (KldB/ISCO), but instead select fitting job titles, which are then mapped to KldB/ISCO. To replace job titles, which can be vague and ambiguous, the interactive occupation coding toolbox implements the Auxiliary Classification of Occupations (AuxCO), a list of 1,226 brief and concise job descriptions from which respondents can select the most appropriate one.

During development of AuxCO, the long definitions from the official classifications were essentially summarized. This should give a unique mapping between AuxCO and both official classifications, such that selecting the correct job description during the interview also implies the selection of the correct occupational category.

A striking result from our most recent data collections is that the disagreement between office coding and interview coding is high: More than 40% of the respondents selected an AuxCO job description whose associated KldB/ISCO category is not identical with the one selected by professional coders.

We evaluate the quality of the AuxCO answer options. To do so, we asked two professional coders with several years of experience to code the job descriptions from AuxCO back into KldB 2010 and ISCO-08. We suspect that the frequent mismatch between office coding and interview coding is partly the result of mapping issues between AuxCO and the official classifications.
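The back-coding check described above reduces to comparing, for each AuxCO description, the code a professional coder assigns against the code the AuxCO crosswalk expects. A minimal sketch, with invented identifiers and codes throughout:

```python
# Sketch of the back-coding evaluation: for each AuxCO job description,
# compare the KldB/ISCO code a professional coder assigned with the code the
# AuxCO crosswalk expects. All identifiers and codes below are invented.

def mismatch_rate(crosswalk, coder_codes):
    """Share of AuxCO descriptions where the coder disagrees with the crosswalk."""
    mismatches = sum(
        1 for aux_id, expected in crosswalk.items()
        if coder_codes.get(aux_id) != expected
    )
    return mismatches / len(crosswalk)

crosswalk = {"aux-01": "2512", "aux-02": "5131", "aux-03": "2351"}  # expected codes
coder_a   = {"aux-01": "2512", "aux-02": "5132", "aux-03": "2351"}  # coder's re-codes

print(f"coder A mismatch rate: {mismatch_rate(crosswalk, coder_a):.0%}")
```

Descriptions that both experienced coders re-code away from the crosswalk are candidates for genuine mapping problems, as opposed to ordinary coder disagreement.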

We illustrate the mechanisms that lead to problematic AuxCO job descriptions and discuss the implications more generally. While some of the problems stem from the construction of the official classifications, others stem from AuxCO principles being misaligned with professional coders’ practice. We will use these results for a more general, empirically founded discussion of AuxCO. In addition, the analysis provides new insights into long-standing issues with official occupational classifications.