Thursday 20th July, 11:00 - 12:30 Room: Q4 ANF3

Occupation coding 1

Chair Professor Matthias Schonlau (University of Waterloo)
Coordinator 1 Mr Malte Schierholz (IAB)

Occupation coding refers to coding a respondent’s text answer (or the interviewer’s transcription of the text answer) about the respondent’s job into one of many hundreds of occupation codes. We welcome any papers on this topic, including, but not limited to:
- measurement of occupations (e.g., mode, question design, …)
- handling of different occupational classifications (e.g., ISCO and national classifications)
- problems of coding (e.g., costs, data quality, …)
- techniques for coding (e.g., automatic coding, computer-assisted coding, manual coding, interview coding)
- computer algorithms for coding (e.g., machine learning, rule-based, …)
- cross-national and longitudinal issues
- measurement of derived variables (e.g., ISEI, ESeC, SIOPS, job-exposure matrices, …)
- other methodological aspects related to occupation coding

1. Occupation Coding in The German Health Update (GEDA-Study 2014/15)
Mr Stefan Albrecht (Robert Koch Institute)
Mrs Marike Varga (Robert Koch Institute)
Mr Patrick Schmich (Robert Koch Institute)

The German Health Update (GEDA study) 2014/15 is a representative study of randomly selected members of the German population (N > 20,000). One part of the written questionnaire gathered information on the occupation and work activities of the participants, which were collected as open-ended text answers. Overall, the survey yielded 14,826 open-ended answers to code.
These answers were to be coded into the German Classification of Occupations 2010 (KldB 2010). For this purpose, the Robert Koch Institute developed in-house software for computer-assisted coding. The software is based on a database of the German Federal Employment Agency. The coding process followed a results report of the “Mikrozensus”. In addition to the open-ended occupation answers, standardized questions provided further information about the occupation, such as education, sector or occupational status. The coding was done by two students. To ensure data quality, 20% of all cases (N = 3,000) were coded twice. In 30% of the double-coded cases, the two codes differed. Because KldB codes are not always clearly distinguishable from one another, a second quality check was implemented in which the discrepant codes were evaluated. 91.1% of the initially assigned codes were correct.
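The double-coding check described above can be illustrated with a minimal sketch. It assumes that agreement is measured as the share of double-coded cases on which both coders assigned the same code; the function name and the codes are hypothetical placeholders, not real KldB codes or part of the RKI software.

```python
from typing import Sequence


def percent_agreement(first: Sequence[str], second: Sequence[str]) -> float:
    """Share of double-coded cases on which both coders assigned the same code."""
    if len(first) != len(second):
        raise ValueError("both coders must have coded the same cases")
    if not first:
        raise ValueError("no double-coded cases supplied")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# Placeholder codes for five double-coded cases (not real KldB codes).
coder_a = ["A1", "B2", "C3", "D4", "E5"]
coder_b = ["A1", "B2", "C9", "D4", "E5"]

agreement = percent_agreement(coder_a, coder_b)
disagreement = 1 - agreement  # cases that would go to a second quality check
```

Simple percent agreement does not correct for chance agreement; with several hundred categories chance agreement is small, but a chance-corrected measure could be substituted.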
After the KldB coding, the five-digit code served as the basis for a transformation into the International Standard Classification of Occupations 2008 (ISCO-08). With the RKI-internal software, 80% (N ~ 11,900) of all cases could be recoded automatically. For the remaining 20%, a transformation key provided at most four possible ISCO equivalents for each KldB code. After a total of four months, the transformation from open text to KldB and ISCO codes was completed. The data are the basis for a social index for Germany (socio-economic status (SES), Lampert et al. 2013). In addition, the ISCO code enables the Robert Koch Institute to deliver data to the IMEI.
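A transformation of this kind can be sketched as a lookup in a crosswalk table. The sketch below assumes that a case can be recoded automatically when its KldB code has exactly one ISCO candidate and is otherwise passed to a coder together with the candidate list; this rule, the `CROSSWALK` table, the function name and all codes are hypothetical illustrations, not the RKI software or the official transformation key.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical transformation key: each KldB code maps to one or more
# ISCO candidates. The codes are placeholders, not real classification codes.
CROSSWALK: Dict[str, List[str]] = {
    "K-001": ["I-10"],
    "K-002": ["I-20", "I-21"],
    "K-003": ["I-30", "I-31", "I-32", "I-33"],
}


def recode(
    kldb_by_case: Dict[Hashable, str],
) -> Tuple[Dict[Hashable, str], Dict[Hashable, List[str]]]:
    """Split cases into automatically recoded ones and ones needing a manual decision."""
    automatic: Dict[Hashable, str] = {}
    manual: Dict[Hashable, List[str]] = {}
    for case_id, kldb in kldb_by_case.items():
        candidates = CROSSWALK.get(kldb, [])
        if len(candidates) == 1:
            # Unambiguous mapping: recode without human input.
            automatic[case_id] = candidates[0]
        else:
            # Ambiguous or unknown code: hand the candidates to a coder.
            manual[case_id] = candidates
    return automatic, manual


automatic, manual = recode({1: "K-001", 2: "K-002", 3: "K-999"})
```

Here case 1 is recoded automatically, case 2 is returned with two candidates, and case 3 (a code missing from the table) is returned with an empty candidate list.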
Because the whole process is reproducible, costs will decrease in additional studies. Training new coders takes one month. The whole process was documented and recorded in a standard operating procedure (SOP), so that the process can be standardized.

2. Computer Assisted Manual Coding of Occupations - Best Practice from the German National Educational Panel Study (NEPS) and First Results on Differences in Reliability, Productivity and Derived Scales using Alternative Approaches.
Mr Markus Zielonka (Leibniz Institute for Educational Trajectories)
Mr Gregor Czerner (Leibniz Institute for Educational Trajectories)

Coding of open information on occupations, industries or tasks is a time-consuming and hence expensive enterprise when done manually, which is still common. We therefore present the current process developments for coding open survey answers at the research data center of the Leibniz Institute for Educational Trajectories (LIfBi). We give insights into the daily routine of coding for the diverse scenarios on occupations and other open answer formats, especially for the six starting cohorts of the National Educational Panel Study. To handle the huge workload most effectively, a coding tool (“CODI”) was developed, which utilizes and enriches the potential of the already established SQL-based metadata database at the LIfBi. The tool provides easy but controlled access to the open material for the coders via a client-server application. Additional information (e.g. on relevant covariates like occupational status) can be processed and presented to the coder, and powerful suggestion and search mechanisms have been implemented to facilitate the manual coding procedures. Furthermore, the tool and the database allow the administration and supervision of the coding staff and coding jobs and provide gateways for reporting on progress and potential problems.
As we also implemented an option for parallel coding or multi-coder settings on the same material, reliability and productivity tests can easily be carried out.
With this option at hand, we test the standard coding procedure for open occupational information at LIfBi, which relies on the so-called “Dokumentationskennziffer” (DKZ) of the Federal Employment Agency (BA), against an alternative approach using the national classification of occupations (KldB 2010). Since the DKZ is an enriched and permanently updated version of the KldB 2010 (1286 categories) and contains all official occupation and vocational training names processed for job placement activities at the BA (~ 27000), the open questions revolve around the differences in quality and speed between the two procedures given the established technical setting. Furthermore, the differences in validity of derived standard scales on socio-economic status and prestige are to be tested.
To gain reliable results, 1000 randomly drawn open answers on occupations were coded by four trained and experienced coders (500 into DKZ and 500 into KldB 2010 by each coder).
The final 2000 DKZ and 2000 KldB codes were used to derive and compare standard scales on socio-economic status (ISEI08) and prestige (SIOPS 08).

3. Coding and scaling of parental occupations in the European Social Survey R1-R7
Professor Harry B.G. Ganzeboom (Department of Sociology, VU University Amsterdam)

Parental occupations in ESS have been collected in all seven rounds with an open-ended question. Oddly enough, (most of) these occupations have not been coded by the original national data collectors, but are available verbatim as part of the ESS data. Some 450,000 occupational titles of fathers and mothers are available, in around 25 languages. The coding of this information into the International Standard Classification of Occupations 1988 (ISCO-88) for rounds R1-R5 has been finished, and the coding of the next two rounds R6-R7 is currently in progress, as is the upgrade of this information to the new International Standard Classification of Occupations 2008 (ISCO-08).

As the ESS also collects closed-question information on parental occupations, using a showcard, the available data make for a unique and desirable case in which verbatim information on multiple occupations is available together with a strong validation criterion. In this paper I report on how the coding of the parental occupations was accomplished and how the quality of the coded information can be tested using an MTMM (multitrait-multimethod) model, which makes it possible to diagnose and correct for random and systematic measurement error.

Preliminary results indicate:
• Parental occupational status in ESS is measured by the open question with a reliability coefficient of around 0.85, which does not vary much between countries, ESS-rounds or classifications (ISCO88 / ISCO08).
• The initial showcard used in ESS R1-R3 has considerably lower reliability and validity than the showcard that was introduced in R4 and adapted from the ISSP 1987. These results suggest that many respondents initially did not know how to answer, and often resolved their hesitation by choosing the same category for father and mother.
• The showcard used from R4 onwards has better reliability and only slightly lower validity than the open question. However, a major improvement in measurement quality is achieved by using both open and closed questions as multiple indicators in a latent variables model.
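
The idea of using the open and the closed question as multiple indicators can be written down as a simplified measurement model. This is only a sketch under textbook assumptions (all variables standardized, errors uncorrelated with the latent status and with each other); it omits the method factors of a full MTMM model, the symbols are introduced here for illustration, and it is not necessarily the specification used in the paper. The block requires the amsmath package.

```latex
% eta: latent parental occupational status
% x_open, x_closed: observed status from the open and the closed question
\begin{align}
x_{\text{open}}   &= \lambda_{\text{open}}\,\eta + \varepsilon_{\text{open}} \\
x_{\text{closed}} &= \lambda_{\text{closed}}\,\eta + \varepsilon_{\text{closed}} \\
\operatorname{Corr}(x_{\text{open}}, x_{\text{closed}}) &= \lambda_{\text{open}}\,\lambda_{\text{closed}}
\end{align}
```

Under these assumptions the share of variance in each indicator that is due to the latent status is the squared loading, and the observed correlation between the two indicators is attenuated by both loadings, which is why combining the indicators in one model can recover more of the latent status than either question alone.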