Thursday 20th July, 16:00 - 17:30 Room: F2 103


Proper and robust multiple imputation of complex data

Chair: Dr Kristian Kleinke (University of Hagen)
Coordinator 1: Professor Martin Spiess (University of Hamburg)
Coordinator 2: Professor Jost Reinecke (University of Bielefeld)

Session Details

Allison (2001) states that the best solution to the missing data problem is prevention. This is especially true for complex data sets such as multilevel data, where missingness may occur at various levels: in the outcome variable(s), in level-1 predictors, in level-2 or even higher-level predictors, and even in the group identifier(s). Many researchers still handle missingness (e.g. in level-1 and level-2 predictors of multilevel data) by excluding the incomplete cases from the analysis, a wasteful practice that may lead to biased inferences. At the same time, none of the currently existing multiple imputation solutions for complex data can be described as optimal: parametric procedures rely rather heavily on strong distributional assumptions, often including homoscedasticity, which are frequently violated in “real life” situations, whereas non- or semiparametric imputation methods often lack theoretical justification. Recent papers that review and contrast various strategies for imputing complex or multilevel data are Drechsler (2015) and Enders, Mistler and Keller (2016). Shortcomings of some imputation techniques, and the consequences of misspecification even in simple data sets, are considered e.g. in de Jong, van Buuren and Spiess (2016) and He and Raghunathan (2009).
All in all, missing data in complex data structures remains a field where much research is still needed. Feasible and robust software solutions need to be developed that work even when empirical data do not exactly follow the convenient statistical distributions assumed by the respective procedures.
We invite colleagues to present their research on multiple imputation solutions for complex data structures (e.g. clustered data, longitudinal data, panel data, cohort-sequential designs). We especially encourage proposals on robust procedures for “non-normal” missing data problems, i.e. situations in which the convenient distributional assumptions of standard MI procedures (normality, homoscedasticity) are violated. Simulation studies that evaluate and compare different MI procedures with regard to their robustness against violated assumptions are also highly welcome.

(For a list of references, see: http://e.feu.de/esra2017)

Paper Details

1. Imputation of missing data by design using neural networks – How to shorten questionnaire length without sacrificing the amount of information collected
Ms Sarah Jensen (University of Wuppertal - Schumpeter School of Business and Economics - Chair of Methods of Economic and Social Research)
Professor Dirk Temme (University of Wuppertal - Schumpeter School of Business and Economics - Chair of Methods of Economic and Social Research)

Research problem
The length of survey questionnaires can negatively impact data quality, e.g. through an increasing amount of missing values due to item or unit nonresponse. Furthermore, long questionnaires can easily make respondents feel bored, which may eventually compromise response quality (Herzog & Bachmann, 1981). Keeping questionnaires at a reasonable length therefore enhances respondents' motivation, decreases the amount of missing values, and improves response quality. Because there is a trade-off between survey length and the number of variables in a questionnaire, the challenging research problem is to keep the questionnaire short without having to sacrifice the amount of information collected.

Research approach
In order to ensure a reasonable questionnaire length without having to forgo relevant information, we propose a multistep approach: First, a full questionnaire including all variables of interest is administered to a training sample drawn from the population. Since the training sample is supposed to be considerably smaller than the final sample, for which a shortened questionnaire will be used, researchers can spend their limited resources on improving response quality (e.g., through incentives). Second, a multilayer feed-forward neural network is fitted to the training data to learn the relationships between all variables included in the full questionnaire (encoded as network weights). In this step, any missing data have to be imputed by multiple imputation algorithms. Third, a shortened questionnaire for the final sample's survey is designed by discarding a certain number of variables from the original one. This procedure leads to a specific pattern of missing data by design (missing values on all cases for the discarded variables). Finally, after collecting the data on the final sample, the trained neural network is fed with the observed variables in order to predict values for the data that are missing by design, as well as for missing values on the remaining variables. Neural networks are advantageous when dealing with various missing data mechanisms (MCAR, MAR and MNAR; Abdella & Marwala, 2005) and with linear as well as non-linear relationships between the variables.
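To make the workflow concrete, here is a minimal sketch (our illustration, not the authors' implementation), assuming scikit-learn's MLPRegressor as the multilayer feed-forward network; the item split and all data are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Training sample (small, complete after MI): X_train holds the items that
# will be retained in the shortened questionnaire, Y_train the items that
# will later be dropped, i.e. missing by design. Placeholder data only.
X_train = rng.normal(size=(300, 20))
Y_train = X_train[:, :5] @ rng.normal(size=(5, 8)) + 0.5 * rng.normal(size=(300, 8))

# Multilayer feed-forward network mapping retained items to dropped items
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
net.fit(X_train, Y_train)

# Final sample: only the retained items were asked; the trained network
# predicts the items that are missing by design.
X_final = rng.normal(size=(5000, 20))
Y_imputed = net.predict(X_final)
```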

Findings
The validity of the suggested approach will be assessed using data from the German General Social Survey (ALLBUS) 2006. First, a relatively small random sample of the original data set will be drawn and used to train the neural network. Second, several variables will be deleted from the data set, leading to missing values by design. Third, the variables omitted by design will be imputed by applying the neural network to the reduced data set. Finally, we will compare the imputed values with the original data using appropriate measures such as the root mean-square deviation. We expect neural networks to produce robust estimates for the variables omitted by design.
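For reference, the root mean-square deviation between the $n$ imputed values $\hat{x}_i$ and the corresponding original values $x_i$ is

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i - x_i\right)^2}.$$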

References
Herzog, A.R., & Bachmann, J.G. (1981). Effects of questionnaire length on response quality. The Public Opinion Quarterly, 45, 549–559.
Abdella, M., & Marwala, T. (2005). The use of genetic algorithms and neural networks to approximate missing data in database. Computing and Informatics, 24, 577–589.


2. Quantile regression based multiple imputation
Dr Kristian Kleinke (University of Hagen)
Professor Mark Stemmler (University of Erlangen-Nuremberg)
Professor Friedrich Lösel (University of Cambridge and University of Erlangen-Nuremberg)

Research by Yu, Burton, and Rivero-Arias (2007) suggests that the use of multiple imputation procedures should be avoided when their (distributional) assumptions do not fit the empirical data at hand.

Unfortunately, the standard multiple imputation (MI) solutions implemented in most statistical packages, such as fully parametric MI under the multivariate normal model (norm; Schafer, 1997) or semi-parametric predictive mean matching (pmm), make assumptions (normality, homoscedasticity) that are often unrealistic in real-life situations. Using inappropriate missing data techniques can lead to biased inferences. Kleinke (in press), for example, has demonstrated that norm and pmm fail when data are too heavily skewed.

Moreover, robust procedures that can cope with various aspects of "non-normality" and heterogeneity are typically not yet implemented in major statistical packages.

Therefore, one purpose of this paper is to familiarize practitioners with available robust MI options.

Secondly, evaluation studies that systematically test the assumed robustness of these procedures are extremely scarce.

The present study addresses this gap: we compared the performance of various standard and robust MI procedures regarding their ability to yield unbiased parameter estimates and standard errors when the empirical data are skewed and heteroscedastic.

Performance was evaluated in a Monte Carlo simulation based on empirical data from the Erlangen-Nuremberg Development and Prevention Project: we predicted physical punishment by fathers of elementary school children from socio-demographic variables, fathers’ aggressiveness, dysfunctional parent-child relations, and various other parenting characteristics (cf. Haupt, Lösel, & Stemmler, 2014). Haupt et al. (2014) compared the results of standard OLS regression with those of more robust regression procedures and deemed quantile regression an appropriate method for analysing these data. Analogously, we assumed that creating multiple imputations under a quantile regression based imputation model would be more appropriate than using regression based MI solutions that rely on a normal homoscedastic model.

We simulated 1000 complete data sets from the original data (by sampling with replacement), introduced MAR missingness in the dependent variable, and imputed the data using standard and robust MI procedures. Like Haupt et al. (2014), we analysed the complete data using quantile regression. The multiple imputation results were then compared against the complete data results.
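As a rough illustration of the core idea (our sketch, not the procedure actually evaluated), a quantile regression based imputation step can be written with statsmodels' QuantReg: each missing value is replaced by the prediction at a randomly drawn quantile, which preserves the skewness and heteroscedasticity of the conditional distribution. All data and settings below are made up, and refinements required for proper MI (e.g. reflecting estimation uncertainty via bootstrap draws of the observed cases) are omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"x": rng.normal(size=n)})
# Skewed, heteroscedastic outcome: spread increases with |x|
df["y"] = np.exp(0.5 * df["x"]) + (0.3 + 0.4 * np.abs(df["x"])) * rng.gamma(2.0, size=n)
df.loc[rng.random(n) < 0.3, "y"] = np.nan          # missingness in the DV

obs = df["y"].notna()

def impute_once(df, obs, rng):
    """One imputation: for each missing case, draw a random quantile and
    predict from a quantile regression fitted to the observed cases."""
    imp = df["y"].copy()
    for i in df.index[~obs]:
        tau = rng.uniform(0.05, 0.95)              # random quantile draw
        fit = smf.quantreg("y ~ x", df[obs]).fit(q=tau)
        imp[i] = np.asarray(fit.predict(df.loc[[i]]))[0]
    return imp

imputations = [impute_once(df, obs, rng) for _ in range(5)]   # m = 5
```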

Based on our findings, we discourage the use of standard parametric MI procedures when empirical data are skewed and heteroscedastic, as imputations based on a normal homoscedastic model distort the original distribution of the variable quite noticeably and lead to biased statistical inferences. Unless the missing data percentage is negligibly small, we argue in favour of MI procedures that adequately address the relevant aspects of non-normality and heterogeneity of the empirical data at hand.


3. How robust are multiple imputation based inferences in multilevel models?
Professor Martin Spiess (University of Hamburg)
Dr Kristian Kleinke (FernUniversität Hagen)

Multilevel models are usually based on strong distributional assumptions, such as multivariate normality, for all random components. Distributional assumptions about the covariates, on the other hand, are usually not made. If, however, the covariates are subject to missing values and multiple imputation is chosen as the method to compensate for them, then one has to formulate a model for the predictive distribution of these variables, given all other variables and parameters.

These models are prone to erroneous assumptions because the corresponding conditional distributions are usually not themselves of scientific interest. For example, one might have to model the conditional distribution of age given gender, scores from tests and questionnaires, and socio-demographic variables. To make things worse, even if the dependent variable in a regression model is conditionally normally distributed and a covariate is continuous, the reverse regression of the covariate on the dependent variable can only be a linear regression if both are jointly normally distributed. This implies that if a covariate in a multilevel model is to be imputed, then an imputation model assuming that this variable is normally distributed given all observed variable values and parameters is misspecified. Since it will rarely be the case that all variables, parameters and errors are in fact normally distributed, imputations in applications are presumably often generated under misspecified models.
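To make the reverse-regression point concrete: under joint normality of a covariate $X$ and an outcome $Y$, the conditional distribution of $X$ given $Y$ is again normal, with linear mean and constant variance,

$$E(X \mid Y = y) = \mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y), \qquad \operatorname{Var}(X \mid Y = y) = \sigma_X^2\,(1 - \rho^2),$$

whereas without joint normality the conditional mean of $X$ given $Y$ is in general non-linear and the conditional distribution non-normal, so a normal linear imputation model for $X$ is misspecified.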

Thus, in this paper we present the results of a simulation study in which we investigate the consequences of misspecified imputation models for inferences in multilevel models. In particular, we consider covariate distributions that differ in skewness and kurtosis, ignorable missingness mechanisms that differ in their selectivity, and different sample sizes. Finally, the results are discussed in the light of the imputation methods available for multilevel models, and requirements for more flexible imputation techniques are highlighted.
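As a hypothetical illustration of one cell of such a design (all names and parameter values below are our own, not those of the study), one might generate a skewed level-1 covariate, a two-level outcome, and an ignorable (MAR) missingness mechanism whose selectivity is governed by a single coefficient:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_groups, n_per = 50, 20
g = np.repeat(np.arange(n_groups), n_per)          # group identifier

# Skewed level-1 covariate; the skew-normal shape parameter `a`
# controls skewness (a = 0 would give a normal covariate)
x = stats.skewnorm.rvs(a=8, size=n_groups * n_per, random_state=rng)

# Two-level outcome: random intercept per group plus fixed effect of x
u = rng.normal(scale=0.5, size=n_groups)[g]
y = 1.0 + 0.6 * x + u + rng.normal(size=n_groups * n_per)

# Ignorable (MAR) mechanism: the probability that x is missing depends
# only on the always-observed y; larger gamma -> more selective missingness
gamma = 1.5
p_mis = 1.0 / (1.0 + np.exp(-(-1.0 + gamma * (y - y.mean()))))
x_obs = np.where(rng.random(x.size) < p_mis, np.nan, x)
```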