ESRA 2017 Programme
Wednesday 19th July, 11:00 - 12:30 Room: F2 103
Handling missing data 2
|Chair||Professor George Ploubidis (University College London)|
|Coordinator 1||Mr Brian Dodgeon (University College London)|
Session Details
Selection bias, in the form of incomplete or missing data, is unavoidable in surveys. It results in smaller samples, incomplete histories, lower statistical power, and bias in sample composition if missingness is related to the observed and unobserved characteristics of respondents. It is well known that unbiased estimates cannot be obtained without properly addressing the implications of incompleteness. In this session we focus on item missingness, survey non-response, and attrition over time in longitudinal surveys. We aim to identify best practices for dealing with missing data.
Under Rubin’s framework, three types of missingness exist: missing completely at random (MCAR), where the likelihood of response is unrelated to the respondents’ characteristics; missing at random (MAR), where the likelihood of response is explained by the observed characteristics of respondents; and missing not at random (MNAR), where the likelihood of response is related to both observed and unobserved characteristics of respondents.
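As a toy illustration of the three mechanisms (not part of the session materials; all distributions and response probabilities below are invented for the example), a small simulation shows how only MCAR leaves the mean of the observed values unbiased:

```python
import random

random.seed(42)

# Toy data: an observed covariate x and a survey answer y that depends on x.
n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def mcar(x, y):
    # MCAR: response probability is constant, unrelated to x or y.
    return random.random() < 0.7

def mar(x, y):
    # MAR: response probability depends only on the observed covariate x.
    return random.random() < (0.9 if x > 0 else 0.5)

def mnar(x, y):
    # MNAR: response probability depends on the (possibly unobserved) value y.
    return random.random() < (0.9 if y > 0 else 0.5)

means = {}
for name, mech in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    observed = [y for x, y in data if mech(x, y)]
    means[name] = sum(observed) / len(observed)
    print(f"{name}: mean of observed y = {means[name]:+.2f}")
# The true mean of y is 0; under MAR and MNAR the observed mean drifts upward
# because responders over-represent high x (MAR) or high y (MNAR).
```

Under MAR the drift can be corrected using x; under MNAR it cannot be corrected from the observed data alone.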
The objective of our session is to examine the principled techniques commonly used to deal with missing data. These include inverse probability weights, multiple imputation, and full information maximum likelihood (FIML). All of these techniques rely on the MAR assumption, and therefore their plausibility depends on the ability of the researcher to identify the predictors of response.
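A minimal sketch of one of these techniques, inverse probability weighting under MAR, using cell-based response propensities on an invented covariate (real applications would typically model the propensity, e.g. by logistic regression):

```python
import random

random.seed(1)

# Toy sample: education (observed for everyone) predicts both the response
# propensity and the outcome, so complete-case analysis is biased under MAR.
n = 10_000
sample = []
for _ in range(n):
    educ = random.choice(["low", "high"])
    outcome = (1.0 if educ == "high" else 0.0) + random.gauss(0, 0.5)
    responded = random.random() < (0.8 if educ == "high" else 0.4)
    sample.append((educ, outcome, responded))

# Step 1: estimate response propensities within cells of the observed covariate.
cells = {}
for educ, _, responded in sample:
    n_cell, n_resp = cells.get(educ, (0, 0))
    cells[educ] = (n_cell + 1, n_resp + responded)
propensity = {e: n_resp / n_cell for e, (n_cell, n_resp) in cells.items()}

# Step 2: weight each respondent by the inverse of its estimated propensity.
num = sum(y / propensity[e] for e, y, r in sample if r)
den = sum(1 / propensity[e] for e, y, r in sample if r)
ipw_mean = num / den

cc_mean = (sum(y for _, y, r in sample if r)
           / sum(1 for _, _, r in sample if r))
print(f"complete-case mean: {cc_mean:.2f}  IPW mean: {ipw_mean:.2f}")
# The true mean is 0.5; the unweighted complete-case mean overshoots it
# because highly educated (high-outcome) respondents respond more often.
```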
Contributors are welcome to contrast these techniques with other procedures such as case-wise deletion, mean replacement, regression imputation, selection models (e.g. Heckman selection models), and others. Moreover, theoretical, empirical, and substantive applications of these techniques will be considered for presentation.
Paper Details
1. Multiple Imputation of Missing Values: Opportunities and limitations in “real-life” applications
Ms Laura Ravazzini (FORS)
Dr Michael Ochsner (FORS)
Missing data occur in almost all survey research. On the one hand, methods to handle missing data have been improved and implemented in many statistical packages over the last decades. On the other hand, most research on missing data is based on simulated data. When applied to “real-life” data, and especially to attitudinal survey data, there are often technical difficulties, such as non-convergence or complicated estimation models, that deter researchers from applying adequate missing data handling techniques. Therefore, many researchers are uncomfortable with the treatment of missing data or ignore the missingness altogether. Inadequate handling of missing data can lead to biased estimates of parameters such as means or regression coefficients, as well as of their standard errors. All these issues can, of course, distort inference.
In this presentation, we will problematize practical issues that arise in the application of missing data handling techniques in research situations with “real-life” data. We will show that, in fact, these problems often turn out to be very useful hints at problems of data quality or at small problems in the models of interest. We will take the data from the European Social Survey (ESS) 2008 (Round 4) as an example of “real-life” survey data. The ESS is a survey of high quality, characterized by low item-nonresponse for almost all variables and complete documentation. We will show the issues that we were confronted with in an applied research question on attitudes towards redistribution. In this research question, the variable on household income contains key information. Despite the high quality of the ESS data, this variable is affected by a high non-response rate (on average, 20% of answers are missing or incomplete). Furthermore, the handling of the missing data is complicated by the categorical nature of many variables in the ESS. First, we will demonstrate how the process of applying multiple imputation to the ESS data helped to detect data issues, e.g. comparability issues between countries. Then, we will apply different commonly used approaches to missing data that do not (necessarily) involve any reflection on the missingness (i.e. mean substitution, 0-substitution and complete case analysis) and compare them to the results obtained with multiple imputation. Finally, we will reflect on the needs of applied survey researchers regarding the application of multiple imputation and other sophisticated methods to handle missing data.
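The contrast between the naive fixes and multiple imputation can be sketched on simulated data (all variables and parameters below are hypothetical; the authors work with ESS income data, not this toy example):

```python
import random
import statistics

random.seed(7)

# Toy "income" variable with MAR missingness: higher-educated respondents
# are more likely to skip the income question, as in many real surveys.
n = 5_000
educ = [random.gauss(0, 1) for _ in range(n)]
income = [30 + 10 * e + random.gauss(0, 5) for e in educ]
observed = [random.random() > (0.4 if e > 0 else 0.1) for e in educ]

obs_pairs = [(e, y) for e, y, o in zip(educ, income, observed) if o]

# Complete-case analysis is biased under MAR; mean substitution keeps this
# biased mean and additionally shrinks the variance artificially.
cc_mean = statistics.mean(y for _, y in obs_pairs)

# Multiple imputation: fit income ~ educ on complete cases, draw M completed
# data sets that add residual noise, and pool the M means (Rubin's rules).
ex = statistics.mean(e for e, _ in obs_pairs)
ey = statistics.mean(y for _, y in obs_pairs)
slope = (sum((e - ex) * (y - ey) for e, y in obs_pairs)
         / sum((e - ex) ** 2 for e, _ in obs_pairs))
intercept = ey - slope * ex
resid_sd = statistics.stdev(y - intercept - slope * e for e, y in obs_pairs)

M = 20
pooled = []
for _ in range(M):
    completed = [y if o else intercept + slope * e + random.gauss(0, resid_sd)
                 for e, y, o in zip(educ, income, observed)]
    pooled.append(statistics.mean(completed))
mi_mean = statistics.mean(pooled)

print(f"true mean ~30, complete-case: {cc_mean:.1f}, MI: {mi_mean:.1f}")
```

Because the education covariate drives the missingness, the imputation model can repair the bias that complete-case analysis and mean substitution leave in place.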
2. Using missing data to impute missing data
Mr Micha Fischer (University of Michigan)
Mrs Felicitas Mittereder (University of Michigan)
In survey data, to impute missing values in a variable of interest we usually assume complete covariates. Often this is not the case, and we first impute the missing values within the covariates (a pre-imputation step). This method can be problematic in three ways: first, the pre-imputation step can result in a less efficient imputation process. Second, if the imputation models in the pre-imputation step are not correctly specified, we might introduce imputation bias into survey estimates. Third, we lose the information that the respondent chose not to answer the question. Under both MAR and MNAR, this information by itself might be important for the imputation model. Thus, the item missing pattern of respondents can be informative for the outcome variables. By including missingness as its own category we can improve imputation accuracy and therefore the estimators for survey data. Tree-based methods (e.g. random forests) can incorporate this additional information and at the same time account for complex interactions among the covariates. Using random forests in a simulation study, we show that our approach is more precise and accurate than the usual approaches (e.g. MICE, hot-deck). Further, we apply this method to survey data with validation data from administrative records and show that our results from the simulation study hold true.
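The missingness-as-category idea can be sketched with a deliberately simple cell-mean imputer (the paper uses random forests; the data, cells, and parameters below are invented for illustration):

```python
import random
import statistics

random.seed(3)

# Toy data: a categorical covariate that is itself sometimes missing, and an
# outcome to impute. Crucially, *whether* the covariate is missing carries
# information about the outcome (respondents who skip it differ).
n = 6_000
rows = []
for _ in range(n):
    cat = random.choice(["a", "b"])
    skipped = random.random() < 0.3          # covariate item nonresponse
    base = {"a": 1.0, "b": 2.0}[cat]
    y = base + (1.5 if skipped else 0.0) + random.gauss(0, 0.3)
    y_obs = y if random.random() < 0.7 else None   # outcome nonresponse
    rows.append((None if skipped else cat, y_obs, y))

# Instead of pre-imputing the covariate, code its missingness as a category
# of its own ("<missing>") and impute y from cell means over that coding.
def cell(cat):
    return cat if cat is not None else "<missing>"

by_cell = {}
for cat, y_obs, _ in rows:
    if y_obs is not None:
        by_cell.setdefault(cell(cat), []).append(y_obs)
cell_means = {k: statistics.mean(v) for k, v in by_cell.items()}

imputed = [(y_obs if y_obs is not None else cell_means[cell(cat)], y_true)
           for cat, y_obs, y_true in rows]
rmse = (sum((yhat - y) ** 2 for yhat, y in imputed) / n) ** 0.5
print(f"RMSE with missingness-as-category: {rmse:.2f}")
```

Dropping the `"<missing>"` cell (i.e. pre-imputing the covariate) would discard the systematic +1.5 shift among skippers and inflate the imputation error.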
3. Handling non-response bias in polls: the Spanish general elections (1979-2015)
Mr Pablo Cabrera Alvarez (Universidad de Salamanca)
Mr Modesto Escobar Mercado (Universidad de Salamanca)
Unit and item non-response have been extensively studied in the field of survey methodology (Bethlehem, Cobben, & Schouten, 2011; Dillman, Eltinge, Groves, & Little, 2002; Groves & Couper, 1998). If respondents differ from non-respondents in the studied characteristic, opinion or behavior, the estimate will be biased. This is also a problem in election polls where one of the main objectives is to estimate the vote share of the parties in the electoral race. However, in the case of election polls, the election day provides a unique opportunity to assess the magnitude and direction of bias.
Additional information collected in the poll and auxiliary data can be used to build a multiple imputation model or to compute a weight that re-balances the sample of respondents. These methods’ effectiveness relies on a missing at random (MAR) mechanism; if the auxiliary and additional data are not related to response and to the target variable, the adjustments will not be effective (Little & Rubin, 1987). The objective of this paper is to assess to what extent multiple imputation techniques and post-stratification weighting are effective in tackling non-response bias in vote share estimates.
In this paper, we compare two different multiple imputation techniques: univariate, which imputes one variable at a time, and multivariate, in which more than one variable is imputed. The multivariate imputation is carried out using a chained equations method (Royston, 2009). These two conditions are tested both on their own and in combination with post-stratification weighting. In addition, we try different sets of variables for the imputation models and turnout adjustments. Finally, all possible combinations of conditions are compared in order to assess their effect on the voting estimates.
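A minimal sketch of the chained-equations idea (the paper relies on Stata's chained-equations routines; the toy data and the noise-free regression imputations below are simplifications, as a full implementation would add residual draws and run several imputed data sets):

```python
import random
import statistics

random.seed(11)

# Toy data: two correlated variables, each with some missing values, so
# neither can serve as a complete covariate for the other up front.
n = 3_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * v + random.gauss(0, 0.6) for v in x]
x_obs = [v if random.random() < 0.8 else None for v in x]
y_obs = [v if random.random() < 0.8 else None for v in y]

def regress(pred, target):
    """Slope/intercept of target ~ pred over pairs where both are present."""
    pairs = [(p, t) for p, t in zip(pred, target)
             if p is not None and t is not None]
    mp = statistics.mean(p for p, _ in pairs)
    mt = statistics.mean(t for _, t in pairs)
    slope = (sum((p - mp) * (t - mt) for p, t in pairs)
             / sum((p - mp) ** 2 for p, _ in pairs))
    return slope, mt - slope * mp

def completed(obs, fill):
    return [f if o is None else o for o, f in zip(obs, fill)]

# Start from mean substitution, then cycle: re-impute each variable from the
# current completed version of the other (one "chain" of chained equations).
x_fill = [statistics.mean(v for v in x_obs if v is not None)] * n
y_fill = [statistics.mean(v for v in y_obs if v is not None)] * n
for _ in range(5):
    xs = completed(x_obs, x_fill)
    b, a = regress(xs, y_obs)
    y_fill = [a + b * v for v in xs]
    ys = completed(y_obs, y_fill)
    b, a = regress(ys, x_obs)
    x_fill = [a + b * v for v in ys]

print("completed x/y have no gaps:",
      all(v is not None
          for v in completed(x_obs, x_fill) + completed(y_obs, y_fill)))
```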
This study focuses on the general elections held in Spain in the period 1979-2015. The Spanish case constitutes a good test case because of the relatively high number of missing values in the voting variable and the possibility of validating the results against the actual election outcomes. The data used in this paper are the pre- and post-election polls carried out by the Spanish Center for Sociological Research (CIS). These studies have had a consistent methodology over time, which makes the cross-sectional data comparable.
Bethlehem, J., Cobben, F., & Schouten, B. (2011). Handbook of Nonresponse in Household Surveys.
Dillman, D., Eltinge, J., Groves, R. M., & Little, R. (2002). Survey nonresponse in design, data collection and analysis. In Survey nonresponse (pp. 3–26). New York: Wiley & Sons.
Groves, R., & Couper, M. (1998). Nonresponse in household interview surveys.
Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical Statistics.
Royston, P. (2009). Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal, 9(3), 466–477.
4. Nonvoters as a missing data problem – Evaluating multiple imputation to predict party preferences
Mr Stefan Haußner (University of Duisburg-Essen)
For years, voter turnout has been declining, or at least stagnating at a low level, at all political levels. The numbers world-wide, and for various pan-European decision-making levels in particular, are alarming. Although we know much about why citizens cast their vote or do not participate in an election, the consequences of a socially imbalanced voter turnout remain unclear. While it is uncontested that socio-economic status is strongly correlated with participation, and the general belief is that primarily parties of the center-left would profit from higher turnout rates, empirical research fails to confirm this political myth convincingly. The current state of quantitative empirical research remains heterogeneous and fragmented. This paper discusses the multiple imputation (MI) method as a more sophisticated approach to this topic. Regarding nonvoters as missing data, some research suggests this method to replace missing party preferences with reasonable estimates in order to simulate universal turnout. One can then compare the original election results with the simulated ones. Unfortunately, the appropriateness of this application is rarely discussed.
This paper tackles this shortcoming and discusses whether the application of the method meets the required methodological assumptions, i.e. the type of missingness or the transferability of a voter model of party preferences to nonvoters. Furthermore, the paper evaluates the statistical performance of MI using a validation-set approach with data from the 6th and 7th waves of the European Social Survey and the European Election Studies 2009 and 2014. Thus the research design covers national-level elections as well as European Parliament elections, which are known as elections with exceedingly low turnout rates. Some of the ‘true values’ of the voter samples are temporarily deleted and new values are imputed for them, in order to cross-validate the share of ‘correct’ imputations. In addition to this predictive-accuracy criterion, the paper uses an approach of Chambers (2011) to evaluate distributional accuracy, i.e. to check how well the method preserves the ‘true’ distribution of party preferences. It is expected that the performance of MI and the underlying multinomial regression model strongly depends on the number of parties taking part in the elections. Because of the high complexity of party preference estimation, it is presumed that a high number of imputations might be necessary to obtain a stable estimate. Both hypotheses will be examined.
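The validation-set idea can be sketched as follows (a modal-category imputer stands in for the paper's multinomial regression model; the electorate, groups, and preference shares below are all invented):

```python
import random
from collections import Counter

random.seed(5)

# Toy electorate: an observed covariate ("group") predicts party preference.
parties = ["A", "B", "C"]
prefs_by_group = {0: [0.6, 0.3, 0.1], 1: [0.2, 0.5, 0.3]}

def draw(group):
    r, acc = random.random(), 0.0
    for party, p in zip(parties, prefs_by_group[group]):
        acc += p
        if r < acc:
            return party
    return parties[-1]

n = 4_000
voters = [(g, draw(g)) for g in (random.choice([0, 1]) for _ in range(n))]

# Validation-set approach: hide 20% of the known preferences, impute them,
# and score the share of correct imputations (predictive accuracy).
hidden = [random.random() < 0.2 for _ in range(n)]
train = [(g, p) for (g, p), h in zip(voters, hidden) if not h]

# Simple stand-in for a fitted preference model: impute the modal
# preference within each covariate cell.
modal = {g: Counter(p for gg, p in train if gg == g).most_common(1)[0][0]
         for g in (0, 1)}
test_set = [(g, p) for (g, p), h in zip(voters, hidden) if h]
accuracy = sum(modal[g] == p for g, p in test_set) / len(test_set)
print(f"imputation accuracy on held-out preferences: {accuracy:.2f}")
```

Distributional accuracy would additionally compare the imputed party shares with the held-out shares, since a method can score well on accuracy while still distorting the distribution.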
To sum up, the paper examines the innovative approach of regarding party preferences of nonvoters as missing data. It discusses potential methodological shortcomings as well as the statistical performance of the MI method. The objective of the paper is twofold: providing added value to the methodological discussion around MI, and searching for reasons why quantitative research has been unable to clarify whether higher turnout rates would have a considerable impact on election results in Europe.
5. Weighting and reweighting in social surveys
Professor Seppo Laaksonen (University of Helsinki)
The survey process consists of many phases. This paper is concerned with survey weighting, using a terminology that is not completely conventional. Weighting in general is a crucial part of the survey process that should be taken into account from its beginning.
The first proper phase towards weighting is sampling design, in which the inclusion probabilities of each survey stage receive most attention. At the same time, however, it is necessary to gather auxiliary information, the collection of which can start from the same sources from which the sample is drawn. Auxiliary variables at this phase can be at either a macro or a micro level. Macro variables are aggregates (statistics) over the sample units, whereas micro auxiliary variables refer to those units themselves (to individuals). These macro variables are also used for calculating inclusion probabilities. As soon as the gross sample is drawn, it is possible to calculate the so-called design weights, provided that the inclusion probabilities are complete, that is, without missing values. Unfortunately, this is not always the case in multi-stage designs, where only the primary sampling units are definitely complete; secondary or tertiary units are not, since these units need to be contacted to obtain this information.
Our first conclusion is thus that the design weights are complete only if the inclusion probabilities are. It is, however, good to calculate the best possible weights from the inclusion probabilities even when they are not ideal. This can be done by assuming that the response mechanism is missing completely at random (MCAR), that is, that the respondents, the non-respondents and the ineligible units are similar in the second and later stages. Such weights are here called ‘basic weights’, although some conventional literature uses the term ‘design weights’ for them; they are thus conditional on MCAR. We use the same term also in the usual case where the MCAR mechanism holds within strata but not between strata. However, we extend this mechanism further into the term ‘missing at random conditional on sampling design’ (MARS). It is a component of the conventional term ‘missing at random’ (MAR), which is conditional on all auxiliary variables available and used in a survey. The basic weights can be adjusted further if macro or micro auxiliary variables, or preferably both, are gathered into the sampling design data file during and after the survey fieldwork.
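The distinction between design weights and basic weights can be sketched with invented stratum counts (a single-stage stratified design, so the inclusion probabilities are complete):

```python
# Design weights vs basic weights: within each stratum the design weight is
# the inverse inclusion probability; under MCAR-within-strata the basic
# weight rescales it by the stratum's response rate.
strata = {
    # stratum: (population size, gross sample size, respondents)
    "urban": (80_000, 800, 560),
    "rural": (20_000, 400, 320),
}

weights = {}
for name, (pop, gross, resp) in strata.items():
    incl_prob = gross / pop             # complete, so design weights exist
    design_w = 1 / incl_prob
    response_rate = resp / gross
    basic_w = design_w / response_rate  # MCAR assumed within the stratum
    weights[name] = (design_w, basic_w)
    print(f"{name}: design weight {design_w:.0f}, basic weight {basic_w:.1f}")

# Sanity check: basic weights summed over respondents recover the population.
est_pop = sum(resp * weights[name][1]
              for name, (_, _, resp) in strata.items())
print("weighted respondent total:", round(est_pop))
```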
There are now several strategies for weighting adjustments. Our paper focuses on a framework that helps with reweighting in particular. We also apply the framework to empirical survey micro data, including the two following weighting ‘chains’: (i) basic weights as the starting weights in calibration, and (ii) response propensity weights as the starting weights in calibration. We compare the different weights, as well as some estimates using these weights.
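A one-dimensional calibration step of the kind used at the end of such a chain can be sketched as follows (the starting weights and population margins are invented; real calibration handles several margins simultaneously, e.g. by raking):

```python
# Calibration sketch: rescale starting weights (e.g. the basic weights) so
# that weighted totals match known population counts by sex.
known_totals = {"female": 52_000, "male": 48_000}

# Hypothetical respondents: (sex, starting weight).
respondents = ([("female", 95.0)] * 420 + [("female", 120.0)] * 60
               + [("male", 95.0)] * 380 + [("male", 120.0)] * 55)

weighted = {}
for sex, w in respondents:
    weighted[sex] = weighted.get(sex, 0.0) + w

# Calibration factor per post-stratum: known total / weighted sample total.
factors = {sex: known_totals[sex] / weighted[sex] for sex in known_totals}
calibrated = [(sex, w * factors[sex]) for sex, w in respondents]

check = {}
for sex, w in calibrated:
    check[sex] = check.get(sex, 0.0) + w
print(check)  # matches the known totals by construction
```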