
ESRA 2019 glance program


Small Area Estimation in the Era of Big Data, Crowdsourced Data and Non-Probability Web Surveys. Topics in Poverty, Social Exclusion and Crime

Session Organisers Dr Angelo Moretti (University of Manchester)
Mr David Buil-Gil (University of Manchester)
Time Friday 19th July, 11:00 - 12:30
Room D18

Small area estimation methods are gaining relevance in academic research in the social sciences due to the growing need for reliable estimates at small geographical levels and for small domains. In the study of poverty, social exclusion and crime, policy-makers require detailed information about the geographical distribution of a large variety of social indicators (e.g. fear of crime, crime rates, poverty measures, unemployment). Unfortunately, the most widely used social sample surveys are not designed to be representative at small area level. Indeed, we are in the presence of the so-called “unplanned domains” phenomenon, where domain membership is not incorporated in the sampling design. Thus, the sample size in each domain is random (and may be large or small) and in many cases zero. When the domain sample size is small, design-based estimation methods may produce estimates with large variability, and when it is zero no direct estimate can be computed at all. Here, indirect model-based estimation methods, in particular small area estimation approaches, can be used to predict target parameters for the small areas.
The Internet and new information and communication technologies offer new opportunities to collect data on social problems. These data include non-probability samples, big data, open data and crowdsourced data. Such data offer many advantages over traditional approaches to data collection. However, open and crowdsourced data are criticised for biases arising from participants' self-selection and from non-probability sampling designs. These new forms of data can be used in different ways in the context of small area estimation techniques:
1. As covariates in small area models.
2. To validate small area estimates (external validation).
3. As target variables in small area models.
In this session we particularly welcome substantive and methodological papers that address these issues in poverty, social exclusion and crime. Moreover, we are interested in how to provide measures of uncertainty for the small area estimates obtained using these data (mean squared errors or confidence intervals). Applications based on data that are relevant to users are important.

Keywords: model-based estimation, non-random samples, crime, social exclusion, measurement

Poverty Mapping in Small Areas: Complex Sampling Problems

Dr Isabel Molina (Department of Statistics, Universidad Carlos III de Madrid) - Presenting Author
Dr María Guadarrama (Luxembourg Institute of Socio-Economic Research (LISER))
Professor J.N.K. Rao (School of Mathematics and Statistics, Carleton University)

Regional poverty maps are used to help governments, supranational institutions and international organizations design, apply and monitor regional development policies more effectively. However, the limited sample size of national surveys prevents the production of statistical figures for highly disaggregated areas using conventional survey estimators, especially when the survey was not planned to produce estimates for the target areas. This problem has led to the development of specific methods for estimation in areas with small survey sample sizes, when the overall survey sample size is large enough. These methods provide much more stable estimates, in the sense of much higher efficiency, at the expense of small increases in design bias. Here we describe specific estimation procedures for small areas that can be applied to general non-linear indicators, including poverty and/or inequality indicators, under complex sampling problems. In particular, we consider informative designs, where the probabilities of selection of individuals for the sample depend on the values of the target variable, and the case of cut-off sampling, where part of the target population is deliberately excluded from the selection. We describe methods that reduce the design bias due to these problems and study their properties. We illustrate the methods through applications with real data. We also provide some thoughts on the use of Big Data and non-probability samples in small area estimation.


Small Area Poverty Indicators Adjusted Using Local Price Indexes

Dr Gaia Bertarelli (University of Pisa/Centre ASESD Camilo Dagum)
Dr Caterina Giusti (University of Pisa/Centre ASESD Camilo Dagum)
Professor Monica Pratesi (University of Pisa/Centre ASESD Camilo Dagum) - Presenting Author
Dr Stefano Marchetti (University of Pisa/Centre ASESD Camilo Dagum)
Dr Francesco Schirripa Spagnolo (University of Pisa/Centre ASESD Camilo Dagum)

Policy makers and stakeholders require reliable measures and indicators in order to design, implement and evaluate intervention policies effectively at both national and local level. Usually, obtaining reliable estimates at local level requires small area estimation techniques.
In this work we focus on estimating the incidence and intensity of monetary poverty at sub-regional level, taking into account the different price levels within the country. Indeed, in Italy the North-South divide has translated into markedly different price levels, which can affect the poverty threshold.
The local price level is accounted for by purchasing power parity indexes computed at sub-regional level from two different big data sources: 1. retail scan data on regional and sub-regional retail volumes (units) and prices for food and grocery (Istat/Nielsen), and 2. the transaction prices of houses made available by the Revenue Agency at Italian sub-municipality level (Revenue Agency - OMI).
Sub-regional level poverty estimates will be obtained using area-level small area models, which link unreliable direct estimates to aggregated auxiliary information, which is often easily available.
This work is motivated by previous evidence of the influence of sub-national (regional) price indexes on local poverty thresholds, depicting a different poverty scenario than that obtained without considering them. This work is partially supported by the MAKSWELL European project G.A. no. 770643.
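As a rough illustration of how a local price index can change the poverty picture, the sketch below computes poverty incidence and intensity as Foster-Greer-Thorbecke indicators with and without a price-adjusted threshold. It is not the authors' procedure: the function name `fgt`, the incomes, the national threshold and the price index are all hypothetical, survey weights are ignored, and scaling the national threshold by the local index is only one possible form of adjustment.

```python
import numpy as np

def fgt(income, threshold, alpha):
    """Foster-Greer-Thorbecke indicator.

    alpha = 0 gives the poverty incidence (head count ratio),
    alpha = 1 gives the poverty intensity (poverty gap).
    Unweighted version: every unit counts equally.
    """
    income = np.asarray(income, dtype=float)
    poor = income < threshold
    # Relative shortfall from the threshold, zero for the non-poor
    gap = np.clip((threshold - income) / threshold, 0.0, None)
    return float(np.mean(np.where(poor, gap ** alpha, 0.0)))

# Hypothetical incomes for one sub-regional area
incomes = [8000, 9500, 10500, 12000]
national_threshold = 10000.0   # hypothetical national poverty line
local_price_index = 0.9        # hypothetical: prices 10% below the national level

# Assumption: the local threshold is the national one scaled by the price index
local_threshold = national_threshold * local_price_index

print("incidence, national threshold:", fgt(incomes, national_threshold, 0))
print("intensity, national threshold:", fgt(incomes, national_threshold, 1))
print("incidence, local threshold:   ", fgt(incomes, local_threshold, 0))
print("intensity, local threshold:   ", fgt(incomes, local_threshold, 1))
```

In this toy area the lower local price level lowers the threshold, so fewer units are counted as poor. In an application, area-level quantities of this kind would be the unreliable direct estimates that enter the small area model.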


Calibrating Big Data for Population Inference

Mr Ali Rafei (University of Michigan)
Dr Carol Flannagan (University of Michigan)
Professor Michael Elliott (University of Michigan) - Presenting Author

Although probability sampling has been the “gold standard” of population inference, rising costs and downward trends in response rates have led to a growing interest in non-probability samples. Non-probability samples, however, can suffer from selection bias. Here we develop “quasi-randomization” weights to improve the representativeness of non-probability samples. This method assumes the non-probability sample actually has an unknown probability sampling mechanism that can be estimated using a reference probability survey. We apply the proposed method to improve the representativeness of the University of Michigan Transportation Research Institute Safety Pilot Study, which consists of a convenience sample of over 3,000 vehicles that were instrumented and followed for an average of one year, using the National Household Transportation Survey as our probability sample of drivers.
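The general idea of estimating an unknown selection mechanism from a reference survey can be sketched as follows. This is a simplified propensity-style illustration on simulated data, not necessarily the estimator developed by the authors. The two samples are stacked, the reference units carry their survey weights, a weighted logistic regression models membership in the non-probability sample, and the inverse of the fitted odds serves as a pseudo-weight. All names (`fit_weighted_logit`, `pseudo_weights`) and all simulation settings are assumptions made for the example.

```python
import numpy as np

def fit_weighted_logit(X, z, w, n_iter=50):
    """Weighted logistic regression by Newton-Raphson.

    X must have an intercept in its first column.
    """
    beta = np.zeros(X.shape[1])
    # Start the intercept at the logit of the weighted mean of z
    zbar = np.sum(w * z) / np.sum(w)
    beta[0] = np.log(zbar / (1.0 - zbar))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (z - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def pseudo_weights(x_nonprob, x_ref, w_ref):
    """Pseudo-weights for non-probability units (single covariate x)."""
    x_all = np.concatenate([x_nonprob, x_ref])
    X = np.column_stack([np.ones_like(x_all), x_all])
    z = np.concatenate([np.ones(len(x_nonprob)), np.zeros(len(x_ref))])
    # Non-probability units get weight 1, reference units their survey weight
    w = np.concatenate([np.ones(len(x_nonprob)), w_ref])
    beta = fit_weighted_logit(X, z, w)
    eta = beta[0] + beta[1] * x_nonprob
    return np.exp(-eta)  # inverse of the fitted odds p / (1 - p)

# Simulated population in which units with large x volunteer more often
rng = np.random.default_rng(2019)
N = 100_000
x_pop = rng.normal(size=N)
pi = 1.0 / (1.0 + np.exp(-(-4.0 + x_pop)))       # unknown in practice
x_np = x_pop[rng.random(N) < pi]                  # non-probability sample
x_ref = x_pop[rng.choice(N, size=2000, replace=False)]  # simple random sample
w_ref = np.full(2000, N / 2000)

w_np = pseudo_weights(x_np, x_ref, w_ref)
unweighted_mean = float(np.mean(x_np))
weighted_mean = float(np.sum(w_np * x_np) / np.sum(w_np))
print("population mean of x:", float(np.mean(x_pop)))
print("unweighted non-probability mean:", unweighted_mean)
print("pseudo-weighted mean:", weighted_mean)
```

Using the inverse odds rather than the inverse probability is a design choice that matters little when the non-probability sample is a small fraction of the population, as it is here.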


Model-Based Methods for Combining Probability and Non-Probability Samples

Dr Nadarajasundaram Ganesh (NORC at the University of Chicago) - Presenting Author
Dr Edward Mulrow (NORC at the University of Chicago)
Ms Vicki Pineau (NORC at the University of Chicago)
Dr Michael Yang (NORC at the University of Chicago)

Probability sampling has been the standard basis for design-based inference from a sample to a target population. In the era of big data and increasing data collection costs, however, there has been growing demand for methods to combine data from probability and non-probability samples in order to improve the cost efficiency of survey estimation without loss of statistical accuracy. In this presentation, we provide an overview of our previous work on using area-level small area models to combine probability and non-probability samples, assuming the smaller probability sample yields unbiased estimates. One key assumption in area-level small area models is that the sampling variances are known; however, in the context of non-probability samples, there is no known methodology to estimate sampling variability. We use a large U.S. midterm election survey to empirically evaluate different estimators for the sampling variance associated with the non-probability sample and how each affects the model-based small area estimator; specifically, for different model-based small area estimators, we compare the coverage probability against nominal coverage to identify a “best” estimator (that would be used in the small area model) for the sampling variability associated with the non-probability sample.


City Data from LFS and Big Data

Ms Sandra Hadam (Federal Statistical Office) - Presenting Author
Professor Timo Schmid (Freie Universität Berlin)

Reliable knowledge of labour force indicators for a country’s population is essential for sound evidence-based policymaking. For instance, the geographic distribution of employment or unemployment rates is used to make decisions regarding the allocation of resources. The Labour Force Survey (LFS) is generally designed to provide reliable estimates for larger domains such as the national or regional level. However, to make policy proposals for urban areas we have to provide information for these areas. One possible way to derive estimates at spatially disaggregated levels is by using small area methods. The production of precise small area estimates relies on the availability of predictive auxiliary variables. Therefore, in addition to LFS information, passively collected mobile phone data will be used as an alternative source. The main idea for this application is to use anonymized and aggregated mobile phone data of the German Telekom as auxiliary variables to estimate LFS indicators for functional urban areas. The methodology follows the approach and procedure of Schmid et al. (2017), who predicted socio-demographic indicators by using mobile phone data for Senegal in combination with survey data. The motivation for using mobile phone data is that they are collected without interruptions and include valuable information on the timing, location and intensities of aggregated mobile events. Since we possess aggregated mobile phone data for Germany, we are able to predict the employment and unemployment rates at spatially disaggregated levels for all functional urban areas. For this purpose we use an area-level model, the Fay-Herriot model, in combination with covariates from mobile phone data. Since the aggregated small area estimates at regional level can differ substantially from the corresponding direct estimator, a benchmark approach is used to achieve internal consistency with the direct estimator at regional level.
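
The combination of a Fay-Herriot model with benchmarking can be sketched on simulated data as follows. This is a minimal illustration, not the authors' implementation. It assumes the sampling variances D are known, estimates the random-effect variance with a simple moment estimator (Prasad-Rao type) truncated at zero, and applies a ratio benchmark, which is only one of several possible benchmark approaches. The covariate, the population shares and the regional target are hypothetical stand-ins.

```python
import numpy as np

def fay_herriot_eblup(y, X, D):
    """EBLUP under y_i = x_i'beta + u_i + e_i with Var(e_i) = D_i known."""
    m, p = X.shape
    # Moment estimator of Var(u_i) based on OLS residuals, truncated at zero
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    sigma2_u = max(0.0, (np.sum(resid ** 2) - np.sum(D * (1.0 - leverage))) / (m - p))
    # Weighted least squares for beta given the estimated variance
    v = sigma2_u + D
    beta = np.linalg.solve(X.T @ (X / v[:, None]), X.T @ (y / v))
    # Shrink each direct estimate towards its regression prediction
    gamma = sigma2_u / v
    theta = gamma * y + (1.0 - gamma) * (X @ beta)
    return theta, gamma, sigma2_u

def ratio_benchmark(theta, share, target):
    """Rescale theta so that its share-weighted sum equals the target."""
    return theta * target / np.sum(share * theta)

# Simulated example with 30 areas and one hypothetical covariate
rng = np.random.default_rng(72)
m = 30
x = rng.uniform(0.0, 1.0, m)                 # stand-in for a mobile phone covariate
X = np.column_stack([np.ones(m), x])
D = rng.uniform(0.2, 1.0, m)                 # sampling variances, assumed known
theta_true = 4.0 + 3.0 * x + rng.normal(0.0, 0.7, m)
y = theta_true + rng.normal(0.0, np.sqrt(D))  # direct estimates (rates in percent)

theta_hat, gamma, sigma2_u = fay_herriot_eblup(y, X, D)

share = rng.uniform(1.0, 5.0, m)
share /= share.sum()                          # hypothetical population shares
regional_direct = float(np.sum(share * y))    # stand-in for the regional direct estimate
theta_bench = ratio_benchmark(theta_hat, share, regional_direct)

print("estimated random-effect variance:", sigma2_u)
print("aggregate before benchmarking:", float(np.sum(share * theta_hat)))
print("aggregate after benchmarking: ", float(np.sum(share * theta_bench)))
```

Areas with a large sampling variance receive a small shrinkage factor `gamma`, so their estimates rely more on the covariate-based prediction.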