All time references are in CEST
Linking Survey Data with Geospatial Data: Potentials, Methods, and Challenges 1
|Session Organisers||Professor Simon Kühne (Bielefeld University); Mr Dorian Tsolak (Bielefeld University)|
|Time||Thursday 20 July, 16:00 - 17:30|
Adding geospatial information to survey data offers new perspectives for social science research. It allows researchers to address regional context effects that play an important role in many research areas. Moreover, combining individual survey data with (aggregated) regional data can aid in closing data gaps and reducing study costs. Over the past years, survey projects have been increasingly enriched with geo-information, for instance, by adding geopositions or municipality codes of survey participants’ residence. In addition, many external data sources of regional indicators across various levels of aggregation are readily available. This includes, but is not limited to, administrative data, social media data, smart data, or satellite imagery. However, survey practitioners still face many challenges in managing, linking, and analyzing survey data in conjunction with geospatial data. As a result, much of the potential for innovative research that combines survey and geodata remains untapped.
The proposed conference session offers the opportunity to exchange expertise on new developments in linking survey data with geospatial information and regional indicators. Possible questions to be discussed are:
- What are potential sources of regional indicators and geospatial information that can be combined with survey data?
- What methods and procedures exist to retrieve and manage big spatial data from online sources?
- How to harmonize area changes over time when analyzing longitudinal survey data?
- How to assure the high data security standards needed for sensitive georeferenced survey data?
- Which techniques from geospatial analytics can be incorporated in common methods for cross-sectional and longitudinal survey data analysis?
- What are applications that highlight the potential of analyzing georeferenced survey data?
Keywords: survey data, geospatial data, record linkage, regional analytics
Dr Sebastian Bähr (Institute for Employment Research (IAB)) - Presenting Author
Dr Jonas Beste (Institute for Employment Research (IAB))
The neighborhood is a significant factor influencing individual behavior and attitudes. To analyze this relationship, researchers combine individual-level data with spatially aggregated information. Previous research usually used administrative units as a spatial basis, which opens up the modifiable areal unit problem: individuals might reside on the fringe of a given spatial unit, so its characteristics might not be meaningful for the individual's life. This issue becomes even more apparent when using higher-level spatial units. Also, individuals themselves contribute to the aggregated data, rendering it endogenous in statistical analyses. Lastly, what constitutes a neighborhood often depends on the research question and the rural or urban study area. Often this decision is driven by data availability, with studies either focusing on the small-scale neighborhood level in a limited study area or high-level spatial areas in nationwide studies.
To overcome these issues, we use geocoded administrative data from the German Federal Employment Agency to generate small-scale spatial information and combine it with multiple years of survey data from the Germany-wide panel study "Labour Market and Social Security" (PASS). With the help of kernel density estimation (KDE) techniques, we generate aggregated characteristics of the immediate living environment (e.g., on foreigners or the unemployed) around the household addresses of PASS participants. Thus, we ensure the relevance of the spatial information for the individuals. Using an adaptive algorithm for the estimation bandwidth, we account for different neighborhood sizes in rural and urban areas. We exclude the PASS households from the KDE and thus generate (in this regard) exogenous spatial information.
Our approach enables us to generate tailored information on the composition of the immediate neighborhood of the PASS participants in terms of sociodemographic and labor market characteristics. We will present the project and give examples of the combined analysis of neighborhood information and individual outcomes.
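The kernel-based neighborhood measure described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a Gaussian kernel, uses the distance to the k-th nearest record as a simple stand-in for an adaptive bandwidth, and computes a kernel-weighted share rather than a full density surface. The function name `adaptive_kernel_share`, the parameter `k`, and all toy records are hypothetical.

```python
import numpy as np


def adaptive_kernel_share(focal_xy, focal_household, pop_xy, pop_flag, pop_household, k=2):
    """Gaussian-kernel-weighted share of flagged persons around one address.

    The bandwidth is the distance to the k-th nearest remaining record, so it
    widens in sparse (rural) areas and narrows in dense (urban) ones. This is
    an illustrative assumption, not the rule used in the study.
    """
    pop_xy = np.asarray(pop_xy, dtype=float)
    pop_flag = np.asarray(pop_flag, dtype=float)

    # Drop the focal household's own records so the measure is not driven by
    # the very respondents whose outcomes are analysed later.
    keep = np.asarray(pop_household) != focal_household
    xy, flag = pop_xy[keep], pop_flag[keep]
    if len(xy) == 0:
        raise ValueError("no records left after excluding the focal household")

    dist = np.linalg.norm(xy - np.asarray(focal_xy, dtype=float), axis=1)
    k = min(k, len(dist))
    bandwidth = max(np.sort(dist)[k - 1], 1e-9)  # guard against a zero bandwidth

    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    share = np.sum(weights * flag) / np.sum(weights)
    return float(share), float(bandwidth)


# Toy records (coordinates in metres); all values are made up for illustration.
pop_xy = [(0, 0), (100, 0), (0, 100), (2000, 0), (0, 0)]
pop_flag = [1, 0, 1, 0, 1]  # 1 = unemployed, 0 = not unemployed
pop_household = ["a", "b", "c", "d", "focal"]

share, bandwidth = adaptive_kernel_share(
    (0, 0), "focal", pop_xy, pop_flag, pop_household, k=2
)
```

In this toy setup the distant record at 2,000 m receives almost no weight, and the focal household's own record is ignored entirely, which mirrors the two properties the abstract emphasizes: local relevance and exogeneity with respect to the surveyed household.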
Professor Henning Best (RPTU Kaiserslautern) - Presenting Author
Dr Ehler Ingmar (RPTU Kaiserslautern)
Professor Tobias Rüttenauer (UCL London)
Dr Felix Bader (BIS Berlin)
We link geo-coded survey data to spatial models for the distribution of air pollution to study Environmental Inequality. In the latest study, we use geo-referenced longitudinal household-level data from the British Household Panel (Understanding Society) for 2009-2019 and the German Socio-Economic Panel (SOEP) for 2008-2016, and link it with estimates of annual air pollution. For this, we construct an index from official pollution estimates provided by the German Environmental Office (Umweltbundesamt) and the British Department for Environment, Food, and Rural Affairs (Defra).
A common problem here is that the survey institutes use different anonymization methods to mask respondents' precise locations, which leads to different levels of precision. For SOEP data, we can link the gridded geo-data of the pollution model to unaltered point coordinates for each household. However, for anonymity reasons, we are only allowed to access the coordinates separately from the final analysis dataset, so we cannot apply geospatial analysis techniques such as testing for spatial autocorrelation. For BHPS data, we have to re-project the pollution data onto the level of Lower Layer Super Output Areas comprising 1500 households on average, so the matching procedure causes some additional imprecision.
There is no simple measure to represent this in the results. Ideally, we would like to have an estimate of the overall uncertainty of our results due to different causes: sampling error, spatial imprecision due to anonymization methods, and imprecision of the underlying models of emission and dispersion of air pollution. There is no established procedure for dealing with this kind of uncertainty yet. We report each source of potential imprecision separately and discuss possible alternatives.
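The two linkage routes described in this abstract, point lookup in a pollution grid versus aggregation of the grid to area units, can be contrasted in a small sketch. It assumes a regular square grid stored as a NumPy array with row 0 at the southern edge, and a same-shaped array of area labels. The function names, grid values, and area codes are hypothetical and do not come from the SOEP, Understanding Society, or the official pollution models.

```python
import numpy as np


def grid_value_at_points(grid, x_min, y_min, cell_size, points):
    """Return the value of the grid cell that contains each (x, y) point."""
    pts = np.asarray(points, dtype=float)
    cols = np.floor((pts[:, 0] - x_min) / cell_size).astype(int)
    rows = np.floor((pts[:, 1] - y_min) / cell_size).astype(int)
    inside = (rows >= 0) & (rows < grid.shape[0]) & (cols >= 0) & (cols < grid.shape[1])
    if not inside.all():
        raise ValueError("at least one point lies outside the grid")
    return grid[rows, cols]


def area_means(grid, area_ids):
    """Average the grid values over all cells that share an area label."""
    return {int(a): float(grid[area_ids == a].mean()) for a in np.unique(area_ids)}


# Toy 2 x 2 pollution grid with 1 km cells (made-up index values).
grid = np.array([[10.0, 20.0],
                 [30.0, 40.0]])
# Hypothetical area labels: the southern row is area 1, the northern row area 2.
area_ids = np.array([[1, 1],
                     [2, 2]])

households = [(500.0, 500.0), (1500.0, 1500.0)]
point_values = grid_value_at_points(grid, 0.0, 0.0, 1000.0, households)
means = area_means(grid, area_ids)
```

The second household sits in a cell with value 40 but would be assigned the area mean of 35 under area-level matching. This gap is one simple way to think about the additional imprecision that area-based anonymization introduces.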
Mrs Shruti Jain (Atlas AI) - Presenting Author
Dr Talip Kilic (World Bank)
Dr Abera Muhamed (Atlas AI)
Ms Siobhan Murray (World Bank)
Dr Vivek Sakhrani (Atlas AI)
With the surge in publicly available high-resolution satellite imagery, there is increasing demand for the use of satellite-based monitoring of agricultural outcomes in developing countries that rely heavily on smallholder agriculture as a key source of employment and incomes. This presentation will draw on two parallel and on-going streams of research that aim to provide recommendations on how large-scale household surveys should be conducted to generate the data needed to train models for satellite-based crop type mapping and crop yield estimation in smallholder farming systems, with a focus on key cereal crops, namely maize, sorghum, millet, rice, teff, wheat and barley. The analysis leverages rich, georeferenced plot-level data from national household surveys that were conducted in Ethiopia, Malawi and Mali by the respective national statistical offices over the period 2017–20 and that are integrated with Sentinel-2 satellite imagery and complementary geospatial data. Depending on the country and the topic (crop area versus crop yield estimation), up to 26,250 in silico experiments are simulated within a machine learning framework to identify (i) the optimal approach to georeferencing smallholder plots, (ii) the minimum volume of survey data required to attain acceptable measures of model performance, (iii) the effects of enforcing plot area thresholds in survey data used for model calibration, (iv) the relative utility of radar versus optical satellite data, and (v) the sensitivity of modeled estimates to the scope of geospatial predictors and the approach to machine learning.
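The question of how much survey data is needed for acceptable model performance, item (ii) above, is commonly explored with repeated subsampling experiments. The following is a minimal sketch of that general idea, not the authors' framework: it assumes an ordinary least squares model, out-of-sample R² as the performance measure, and fully simulated toy data in place of plot-level survey and Sentinel-2 features. The function name `learning_curve` and all parameters are hypothetical.

```python
import numpy as np


def learning_curve(X, y, train_sizes, n_repeats=20, seed=0):
    """Mean out-of-sample R^2 of a linear model for several training sizes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = n // 4  # hold out a quarter of the records in every repetition
    results = {}
    for size in train_sizes:
        scores = []
        for _ in range(n_repeats):
            order = rng.permutation(n)
            test, train = order[:n_test], order[n_test:n_test + size]
            # Fit ordinary least squares with an intercept on the subsample.
            design = np.column_stack([np.ones(len(train)), X[train]])
            coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
            pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
            ss_res = np.sum((y[test] - pred) ** 2)
            ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
            scores.append(1.0 - ss_res / ss_tot)
        results[size] = float(np.mean(scores))
    return results


# Simulated stand-in data: three predictors and a noisy linear outcome.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + data_rng.normal(scale=1.0, size=400)

curve = learning_curve(X, y, train_sizes=[8, 40, 200])
```

Reading `curve` from the smallest to the largest training size shows where additional records stop improving performance noticeably, which is the kind of evidence such experiments aim to provide for survey design.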
Dr Stefan Jünger (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Spatially linking georeferenced survey data with small-scale geospatial data can be very time-consuming. Researchers must ensure that the attributes from these operations are suitable for further analysis, as population sample data from surveys are sparsely scattered in space, which can lead to missing or skewed data. Moreover, none of these steps, which involve data preparation, processing, and analysis, can be done in researchers' own offices. Instead, due to data sensitivity issues, researchers must travel to on-site facilities of research data centers, resulting in even longer waiting times. Guest workplaces are rare, and even modern solutions such as remote desktops often cannot be exploited because georeferenced survey data are too sensitive to be distributed via these technologies. In a nutshell, researchers must compete with other researchers for guest workplaces, and research data centers have scaling issues in offering their data. Is there anything that can be done? In my talk, I will present synthetic data as one solution to reduce the on-site time for individual researchers, leading to more free spots for others. These synthetic data mimic the geographic structure of the georeferenced survey data but can be used by researchers in their offices, at home, or wherever they prefer. Thus, all steps involving Geographic Information Systems and the final analysis can be prepared ahead of time and do not have to be carried out on-site. In this scenario, the on-site visit only serves as a last run on the actual data to produce results ready to be published. I will present a framework for this synthetic data solution and provide examples based on the georeferenced data from GESIS. Finally, I will discuss the approach as a perfect ground to prevent researchers from p-hacking and HARKing, enabling them to follow the ideal of preregistered studies.
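One simple way to picture synthetic coordinates that mimic the geographic structure of real ones is to redraw each point at random inside its own grid cell, so that cell-level counts are preserved while exact locations are not. The sketch below illustrates only this idea. It is not the framework presented in the talk, it makes no claim about disclosure protection, and the function name, the 1 km cell size, and the coordinates are hypothetical.

```python
import numpy as np


def synthesize_points(real_xy, cell_size, seed=0):
    """Draw one synthetic point per real point, uniformly inside the same grid cell."""
    rng = np.random.default_rng(seed)
    real_xy = np.asarray(real_xy, dtype=float)
    # Lower-left corner of the grid cell that contains each real point.
    cell_origin = np.floor(real_xy / cell_size) * cell_size
    # Uniform offsets in [0, cell_size) keep every synthetic point in its cell.
    return cell_origin + rng.uniform(0.0, cell_size, size=real_xy.shape)


# Made-up household coordinates in metres.
real = np.array([[1234.0, 5678.0],
                 [1890.0, 5100.0],
                 [7050.0, 250.0]])
synthetic = synthesize_points(real, cell_size=1000.0)
```

A linking script developed against `synthetic` can later be rerun unchanged on the real coordinates during the on-site visit, since both arrays share the same shape and the same cell membership.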
Mr Dennis Abel (GESIS - Leibniz Institute for the Social Sciences) - Presenting Author
Mr Stefan Juenger (GESIS - Leibniz Institute for the Social Sciences)
Social scientific interest in environmental indicators to measure attitudes and behaviour in the context of climate change, the energy transition, local pollution or biodiversity loss has grown significantly in recent years. This requires the linking of geo-referenced survey data with precisely these influencing factors. For the majority of researchers in the social sciences, however, earth observation and satellite data still represent a black box. "GESIS meets Copernicus" (GxC) addresses this gap and aims to create a (semi-)automated interface to the European Earth Observation Programme "Copernicus" and other earth observation databases for social science researchers. The core of the project is the creation of a data tool in R for the user-friendly linking of social science data with data from earth observation programmes. At the same time, the project lends itself to providing GESIS's own survey data sets (for example GESIS Panel and GLES) and digital behavioural data with a standard battery of spatially and temporally explicit geodata (weather, climate, atmosphere, land use, etc.). The project aims to support environmental social science research by providing a data tool that does not require advanced knowledge in processing earth observation data. The aim is to make linkage possibilities more flexible and to give researchers greater freedom over spatial scales and scope as well as temporal specifications. In doing so, we address a central problem that geodata often exist in separate data silos of national authorities.
The lecture will present the workflow and the use of the tool. Based on this, we will discuss a use case by linking a social science survey data set to the developed indicators and elaborate on the potential fields of application.