ESRA logo




Tuesday 18th July, 09:00 - 10:30 Room: F2 102


How surveys and big data can work together 1

Chair Dr Mario Callegaro (Google)
Coordinator 1 Dr Yongwei Yang (Google)

Session Details

With more and more data available for answering research questions, the role of surveys in the broader context of social research needs to be redefined.

In this call for papers we are looking for works that bridge the use of big data and survey data, showing how the different signals can work together.

While varying definitions of “big data” exist, for the purpose of this session, we encourage authors to consider a helpful definition by Groves (2011) about organic data, described in contrast to designed data. In this context, there are five common sources of big data (but note the types of data available from these sources may overlap):
-Internet data. These data may take the form of online texts or multimedia data. They may come as social media data or as the by-products of online activities: website metadata, logs, cookies, and website analytics.
-The Internet of Things (IoT). These are data produced by devices that can communicate with one another over the internet using a common transmission protocol. A subset of such data, called behavioral data, is particularly useful for research purposes. These data come from connected devices (smartphones, wearables, etc.) and capture information about locations, physical activities, health status, etc. that is passively or actively recorded by the users.
-Transaction data. These are digitally captured or stored records of interactions or transactions between individuals and a business, government or not-for-profit entity or among those entities.
-Administrative data. This is information collected and aggregated by public offices in areas such as national health, tax, schools, and benefits.
-Commercially available databases. These are curated data made available by data broker companies who combine data from the above sources.

For this session, we are looking for practical applications that combine surveys with any type of Big Data in ways that leverage the strengths and mitigate the weaknesses of each. Examples include:
-Using high quality surveys to validate the quality of Big Data sources.
-Using Big Data to ask better questions in surveys.
-Combining both data types to enhance construct coverage or measurement precision.
-The relation between Total Survey Error and Big Data Total Error.

Papers addressing theoretical issues (errors, validity, utility, causal inferences, model building/testing, etc.) are also welcome, preferably supported by good empirical evidence or sound simulations.

Paper Details

1. Unifying Survey Data and Big Data: A Total Quality Perspective
Dr Paul Biemer (RTI International)
Ms Ashley Amaya (RTI International)

This paper considers the strengths and weaknesses of Big Data, survey data and “unified” data (i.e., the combination of survey data with Big Data) through the lens of a total quality framework. This perspective allows data producers to compare the whole range of quality dimensions and error sources across data sources more objectively and comprehensively, and so to identify their relative strengths and weaknesses. Researchers can then use this information to identify methods for combining the different data sources in ways that leverage their positive attributes and diminish their negative ones. The end product is a unified data set with higher total quality than any of its source components. The paper presents results from a survey conducted at the recent International Total Survey Error Workshop (ITSEW 2016, held in Sydney, Australia), which sought to quantify the quality aspects of Big Data and survey data across ten quality dimensions. The total quality framework is then applied to a unified data set developed by the authors for a key variable in the 2015 U.S. Residential Energy Consumption Survey (RECS 2015). In this study, survey data on dwelling unit area were combined with data from an online real estate database to reduce measurement error, nonresponse, sampling error and possibly other errors. The paper reports the results of investigating whether the unified data estimates were of higher quality than estimates derived from the individual data sources.


2. What Can Survey Research and Big Data Do for Each Other: Combining Data to Decrease Total Error?
Dr Daniela Hochfellner (New York University & University of Michigan)
Dr Antje Kirchner (RTI International)

In the past couple of years the demand for big data in the social sciences has increased tremendously. In survey operations especially, big data are considered a valuable resource for mitigating errors in the survey process and in empirical analyses, and vice versa. On the one hand, surveys allow us to assess where big data show coverage problems. For example, American Community Survey information on internet, smartphone or computer usage shows geographic areas where a specific part of the population does not have internet access. These populations are most likely not included in, for example, media data or tracking data. On the other hand, studies show that big data can be used to improve survey sampling frames, to assess nonresponse error, and to support nonresponse adjustment or responsive designs. For example, open data such as reported crime, access to public recreation facilities (parks, pools, etc.), online state administrative records for homeowners, wifi hotspots, etc. can be used to collect more information about non-respondents. This paper discusses the potential of combining survey and big data along the Total Survey Error and Total Big Data Error frameworks by showing practical applications in the field of survey research.


3. Does Big Data mean Big Problems, or Bigger Opportunities?
Mr Pedro Cunha (INE)
Mrs Sonia Quaresma Gonçalves (INE)
Mr Jorge Magalhães (INE)

The traditional approach of designing a survey to collect all the information needed to produce a deliverable, be it a publication, an indicator for national purposes, or a deliverable that fulfils our national obligations to Eurostat or other institutions, is no longer sustainable. Things have changed in recent years, not only in the perception of how the use of administrative data could lessen respondent burden and survey costs, but also because information gathered through sensors and other automated devices can constitute a useful resource.

Many questions arise when we try to use data, be it administrative or sensor generated, to produce official statistics. Most of them arise because the data were not collected for statistical purposes, and their use can be methodologically complex because they do not meet statistical standards on concepts or definitions. There are also selectivity problems and usually no guarantee of the source's continuity and stability.

When a Big Data holder decides to change definitions, collect different data or stop data collection entirely, the statistical offices usually have no leverage to prevent the loss of such data. For this reason, our approach to Big Data that we do not collect ourselves (through web scraping, for example) has been cautious and gradual.

After the 2011 census operation, a national address database was built from the results of the buildings census conducted in parallel with the population census. Since then this database has been enriched with administrative data from the city councils and with several surveys targeted at construction promoters and the like. We are now considering enriching it with a big data source on electricity consumption. This information would be able to tell us whether a household is primary or secondary, once a threshold for electricity consumption is established.
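The threshold idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract gives no cutoff value, unit, or record layout, so the threshold `ANNUAL_KWH_THRESHOLD`, the function `classify_residence`, and the sample readings are all hypothetical.

```python
# Hypothetical cutoff in kWh per year; the abstract does not state a value.
ANNUAL_KWH_THRESHOLD = 1000.0


def classify_residence(annual_kwh, threshold=ANNUAL_KWH_THRESHOLD):
    """Label an address from its annual electricity consumption.

    Returns "primary" at or above the threshold, "secondary" below it,
    and "unknown" when no reading is available.
    """
    if annual_kwh is None:
        return "unknown"
    return "primary" if annual_kwh >= threshold else "secondary"


# Made-up records for illustration: address id -> annual consumption in kWh.
meter_readings = {"A001": 3200.0, "A002": 450.0, "A003": None}

labels = {addr: classify_residence(kwh) for addr, kwh in meter_readings.items()}
print(labels)
```

In practice the cutoff would have to be calibrated, for instance against addresses whose status is already known from the census, and a single fixed value may not suit all dwelling types or regions.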

For the reasons given above, we are looking at only a very small portion of the big data available: smart meters can generate a huge amount of data, but that data can change without much notice. Investing in an unstable source would not be a sound strategic decision at this point, so we are focusing on the address to charge, which is the main interest of our Big Data source holder, and the amount of electricity that should be charged.

Of course, even in this case we try to collaborate with the data holders to introduce some safeguards while we develop the new methodologies that will allow us to combine the sources. Different quality checks have to be applied to the data extracted from Big Data sources, and the combination of these quality checks and new methodologies is addressed in the specific case of the national address database that we present in this paper.


4. Media Exposure and Opinion Formation In an Age of Information Overload
Dr Simon Munzert (University of Mannheim, Mannheim Centre for European Social Research)
Professor Pablo Barberá (University of Southern California)
Dr Andrew Guess (New York University)
Mr JungHwan Yang (University of Wisconsin-Madison)

We present first results from a major online panel tracking study that aims to investigate media exposure and opinion formation in an age of information overload. The overarching project asks whether online sources and social media platforms exacerbate segregation, polarization, and inequality in political knowledge and behavior, and what role weak ties play in moderating citizens’ information diet. To that end, we study how offline events are translated into media coverage, how people decide to consume or avoid coverage, and how these decisions affect attitudes and behavior. We combine passive metering technology to capture the online media consumption of a representative sample of individuals in Germany and the United States, draw on machine learning and natural language processing methods to estimate the topic and ideological slant of each separate piece of content consumed, and in parallel directly survey the panelists at regular intervals to monitor changes in issue attention, opinion, and political knowledge.