



Tuesday 18th July, 11:00 - 12:30 Room: F2 102


How surveys and big data can work together 2

Chair Dr Mario Callegaro (Google)
Coordinator 1 Dr Yongwei Yang (Google)

Session Details

With more and more data available for answering research questions, the role of surveys in the broader context of social research needs to be redefined.

In this call for papers we are looking for works that bridge the use of big data and survey data, showing how the different signals can work together.

While varying definitions of “big data” exist, for the purpose of this session we encourage authors to consider Groves's (2011) helpful definition of organic data, described in contrast to designed data. In this context, there are five common sources of big data (though the types of data available from these sources may overlap):
-Internet data. These data may take the form of online texts or multimedia data. They may come as social media data or as the by-products of online activities: website metadata, logs, cookies, and website analytics.
-The Internet of Things (IoT). These are data produced by devices that can communicate with one another over the internet using a common transmission protocol. A subset of such data, called behavioral data, is particularly useful for research purposes. These data come from connected devices (smartphones, wearables, etc.) and capture information about locations, physical activities, health status, etc., that users record passively or actively.
-Transaction data. These are digitally captured or stored records of interactions or transactions between individuals and a business, government or not-for-profit entity or among those entities.
-Administrative data. These are data collected and aggregated by public offices, such as national health, tax, school, and benefits records.
-Commercially available databases. These are curated data made available by data broker companies that combine data from the above sources.

For this session, we are looking for practical applications that combine surveys with any type of Big Data in ways that leverage the strengths and mitigate the weaknesses of each. Examples include:
-Using high quality surveys to validate the quality of Big Data sources.
-Using Big Data to ask better questions in surveys.
-Combining both data types to enhance construct coverage or measurement precision.
-The relation between Total Survey Error and Big Data Total Error.

Papers addressing theoretical issues (errors, validity, utility, causal inference, model building/testing, etc.) are also welcome, preferably supported by good empirical evidence or sound simulations.

Paper Details

1. Amplifying survey results with Google Trends data
Mr Jeffrey Oldham (Google, Inc.)
Mr Hal Varian (Google, Inc.)

We show how to combine a US-wide survey with Google Trends data to yield out-of-sample predictions for finer-grained geographies such as cities, market areas, and states. This post-stratification technique yields predictions with uniformly small confidence intervals.

First, survey responses are collected, each tagged with the respondent's geography. These are combined with Google Trends data, indexed by geography and by search vertical, and variable selection is performed to determine which subset of verticals best models the survey responses. Variable selection takes advantage of geographic differences in Google Web Search; e.g., the number of searches in each search category may differ between university towns and towns with strong vehicle manufacturing. Typically, a handful of verticals model the survey responses.
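
The abstract does not give implementation details, so the following is only a rough sketch of the general idea on simulated data: regress coarse-geography survey results on Google Trends vertical indices with an off-the-shelf lasso (so the L1 penalty performs the variable selection), then apply the fitted model to Trends data for finer geographies. All names (vertical_i, survey_share, trends_fine) and the choice of LassoCV are assumptions for illustration, not the authors' method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)

# Hypothetical coarse-geography training data: one row per surveyed geo,
# columns = Google Trends search-vertical indices plus the survey outcome.
verticals = [f"vertical_{i}" for i in range(20)]
n_geos = 60
trends_coarse = pd.DataFrame(rng.normal(size=(n_geos, len(verticals))),
                             columns=verticals)
# Simulated survey outcome driven by a handful of verticals (sparse signal).
survey_share = (0.5
                + 0.10 * trends_coarse["vertical_3"]
                - 0.07 * trends_coarse["vertical_11"]
                + rng.normal(scale=0.02, size=n_geos))

# Variable selection and fitting in one step: the L1 penalty keeps only the
# few verticals that best model the survey responses across geographies.
model = LassoCV(cv=5).fit(trends_coarse[verticals], survey_share)
selected = [v for v, coef in zip(verticals, model.coef_) if abs(coef) > 1e-6]
print("selected verticals:", selected)

# "Amplification": apply the fitted model to Trends data for finer-grained,
# unsurveyed geographies (e.g. cities or market areas) to get predictions.
trends_fine = pd.DataFrame(rng.normal(size=(500, len(verticals))),
                           columns=verticals)
trends_fine["predicted_share"] = model.predict(trends_fine[verticals])
print(trends_fine["predicted_share"].describe())
```

In this toy example only the few verticals that actually drive the simulated outcome should survive the penalty, mirroring the "handful of verticals" described above, and the prediction step corresponds to the fine-grained out-of-sample predictions discussed below.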

The verticals selected for a survey themselves provide insight. For example, those who supported Barack Obama before the 2012 U.S. presidential election were interested in books, basketball, student loans, and hip-hop, but negatively interested in Christianity and coupons.

Using Google Trends data at the geographic level, out-of-sample predictions of responses for fine-grained geographies such as cities, market areas, and states can be computed. Obtaining this level of prediction using a survey alone would greatly increase the survey cost. We call this "survey amplification".

Marketers can identify target audiences by amplifying survey results to yield, for each geo, the fraction of inhabitants likely to give the desired survey answer. Combining these predictions with the cost of advertising identifies cost-effective target areas, for a survey cost of a few hundred or a few thousand dollars.

If one has market sales data broken down by geos, amplification can model sales geographically. Underperforming markets, i.e., geos with lower actual sales than predicted, may need special treatment to improve sales.

Amplification to obtain predictions, vertical models, identification of underperformers, and extrapolations preserves user privacy. Survey results need only be annotated with the associated geographies, and Google Trends data aggregates trillions of web searches, so privacy is preserved. The approach is also inexpensive, requiring just a few hundred or a few thousand dollars to obtain the survey results.


2. Correcting for misclassification under edit restrictions in combined survey-register data using Multiple Imputation Latent Class modelling (MILC)
Mrs Laura Boeschoten (Tilburg University and Statistics Netherlands)
Dr Daniel Oberski (Utrecht University)
Professor Ton de Waal (Statistics Netherlands and Tilburg University)
Dr Marcel Croon (Tilburg University)

National Statistical Institutes (NSIs) often use large datasets to estimate population tables on many different aspects of society. A way to create these rich datasets as efficiently and cost-effectively as possible is by utilizing already available administrative data. When more information is required than is already available, administrative data can be supplemented with survey data. Caution is advised, as both surveys and administrative data can contain classification errors.

Therefore, we developed a method which combines multiple imputation (MI) and Latent Class (LC) analysis (MILC) to estimate the number of classification errors in combined datasets and simultaneously imputes a new variable which takes the uncertainty caused by these classification errors into account. With this method it is possible to obtain estimates that are consistent and that take impossible combinations, such as pregnant males, into account. Such impossible combinations are often referred to as edit rules. Taking edit rules into account is especially useful within official statistics since cells in cross tables that represent a combination of scores that is in practice impossible should contain zero observations.
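
As a very rough illustration of these mechanics, the sketch below works through a simplified two-indicator example with a binary latent variable: posterior class probabilities are computed from assumed (in practice, estimated) LC parameters, an edit rule zeroes out an impossible combination, and several imputations are drawn and pooled with Rubin's rules. All error rates, variable names, and the edit rule itself are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy combined dataset: a register and a survey both measure the same binary
# status with error, plus a covariate ("sex") that enters an edit rule.
n = 2000
sex = rng.integers(0, 2, n)                              # 0 = male, 1 = female
true = np.where(sex == 1, rng.binomial(1, 0.5, n), 0)    # status impossible for males
register = np.where(rng.random(n) < 0.95, true, 1 - true)   # 5% classification error
survey = np.where(rng.random(n) < 0.90, true, 1 - true)     # 10% classification error

# Assumed LC parameters (in practice estimated by maximum likelihood).
p_class = 0.25                   # marginal P(true status = 1)
acc_reg, acc_srv = 0.95, 0.90    # conditional probabilities of correct classification

def posterior(reg, srv):
    """P(true = 1 | register, survey) under local independence of the indicators."""
    like1 = (acc_reg if reg == 1 else 1 - acc_reg) * (acc_srv if srv == 1 else 1 - acc_srv)
    like0 = ((1 - acc_reg) if reg == 1 else acc_reg) * ((1 - acc_srv) if srv == 1 else acc_srv)
    return p_class * like1 / (p_class * like1 + (1 - p_class) * like0)

post = np.array([posterior(r, s) for r, s in zip(register, survey)])

# Edit rule (illustrative): the status is impossible for sex == 0, so the
# posterior probability of that impossible combination is set to zero.
post = np.where(sex == 0, 0.0, post)

# Multiple imputation: draw m imputed status variables, estimate the
# population proportion in each, and pool with Rubin's rules.
m = 5
imputations = rng.random((m, n)) < post        # m imputed true-status vectors
est = imputations.mean(axis=1)                 # per-imputation proportion estimates
within = est * (1 - est) / n                   # within-imputation (binomial) variance
pooled = est.mean()
between = est.var(ddof=1)
total_var = within.mean() + (1 + 1 / m) * between
print(f"pooled estimate: {pooled:.3f}  (SE {np.sqrt(total_var):.3f})")
```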

However, the MILC method only obtains consistent estimates and only takes edit rules into account for variables that were included in the LC model. For variables not included in the LC model, estimates may be inconsistent and impossible combinations may occur. We now extend the MILC method to incorporate stepwise LC modelling, which makes it possible to estimate relations or apply edit rules with covariates that were not taken into account by the initial LC model.

In this paper, we illustrate how we incorporated the three-step approach into the MILC method and discuss the methodology of both the three-step approach and the MILC method. We perform a simulation study to investigate the performance of the MILC method in combination with the three-step approach, and we apply the method in practice.


3. Can we learn from user-created online surveys?
Miss Sarah Cho (SurveyMonkey)

Online survey platforms have made it possible for anyone to quickly and cheaply create, send, and analyze data from their own surveys. Every day, SurveyMonkey users around the world send out 24,000 surveys, and 3 million responses are collected. This research looks at the characteristics of these user-created surveys and the experience of respondents, to glean insights on how online surveys can be improved.

We will examine current trends in surveys, from the perspective of both the survey creator and survey taker. Using our database of surveys, we will produce estimates of average completion rate, completion time, and survey topic and examine how those items differ by survey design (length, question type). We will further segment this analysis by desktop versus mobile survey takers.
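
As a minimal illustration of the kind of breakdown described here, the sketch below tabulates completion rate and completion time by device and survey length on made-up data; the table layout, column names, and length buckets are assumptions, not SurveyMonkey's actual schema.

```python
import pandas as pd

# Hypothetical survey-metadata table; column names and values are illustrative.
surveys = pd.DataFrame({
    "survey_id":           [1, 2, 3, 4, 5, 6],
    "n_questions":         [5, 12, 30, 8, 25, 40],
    "device":              ["desktop", "mobile", "desktop", "mobile", "desktop", "mobile"],
    "completion_rate":     [0.92, 0.81, 0.64, 0.88, 0.70, 0.55],
    "minutes_to_complete": [2.1, 4.5, 11.0, 3.0, 9.5, 15.2],
})

# Bucket surveys by length, then compare completion outcomes by device and length.
surveys["length_bucket"] = pd.cut(surveys["n_questions"],
                                  bins=[0, 10, 25, 100],
                                  labels=["short", "medium", "long"])
summary = (surveys
           .groupby(["device", "length_bucket"], observed=True)
           .agg(mean_completion=("completion_rate", "mean"),
                median_minutes=("minutes_to_complete", "median"),
                n_surveys=("survey_id", "count")))
print(summary)
```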