
Thursday 16th July, 14:00 - 15:30 Room: O-201

Big Data and Survey Research

Convenor: Mr Yamil Nares (Institute for Social & Economic Research (ISER))
Coordinator 1: Dr Tarek Al Baghal (Institute for Social & Economic Research (ISER))

Session Details

As the demand for data has increased in recent years, so has the potential amount of information produced by new technologies. The volume of additional behavioural data produced through people's interactions with technological innovations (smartphones, tablets, laptops), as well as the traces of people's activities through emails, financial transactions, real-time posts, and images and videos shared on social media (Twitter, Facebook, Google Plus, Instagram, YouTube), has garnered the name "Big Data". As the potential uses of "Big Data" seemingly provide a great opportunity to examine and study social behaviour, including changes over time, many have argued that such "Big Data" can largely replace surveys, and may be of better quality, given the problems surveys face, such as increasing nonresponse. However, "Big Data" is frequently used at the macro-level and is more constrained in the types of information collected, while surveys also provide micro-level data. To date, there have been few empirical examinations of the comparative efficacy of "Big Data" and surveys, or of how one may supplement or replace the other. Given the arguments surrounding "Big Data" and the limited amount of research conducted, it is important for social researchers to better understand the comparative and complementary aspects of these data sources, as well as data protection and privacy policies. The current panel calls for papers that examine these issues, particularly those using survey and "Big Data" in a comparative or complementary manner, examining likely sources of error and best practices. While empirical analysis will help build a scientific base for deciding how to use these data better, so will papers discussing these issues theoretically. Possible issues include those of coverage, measurement, and other sources of error; examinations of how to use "Big Data" in combination with surveys; and issues of data protection.

Paper Details

1. Big Data Analytics: Enumerating the Risks to Data Quality
Dr Craig Hill (RTI International)
Dr Paul Biemer (RTI International)

Big Data involve massive amounts of high-dimensional and unstructured data that bring both new opportunities and new challenges to the data analyst. However, Big Data are often selective, incomplete, and erroneous. New errors can be introduced downstream as the data are cleaned, integrated, transformed, and analyzed. We present a total error model for Big Data that enumerates many of the risks of false inference in Big Data analysis. We also describe approaches for minimizing these risks, gleaned from more than a century of experience with analyzing and processing survey data.

2. Mode Preferences in Business Surveys: Evidence from Germany
Dr Christian Seiler (Ifo Institute)

With the worldwide spread of the internet in the 1990s, conducting web or e-mail surveys became popular in research. Although these surveys provide fast data collection and reduced costs, results may suffer from biases due to the survey mode. While a variety of studies on mode effects in household or individual surveys exists, much less is known in the case of business surveys. Our results show that e-mail or web surveys reduce nonresponse and are more likely to be used by larger firms operating in technology-related business areas.

3. Extracting time measures from log-data of questionnaires and complex tests: Comparing Response times in the PISA 2015 Field Trial context assessment between countries
Dr Ulf Kroehne (German Institute for International Educational Research (DIPF), Frankfurt am Main (Germany))
Dr Susanne Kuger (German Institute for International Educational Research (DIPF), Frankfurt am Main (Germany))
Professor Frank Goldhammer (German Institute for International Educational Research (DIPF), Frankfurt am Main (Germany) and Centre for International Student Assessment (ZIB), Germany)

The extraction of response times from log-data is less straightforward when multiple questions are placed on a single screen and when the instrument design, like the one used for the computer-based context assessment of PISA 2015, offers degrees of freedom, e.g., individual sequences, free navigation, and item review. Different time indicators (e.g., time for loading, reading, confirmation, ...) will be defined theoretically, evaluated, and then used to empirically investigate similarities and differences in the response behaviour of about 15,000 teachers and about 120,000 students in the PISA 2015 Field Trial across 62 countries.
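The idea of deriving distinct time indicators from raw log events can be illustrated with a minimal sketch. The event names and log format below are invented for illustration and do not reflect the actual PISA 2015 log schema:

```python
from datetime import datetime

# Hypothetical log events: (ISO timestamp, event type, item id).
# Event names are illustrative, not the actual PISA 2015 log schema.
events = [
    ("2015-04-01T09:00:00", "screen_load", "Q1"),
    ("2015-04-01T09:00:04", "first_interaction", "Q1"),
    ("2015-04-01T09:00:21", "response_confirmed", "Q1"),
]

def time_indicators(events):
    """Derive per-item time indicators from ordered log events."""
    by_item = {}
    for ts, etype, item in events:
        by_item.setdefault(item, {})[etype] = datetime.fromisoformat(ts)
    out = {}
    for item, e in by_item.items():
        out[item] = {
            # "reading" time: screen load until first interaction
            "reading": (e["first_interaction"] - e["screen_load"]).total_seconds(),
            # "responding" time: first interaction until confirmation
            "responding": (e["response_confirmed"] - e["first_interaction"]).total_seconds(),
        }
    return out

print(time_indicators(events))
# {'Q1': {'reading': 4.0, 'responding': 17.0}}
```

With several items on one screen and free navigation, the same item may produce repeated visit events, so a real pipeline would have to decide whether to sum or keep separate the intervals per visit.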

4. SKEY - Statistical Key Value Data Model
Mr Alessandro Capezzuoli (ISTAT)

Statistical data are often disseminated under different standards from web platforms (websites, web services). Normally, the data are stored in a relational database or data warehouse. Each statistical data source has its own relational database schema and DBMS (Oracle, MySQL, Postgres, MSSQL). Each dataset (data and metadata) is disseminated in many ways (RESTful, SOAP) and in many formats (XML, JSON, CSV). Statistical data and metadata can be represented as objects consisting of key/value pairs.
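The key/value representation described above can be sketched roughly as follows. The field names and example values here are illustrative assumptions, not the actual SKEY specification:

```python
import json

# A minimal sketch of a statistical observation as a key/value object,
# independent of the source DBMS. Field names ("dataset", "key", "value",
# "metadata") and the example data are invented for illustration.
observation = {
    "dataset": "unemployment_rate",
    "key": {"geo": "IT", "time": "2015-Q1", "age": "15-24"},
    "value": 42.3,
    "metadata": {"unit": "percent", "status": "provisional"},
}

# The same object serialises uniformly to JSON regardless of whether the
# source stored it in Oracle, MySQL, Postgres or MSSQL.
payload = json.dumps(observation, sort_keys=True)
print(payload)
```

The point of such a model is that a single dissemination layer can serve any source schema, since every observation is reduced to the same generic shape before being rendered as XML, JSON, or CSV.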

5. Linking Social Media to Survey Responses: Possible Issues and Potential Uses
Dr Tarek Al Baghal (University of Essex)
Mr Yamil Nares (University of Essex)

This presentation will discuss how micro-level data linking survey responses to social media can be created, including issues of consent to data linkage, ethics, collecting the social media data, and methods for converting these data into meaningful measures. The necessary text-analytic methods can capture and create a number of potentially useful variables, including improved proxy variables. These measures can then be used in methods for nonresponse adjustment. The presentation will discuss how to evaluate these measures and how they fare in nonresponse correction. As an example, measures from aggregated Twitter data are compared and correlated with Understanding Society data.
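The step of converting text into a proxy measure and correlating it with survey values can be sketched in miniature. The word lists, example tweets, and scoring rule below are invented for illustration; real work would use validated lexicons or classifiers and the actual Understanding Society variables:

```python
# Illustrative only: a crude lexicon-based proxy measure from text,
# plus a Pearson correlation between two aggregated series.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "sad", "angry"}

def sentiment_score(text):
    """Crude proxy: (positive - negative word count) / total words."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

tweets = ["great day happy", "bad sad news today", "good results"]
scores = [sentiment_score(t) for t in tweets]
# Hypothetical aggregated survey means for the same units:
survey_means = [0.8, 0.1, 0.6]
print(round(pearson_r(scores, survey_means), 3))
```

In practice the comparison would be made on aggregates (e.g., regional or temporal means) since individual tweets and survey responses measure different units unless explicit record linkage with consent is in place.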