Open Source Interfaces with the Programmable Web Facilitate Replications of Big Data Analyses in Social Science Research

Guest Post: Ulrich Matter and Alois Stutzer, University of Basel


Replicating social science research is becoming more demanding in the age of big data. First, researchers aiming to replicate a study based on massive data face substantial computational costs. Second, and probably more challenging, they are often confronted with “highly unique” data sets derived and compiled from sources with different and unusual formats (because the data were originally generated and recorded for purposes other than data analysis or research). This holds in particular for Internet data from social media, new e-businesses, and digital government. More and more social scientists attempt to exploit these new data sources, following ad hoc procedures in the compilation of their data sets.

We propose the use of Open Source Interfaces (OSIs) for the collection of data from the programmable web. OSIs not only facilitate the compilation and preparation of data sets in a format usable with statistical software, but simultaneously substantially reduce the costs for later project expansions and study replications. In order to make our case for OSIs, we first describe how cheaper and faster access to the Internet in combination with a broader acceptance of certain web standards like REST and JSON strongly facilitates the integration of data across applications and systems. We then briefly explain how OSIs allow making better use of these data streams for social science research. As an example, we refer to the OSI pvsR.

The Data Stream through the Programmable Web

One of the most promising big data sources for the social sciences is the programmable web (or semantic web): a part of the Internet designed to deliver specific information in a dynamic way. The programmable web consists essentially of so-called web application programming interfaces (web APIs) that facilitate the integration and exchange of data across applications and between users. If you use your smartphone to search for restaurants in your area and look up how to get there, an application on your phone exchanges information with the Google Places API, and potentially with several web APIs of local public transport companies, in order to provide you with the necessary information. Web APIs are thus the central nodes between applications in the programmable web. At the same time, they can be the central access points for researchers aiming to systematically collect these data in order to investigate social science research questions. Social science studies based on Twitter data usually rely on the Twitter API in this way. However, the programmable web offers many more APIs than those related to social media. In each case, conducting or replicating a study relies on knowing how to programmatically access the specific web API. The initial costs of gathering the data might be considered worthwhile for an initial study. However, they pose a substantial hurdle for replication studies.

Open Source Interfaces to web APIs

In order to facilitate the extraction of valuable information from the programmable web, we suggest the programming and provision of OSIs. OSIs are free and easy-to-use open source software packages written in a language broadly used for scientific data analysis, such as R or Python. They facilitate access to data provided through web APIs. An OSI basically takes care of three tasks: sending queries to the API, parsing the response, and mapping the parsed nested web data (often in XML or JSON format) to a table-like format favorable for statistical analyses. In other words, an OSI is a wrapper around an API, specifically written for applications in statistical analysis. The figure below illustrates this point.

[Figure: journal.pone.0130501.g002]

Source: Matter and Stutzer (2015)
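
The three tasks an OSI takes care of can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the design of any existing package: the endpoint api.example.org, the "results" key in the response, and all function names are hypothetical, and the sample response is made up so that the sketch runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint used for illustration only; not a real API.
BASE_URL = "https://api.example.org/v1"


def build_url(resource, **params):
    """Task 1a: assemble the query URL for a resource and its parameters."""
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{resource}"
    return f"{url}?{query}" if query else url


def fetch(url):
    """Task 1b: send the query and return the raw response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def flatten(record, prefix=""):
    """Task 3: map one nested record to a flat dict (column name -> value).

    Nested dicts become dotted column names; other values are kept as-is.
    """
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def query_table(resource, fetcher=fetch, **params):
    """Query the API and return the result as a list of flat rows."""
    raw = fetcher(build_url(resource, **params))
    # Task 2: parse the JSON response (assumed to hold a "results" list).
    records = json.loads(raw)["results"]
    return [flatten(record) for record in records]


# Made-up response, so the sketch can be tried without network access.
SAMPLE_RESPONSE = json.dumps({
    "results": [
        {"id": 1, "name": {"first": "Ann", "last": "Lee"}, "office": {"state": "NY"}},
        {"id": 2, "name": {"first": "Bob", "last": "Ray"}, "office": {"state": "CA"}},
    ]
})


def fake_fetch(url):
    """Stand-in for fetch() that returns the made-up response."""
    return SAMPLE_RESPONSE


rows = query_table("candidates", fetcher=fake_fetch, year=2012)
for row in rows:
    print(row)
```

Each row is a flat dictionary with the same columns, so the list can be handed directly to a tabular data structure in the statistical software of choice. Passing the fetcher as an argument is a design choice made here so that the query logic can be tried and tested offline.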

With an OSI available for a particular web API, a simple script documenting how the OSI was applied to query the data used in a specific study (along with the usual script documenting the statistical analyses) is sufficient to replicate the entire study. The development of OSI-like software packages to interact with web APIs has become increasingly popular over the last few years, particularly in the context of the R language. The official R repository (CRAN) and rOpenSci provide lists with links to many such contributions (see here and here). In addition, there is an increasing number of contributions that aim to facilitate the underlying tasks of such software packages (see the CRAN Task View on Web Technologies and Services).

Example: Studying US politics via pvsR

In our own work, we programmed an OSI that allows easy access to information about US politics through the API of Project Vote Smart (PVS). PVS informs the US public in great detail on the political process in the United States, including elections, voting behavior, characteristics of candidates and officials, as well as officials’ actions in office, both at the national and at the subnational level. In doing so, PVS generates and prepares a large amount of information, which is delivered through its webpage www.votesmart.org. However, PVS also provides access to the data through its API in order to facilitate the integration of its data in other applications. The OSI pvsR provides a simple interface between the R statistical computing environment and the PVS API. Researchers studying American politics can thus directly access these data through pvsR. Additional information about pvsR and an example of a replication can be found in our complementary open-access pvsR paper.

Outlook

The amount of data accessible through the programmable web will continue to rise. With this development, the importance of transparent access to the programmable web for social science research is likely to increase as well. This holds in particular with the advent of embedded systems, including sensors and applications in cars or everyday electronic devices, that automatically feed data to web APIs. We are convinced that OSIs help us learn more from these new data sources while maintaining our standards of replicability in the social sciences.

For the future, programmatic approaches facilitating the provision of OSIs might thus be of great relevance. A potentially fruitful path to cope with a greater demand for OSIs is the development of a generic approach to facilitate the most central tasks of OSIs. A suggestion for such a generic approach is RWebData. More on this later.
