Repeatable Data Science: A Demo

Data science is repeatable if results can be reproduced on demand. This is a core tenet of good science; without repeatability, it’s unclear whether a change makes things better, worse, or has no effect at all. Reproducing results relies on two attributes of an analysis: 1) the same inputs yield the same outputs every time they are applied, and 2) the inputs are known and accessible.

Simply doing data science in R or Python vastly increases the likelihood that the same inputs will yield the same outputs (as long as you’ve been careful about random number generation). However, ensuring that inputs are recoverable and shareable can be much harder.
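To make the caveat about random number generation concrete, here is a minimal sketch in Python using only the standard library. The function name `simulate_sample` and the seed values are illustrative assumptions, not part of the bike prediction project.

```python
import random


def simulate_sample(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    # A local generator avoids hidden global state that other code could change.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = simulate_sample(seed=2019)
second = simulate_sample(seed=2019)
other = simulate_sample(seed=2020)

print(first == second)  # same seed: the two runs produce identical draws
print(first == other)   # different seed: the draws are not expected to match
```

Recording the seed alongside the analysis is what turns a random step into a repeatable one; the same idea applies to `set.seed()` in R.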

There are a variety of strategies for increasing input availability: making the data in an analysis centrally available and accessible to anyone who needs it, while keeping it secure.

Thanks to the pins package, which gained support in RStudio Connect as of version 1.7.8, it’s easier than ever to have repeatable data.

What is Pins? #

Pins is an R package that makes it possible to remotely save (“pin”) any object serializable by R, like a data frame or model object. These objects are saved to a “board” such as RStudio Connect. Once a pin is deployed to RStudio Connect, you can use the standard RStudio Connect access controls to share it with others.

Pins is particularly useful for sharing R objects when the objects are:

  1. Relatively small (a few hundred megabytes at most)
  2. Reused across multiple pieces of content
  3. Only needed in their most current form

Things like an auxiliary data frame for an analysis or a statistical model are particularly good candidates for pins.
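The idea of pinning can be sketched with a toy, file-backed board. This is a conceptual illustration in Python only: `ToyBoard`, its `pin` and `get` methods, and the sample station record are all hypothetical, and none of it is the pins package API or real bikeshare data. It assumes a pin is simply a serialized object stored under a name, with each new pin replacing the previous one.

```python
import pickle
import tempfile
from pathlib import Path


class ToyBoard:
    """Hypothetical stand-in for a pin board: one file per pin name."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def pin(self, obj, name):
        # Writing under an existing name overwrites it, so only the
        # most current form of the object is kept.
        with open(self.root / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)

    def get(self, name):
        with open(self.root / f"{name}.pkl", "rb") as f:
            return pickle.load(f)


# A temporary directory plays the role of the remote board here.
board = ToyBoard(tempfile.mkdtemp())

# Made-up record, standing in for a small auxiliary data frame.
station_info = [{"station_id": 1, "name": "Example St", "lat": 38.9, "lon": -77.0}]
board.pin(station_info, "station_info")

restored = board.get("station_info")
print(restored == station_info)  # the object round-trips through the board
```

In practice the board is a shared service such as RStudio Connect rather than a local directory, which is what lets several pieces of content read the same pinned object.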

The Bike Prediction App #

The Bike Prediction app displays the number of bikes predicted to be at the various docks of Washington DC’s bikeshare program in the near future.

The user clicks a dock on the map (built using the leaflet package), and the predicted number of bikes at that station in the near future appears in the bottom half of the page.

The production bike prediction app

Using Pins with Repeatable Data #

The Bike Prediction App uses pins in two ways.

Metadata File #

The app makes use of a metadata file for the stations in the bikeshare system. The station info data frame contains a mapping from the numeric station IDs to their names, latitudes, and longitudes. The data frame is stored as a pin on RStudio Connect and is updated every week by a scheduled R Markdown document on RStudio Connect. The app reads the pin using an RStudio Connect API key stored as an environment variable on RStudio Connect, which keeps the key out of the app’s source code.
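Reading a key from the environment rather than hardcoding it can be sketched as follows. The variable name `CONNECT_API_KEY` and the helper `connect_auth_header` are assumptions made for illustration, and the `Key` authorization scheme should be checked against your server's API documentation before relying on it.

```python
import os


def connect_auth_header(env=os.environ, var_name="CONNECT_API_KEY"):
    """Build an Authorization header from a key held in the environment."""
    api_key = env.get(var_name)
    if not api_key:
        # Fail loudly instead of sending an unauthenticated request.
        raise RuntimeError(f"{var_name} is not set")
    return {"Authorization": f"Key {api_key}"}


# Demo with a fake environment so that no real secret appears in the code.
header = connect_auth_header(env={"CONNECT_API_KEY": "not-a-real-key"})
print(header)
```

Because the key lives only in the server's environment, the source code can be shared or version controlled without exposing it.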

One of the nice features of data pins on RStudio Connect is that users can see a rendering of the data in the RStudio Connect UI.

A data table rendered as a pin on RStudio Connect.

Additionally, access to the pin can be controlled just like for any other asset on RStudio Connect.

Access controls on RStudio Connect.

A Model #

The Bike Prediction App also uses pins to save the current version of the model. The model is automatically re-trained every morning and pinned onto RStudio Connect.

Pinned XGBoost model on RStudio Connect.

This pinned model is then used by both the model assessment script and the plumber API, to assess model quality and to serve the predictions, respectively.

Using Pins, it’s easy to recover the current state of the data or model that’s needed to make a particular analysis work, making your data science work much more repeatable.
