ETL on RStudio Connect made easy

How-to: Build an ETL Process with RStudio Connect using R Markdown Output Files

There are several ways to architect an ETL process using scheduled R Markdown on RStudio Connect. One of the big decisions you’ll need to make is choosing where to land the resulting data.

We have examples of ETL processes that land data in another database and in a shared persistent storage location. This guide describes how R Markdown output files accessed over HTTP can be a good alternative to setting up a shared read/write persistent storage location on RStudio Connect.

Start at the beginning: What is an ETL process?

ETL stands for Extract, Transform, Load. It can take many forms, and act as a broad term for many types of workflows, but ETL processes are generally automated. Here is a general breakdown of what ETL often looks like with R Markdown and RStudio Connect:

  • Extract: Pull data in from some source location (ex. a database or API)
  • Transform: Use R Markdown code chunks to operate on the data (ex. Perform data cleaning tasks, test/validate data integrity, apply business logic, create new variables or new summaries)
  • Load: Land the transformed data product(s) in a place where it can be accessed by downstream users or applications (ex. another database, shared persistent storage, output file)

As mentioned above, we already have example assets for ETL processes that terminate with pushing data to a database or updating a file in a shared persistent storage location. This article focuses on a third option: creating R Markdown output files. We assume that you’re already familiar with publishing R Markdown documents to RStudio Connect. To learn more about basic publishing workflows, please visit the RStudio Connect User Guide.

Guide: The Basic R Markdown output file ETL

R Markdown output files are described in detail in the RStudio Connect User Guide:

Output files are files that live alongside your rendered report. They could be plots, data files, or other artifacts generated from the R code in your report. Output files will be available via HTTP, and they will be versioned in the same way as your report. Output files are also subject to the same access controls as your report.

This means that we can specify one or more files as rsc_output_files under the rmd_output_metadata: field in the R Markdown YAML header, and those files will be accessible over HTTP on the Connect server. They can then be downloaded, referenced by external systems, or serve as the data input to Shiny applications, R Markdown reports, or other data assets hosted on RStudio Connect.

Follow the steps below to build a lightweight ETL process in R Markdown that leverages output files:

Step 1: Create a new R Markdown document

Set up the YAML header to use rsc_output_files:

---
title: "Output File Framework for R Markdown ETL on RStudio Connect"
output: html_document
rmd_output_metadata:
   rsc_output_files:
      - "data.csv"
---

Step 2: Use R Markdown code chunks to extract and transform data

Every time the report is rendered, it should execute code to pull from a data source and operate on that data to get it into a usable form. Common data sources include databases and APIs (a minimal extract-and-transform sketch follows the list below).

  • Best practices for working with databases can be found at db.rstudio.com
  • The httr package is a good place to start when working with REST APIs and the HTTP protocol
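Here is a minimal extract-and-transform sketch. The DSN, table, and column names are hypothetical placeholders; substitute your own connection details and business logic.

library(DBI)
library(dplyr)

# Extract: connect to the source database and pull the raw records
# ("warehouse", "sales", and the column names are illustrative)
con <- dbConnect(odbc::odbc(), dsn = "warehouse")
raw <- dbGetQuery(con, "SELECT region, amount FROM sales")
dbDisconnect(con)

# Transform: clean the data and build the summary downstream users need
df <- raw %>%
  filter(!is.na(amount)) %>%
  group_by(region) %>%
  summarise(total_sales = sum(amount))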

Step 3: Write the data file

Add an R Markdown code chunk to write the final data frame to a CSV file. This creates the data.csv output file specified in the YAML header (Step 1).

# Write the transformed data frame to the output file declared in the YAML header
write.csv(df, "data.csv", row.names = FALSE)

Note on working with feather files: While you can create output files in the feather format, know that the feather package currently does not support reading a file from a URL. This means that downstream data assets will not be able to call read_feather() on the output file URL.
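One possible workaround (a sketch, not an official recommendation) is to download the feather output file to a local temporary file and then read it from disk. The URL below is illustrative.

# Download the feather output file locally, then read it from disk
tmp <- tempfile(fileext = ".feather")
download.file("https://connect.example.com/content/2352/data.feather",
              tmp, mode = "wb")
df <- feather::read_feather(tmp)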

Step 4 (optional): Add a download link

Make a download link to share the output file from your report using standard Markdown syntax:

#### Here is the data generated from this report: [data.csv](data.csv)

Step 5: Publish

Publish the R Markdown document to RStudio Connect. Make sure to publish with source code so that the report can be re-rendered on a schedule.
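If you prefer a programmatic workflow over push-button publishing, a minimal sketch using the rsconnect package looks like the following; it assumes rsconnect is already configured with your RStudio Connect account, and the file name is illustrative.

# Deploy the R Markdown source so Connect can render it on a schedule
rsconnect::deployDoc("etl-report.Rmd")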

Step 6: Configure Settings - Access and Scheduling

Once the R Markdown document is published to RStudio Connect, use the publisher settings tool to set access controls and the rendering schedule.

Note on Access: The CSV file will be subject to the same authorization as your report. For the most secure experience, keep access controls tight.

Step 7: Access output files from downstream data assets

Assuming strict access controls are not in use, the output file URL should be accessible on the RStudio Connect server. Functions like read.csv() can accept a URL as the file input. In this example, to point a Shiny application at the new data.csv file coming from R Markdown, the necessary code update (in Shiny) is minimal:

# Replace this:
data <- read.csv('data.csv')

# With this:
data <- read.csv('https://colorado.rstudio.com/rsc/content/2352/data.csv')

While this solution is simple, there is still a chance that a long-running R process might not show the most recent data. It's likely a good idea to put a check in place, or use Shiny tools like reactiveFileReader and reactivePoll to monitor for changes. Winston Chang created a nice gist with a code example of that type of application workflow: https://gist.github.com/wch/9652222
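Here is a minimal sketch of that pattern using reactivePoll: the app checks the Last-Modified header on the output file URL at a fixed interval and only re-downloads the CSV when it changes. The URL and polling interval are illustrative.

library(shiny)
library(httr)

# URL of the output file produced by the scheduled report (illustrative)
csv_url <- "https://colorado.rstudio.com/rsc/content/2352/data.csv"

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  data <- reactivePoll(
    60000, session,
    # Cheap check: ask the server when the file last changed
    checkFunc = function() headers(HEAD(csv_url))[["last-modified"]],
    # Expensive step: download and parse the CSV only when it has changed
    valueFunc = function() read.csv(csv_url)
  )
  output$table <- renderTable(data())
}

shinyApp(ui, server)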

One more cool thing

Because output files are versioned along with the rendering of their report, they can be accessed from the History viewer tool.

Adding a data link (optional Step 4) is useful if you anticipate wanting to download those historical versions.