Dev/Test/Prod with RStudio Team
It is common for an analysis project to lead into a second phase, in which one or several data products are developed. A data product could be a dashboard, report, API, or ETL process. It takes the insights gathered during the analysis phase and makes them available to stakeholders on a permanent basis.
Unlike the experimental nature of a data analysis, a data product has to work consistently when consumed. This means that the code for the data product will need to be developed in a more formal manner. Development can occur in three basic stages:
- The product is developed and tested by the developer
- One, or a few, stakeholders test the product for functionality
- The product is made available to all stakeholders
Each of these stages occurs in a separate environment, respectively referred to as:
- Development (Dev)
- Test
- Production (Prod)
After the product is successfully tested in each stage, the code is then promoted to the next stage.
Code promotion in RStudio Team
In RStudio Team, code development and testing are done within two of our products. As illustrated in this section's diagram, development and unit testing happen in RStudio Workbench. Only developers, such as data scientists or data analysts, need access to RStudio Workbench. They perform unit testing of the product before making it available to other stakeholders.
Once the data product is ready for review, the developer deploys it to RStudio Connect. The stakeholders who are responsible for making sure that the product works as expected are then able to access it via RStudio Connect. This is called User Acceptance Testing (UAT).
The data product may depend on external assets, such as databases or shared drives. It is important to make sure that they are still accessible to the data product once it is deployed to RStudio Connect. This is called integration testing.
After all testing is completed, the data product is made available to all stakeholders for consumption. In some cases, when the data product is a script that performs data transformation, or ETL, the last stage is to also schedule how often the script runs. These steps are completed within RStudio Connect.
Deployment with RStudio Connect
There are a few ways to deploy content to RStudio Connect. By deployment, we mean moving the code, the dependent files, and the metadata concerning R and R packages that the data product uses. To learn about available options to deploy to RStudio Connect, see our article on deployments.
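As a rough illustration of one of those options, the sketch below uses the `rsconnect` R package to publish from an R session. The directory `my-dashboard`, the content name, and the server name `connect-test` are hypothetical placeholders, and the sketch assumes that the server and an account on it have already been registered with `rsconnect` on the developer's machine.

```r
library(rsconnect)

# Hypothetical project directory holding the data product's code and files.
app_dir <- "my-dashboard"

# deployApp() bundles the files in app_dir together with a manifest that
# describes the R version and the R packages in use, then uploads the bundle.
# "connect-test" is an assumed name for a server registered beforehand.
deployApp(
  appDir  = app_dir,
  appName = "my-dashboard",
  server  = "connect-test"
)
```

Deploying from a dedicated project directory keeps the bundle limited to the files the data product actually needs.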
RStudio Package Manager
Here are two scenarios in which RStudio Package Manager is needed for a successful promotion of code:
- Some organizations do not allow servers to have access to the Internet. Actions such as patching and upgrades are performed offline. This is called an air-gapped environment. It means that RStudio Workbench and RStudio Connect are not able to download packages on demand. RStudio Package Manager allows someone in the enterprise to download CRAN packages manually and then perform the update offline. RStudio Package Manager becomes the source of packages for the other two products.
- Most organizations use a combination of RStudio Workbench and the open-source desktop version of RStudio, called RStudio Desktop. The package sources that software running on someone's laptop can reach may differ from those that a central server can reach. Using RStudio Package Manager ensures that both are able to access the exact same packages.
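One way to point an R session at RStudio Package Manager is through the `repos` option, sketched below. The URL is a hypothetical placeholder; the actual address and repository name depend on how Package Manager is configured in your organization.

```r
# Hypothetical Package Manager URL; replace it with your organization's address.
options(repos = c(CRAN = "https://packagemanager.example.com/cran/latest"))

# Confirm which repository the session will use.
getOption("repos")

# Packages are now requested from Package Manager instead of a public CRAN mirror.
install.packages("dplyr")
```

Placing the `options()` call in a user's `.Rprofile`, or in the site-wide `Rprofile.site` on a server, applies the setting to every new R session, so RStudio Desktop and RStudio Workbench can resolve packages from the same source.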
We recommend that each RStudio Team product is installed in its own, independent server environment. Here, a server environment refers to a single server or a cluster of multiple servers, such as those used to provide High Availability. There should be at minimum three server environments, one per product. In this mode, the Test and Production stages occur in the same server environment.
Separate Test and Production
A preferable setup may be to have separate server environments for Test and Production. This ensures that resources needed to serve data products already in Production are not impacted by ongoing tests. Another reason to have separate server environments is to limit who can publish data products to Production. For example, the R developer is able to deploy a data product to the Test server environment, but needs to request that I.T. deploy the final product to the Production server environment. That ensures that no changes are made to the official version of the data product that were not fully tested and approved.
Testing server upgrades
Eventually, the servers themselves will need to be patched or upgraded. For example, the RStudio software installed on the server may need to be upgraded. Before upgrading the servers used for code development and deployment, it is a good idea to test the changes in separate server environments. These are called staging servers. They are meant to mirror the servers that are in regular use. The staging servers are infrequently used, and usually only I.T. and maybe some R developers have access to them. Their only purpose is to confirm that software upgrades were successful.
Why not a cron job inside RStudio Workbench?
There are cases when an R script needs to run on a regular basis for the foreseeable future. It is very common that, over time, those scripts grow in both number and importance. Depending on a single developer to run all of the scripts becomes a problem. The solution is to automate the scripts.
Please be aware that, at this point, those scripts are no longer considered to be "in-development". When the enterprise, or a team in the enterprise, depends on these scripts to run on a regular and consistent basis, they are Production scripts. As such, they should be moved to RStudio Connect.
There is also a practical reason to move the scripts to RStudio Connect. A cron job depends on the same user account, the same version of R, and the same package versions remaining in place every time the script runs. RStudio Connect handles all the dependencies and the scheduling in a safe and consistent manner.
RStudio Connect isolates each data product that is deployed to it, so there are no issues when some data products use one version of a given package while other data products use a different version of the same package. RStudio Connect makes sure that no package version collisions occur.