Snapshot and Restore
The snapshot and restore strategy fits when package access is open and users are responsible for reproducibility. This strategy is the most relevant for individual data scientists. The strategy has two key characteristics:
- Users are able to freely access and install packages for a project
- Users have the full responsibility to record the dependencies needed for a project
The strategy is implemented with the following steps:
- Start a project by creating a project-specific library
- Install and use packages from the project-specific library
- Record the state of the library alongside the code
- Restore the library when the environment needs to be recreated
A potential drawback of this strategy is the involvement required from the R user. For new users, these steps can create an energy barrier that prevents them from being successful. Often organizations will start new users (e.g. Excel converts) with a different strategy, and allow power R users the flexibility and responsibility of this strategy.
graph LR
A[Create Project]
B[Write Code]
C[Install Packages]
D[Snapshot]
E[Restore]
A --> B
B <--> C
B --> D
D --> E
- (Administrators) Install each desired version of R.
- (Users) Install the
Step 1: Initialize a Project#
A key to package management is to isolate projects from one another. This allows you to upgrade or add packages for one project without breaking other work. Whether you are in an existing project or starting a new project, use:
Behind the scenes, renv works by creating a new library. A library stores installed packages.
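The command this step refers to is not shown above. A minimal sketch, assuming the renv package is already installed and the working directory of the R session is the project root, might look like this:

```r
# Assumes renv is installed (run install.packages("renv") first if it is not)
# and that the working directory is the root of the project.
# Creates a project-specific library and an initial renv.lock file.
renv::init()
```

After this call, packages installed from within the project go into the project library rather than the user or system library.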
Step 2: Install and Use Packages#
With the project configured, you can now install and use packages. There are three ways to install packages:
- pak::pkg_install if you're installing interactively.
- remotes::install_* if you're scripting the install (e.g. in a Docker container).
- install.packages as a fallback option.
# You can use install.packages
install.packages('ggplot2')

# But we recommend using pak in interactive settings
pak::pkg_install('ggplot2')

# Or use remotes if you're working on an automated script or
# in a lightweight environment like Docker
remotes::install_cran('ggplot2')
Use packages just how you normally would!
Step 3: Snapshot the Environment#
Once you are ready to share your work, or you are finished with a project, you'll want to make a record of the current environment.
This step creates a new file in your project titled renv.lock. The file contains all the information you need to communicate your project's dependencies at the moment you call snapshot. The next time you call snapshot, the file will be updated.
If you are familiar with version control for your code, we recommend calling snapshot any time you push or check in changes to your code. The renv::history and renv::revert commands make it easy to navigate and restore prior versions of the lock file.
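A minimal sketch of this step, assuming the project was initialized with renv. The last three calls additionally assume that renv.lock is committed to a Git repository, and the commit hash shown is a placeholder rather than a real value:

```r
# Record the packages (and their versions) the project currently uses in renv.lock
renv::snapshot()

# Report whether the project library and the lock file have drifted apart
renv::status()

# The calls below assume renv.lock is tracked in Git.
# List the commits in which renv.lock changed
renv::history()

# Bring back renv.lock as it was at an earlier commit.
# "abc1234" is a placeholder; use a hash reported by renv::history().
renv::revert(commit = "abc1234")

# Reinstall the packages listed in the reverted lock file
renv::restore()
```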
Step 4: Recreate the Environment#
This step is where the work above pays off! If you need to share your work with others, or need to roll back changes to get back to a working library, cash in by using restore:
# open the project, and use
renv::restore()
renv will recreate the package environment for you, and you'll be back to working on R code instead of troubleshooting problems!
Implementing the Snapshot Strategy in Production#
In some organizations, you may only want to worry about recording project dependencies when a project is ready for production. Generating a manifest of dependencies can be the first step in a deployment hand-off between a development environment and a production deployment. Learn more about snapshotting for production.
If you are using RStudio Connect, the snapshot strategy is applied automatically when content is deployed.
Common Challenges and Resolutions:#
Versions of R#
To ensure the library is restorable, you'll need to record and make available the same version of R used during development. The renv package automatically records the version of R used by a project. We recommend having multiple versions of R available, so that users can pick the version of R and then restore. This approach is also an effective way to test if a project is ready to upgrade to a new version of R.
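As an illustration, the recorded R version can be compared with the running one before restoring. This is a sketch, not part of renv: check_r_version is a hypothetical helper, and it assumes the jsonlite package is installed and that the lock file stores the version under R, Version, as renv.lock files do.

```r
# Hypothetical helper: compare the R version recorded in a lock file with
# the version of the running session. Assumes jsonlite is installed.
check_r_version <- function(lockfile = "renv.lock") {
  lock <- jsonlite::fromJSON(lockfile)       # renv.lock is a JSON file
  recorded <- lock$R$Version                 # e.g. "4.3.1"
  running <- as.character(getRversion())
  matches <- identical(recorded, running)
  if (!matches) {
    warning(
      sprintf("Lock file was created with R %s, but this session runs R %s.",
              recorded, running),
      call. = FALSE
    )
  }
  invisible(matches)
}
```

Calling check_r_version() from the project root returns TRUE invisibly when the versions match and warns otherwise, which is a cheap check to run before renv::restore().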
Non-Current CRAN Packages#
Often, by the time a project is restored, some of the packages in use may have been updated on CRAN. For example:
- On January 1st, a project manifest is committed that records ISLR version 1.0 as a dependency.
- On February 1st, the ISLR package is upgraded to a newer version on CRAN.
- On March 1st, a user wishes to restore the environment.
In this case, it is critical that version 1.0 of ISLR is used in the restored environment. To make this happen, the older version of the package needs to be accessed and installed. Luckily, this is possible using a repository's archive. Internal repositories used to support the snapshot strategy must record archived versions. RStudio Package Manager is an easy way to ensure your internal repository handles this case appropriately.
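When restoring from a lock file, renv requests the recorded versions for you. For illustration, an older version can also be requested by hand. This sketch assumes the renv and remotes packages are installed and that the configured repository still serves the requested version from its archive; the version number simply follows the example above.

```r
# Install a specific, possibly archived, version into the project library
renv::install("ISLR@1.0")

# A comparable call with remotes, for scripts that do not use renv
remotes::install_version("ISLR", version = "1.0")
```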
If your package is publicly available, tools like renv will work automatically. If you wish to use the snapshot strategy along with internal packages (packages that are not publicly available on CRAN nor in a public Git repository), it is easiest to store and source those internal packages in a CRAN-like repository. Follow these steps:
- Release the internal package to the CRAN-like repository
- Install and use the package in the project, installing from the repository
- Record the project dependencies
- Restore the project by accessing the appropriate version of the package from the CRAN-like repository
It is critical that older versions of the internal package are appropriately stored in the repository's archive. The easiest way to create a correct internal repository, distribute internal packages, and support the snapshot strategy is to use RStudio Package Manager.
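One way to make an internal repository available to a project is to add it to the repos option before installing. This is a sketch only: the INTERNAL name and its URL are placeholders, not real endpoints.

```r
# Point R at an internal CRAN-like repository in addition to CRAN.
# The INTERNAL URL is a placeholder; replace it with your organization's.
options(repos = c(
  INTERNAL = "https://packages.example.com/internal/latest",
  CRAN     = "https://cloud.r-project.org"
))

# Show the repositories that install.packages and renv will now consult
getOption("repos")
```

Because renv records the active repositories in renv.lock when a snapshot is taken, a later restore can look for the internal package in the same place.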
Multi-Lingual Projects (Python)#
If your project uses more than R, you'll need to capture the project's other dependencies as well. A common scenario is a reticulated project that uses Python and R. In this case, one option is to combine renv with a Python package management tool like virtualenv:
- Use renv as described previously to manage R packages
- Use a virtualenv to isolate project Python dependencies
- Record the state of the virtualenv using pip freeze > requirements.txt
- On restore, recreate the Python virtualenv and then reinstall the packages listed in requirements.txt
To automate some of these steps, take advantage of renv's Python integration.
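One way to automate these steps from R is the renv::use_python function. A minimal sketch, assuming the project already uses renv and that a Python installation is available on the machine:

```r
# Ask renv to create and manage a project-local Python virtual environment.
# Assumes Python is installed and can be found by renv.
renv::use_python(type = "virtualenv")

# With the Python integration active, snapshot also records the Python
# packages in requirements.txt alongside renv.lock
renv::snapshot()

# On another machine, restore reinstalls both the R and the Python packages
renv::restore()
```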
A common challenge in the snapshot and restore approach is that each project relies on an isolated library. Naively, this would mean each project library would start empty and users would have to re-install their desired packages. In practice, this naive approach is slow, especially on systems where packages must be compiled.
To solve this problem, implementations of the snapshot and restore strategy should rely on a package cache or a repository that serves binaries for the operating systems in use. renv creates a cache for each user. This means if two projects use ggplot2 version 3.1.0, the user will only need to install version 3.1.0 once. A repository that serves binaries accomplishes the same result, effectively caching installed packages for all users!
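To see where these pieces live on a given machine, renv exposes its paths. A short sketch, assuming renv is installed and the session is running inside an renv project:

```r
# Location of the per-user package cache shared across projects
renv::paths$cache()

# Location of the library used by the current project
renv::paths$library()
```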
Often restoring a project on a different computer or a new system can take time because the necessary packages may not be cached. This challenge is especially prevalent if the project uses non-current CRAN packages, because these packages do not usually have a binary version available in a repository.
Unfortunately, many organizations and platforms assume using Docker will give them the benefits of reproducibility. The good news is that Docker does a great job isolating project dependencies. The bad news is that Docker does not record the versions of project dependencies. Luckily, Docker can be used with the snapshot and restore strategy. For example, say you wanted to use Docker to execute an ETL job:
FROM ubuntu
...
RUN git clone https://github.com/me/etl-project.git
# Run the restore from inside the cloned project so renv can find renv.lock
RUN cd etl-project && R -e 'renv::restore()'
CMD <some process>