Scheduling Data Science Tasks
After creating some amazing artifact, it is very common for data scientists to worry about how to keep it updated. Dashboards and reports need to show the latest data, models need to be retrained, and sometimes end users will even request regular notifications, such as emails.
The good news is that data scientists work in code, so it is possible to automate most of these update tasks. A naive approach would be to manually re-run a script every time an update is needed. To avoid manual labor, a variety of solutions exist. These solutions fall on a spectrum, from simple solutions with limited capabilities to more robust solutions that offer flexibility but a steeper learning curve. This article outlines a number of options.
Scheduling content to run on your desktop is the easiest approach. On Windows machines, the Windows Task Scheduler can be used to execute R code on a schedule or following an event. The taskscheduleR package provides an R-specific wrapper. On Mac and Linux, the cron utility can be used, and the cronR package provides a helpful wrapper.
The main limitation of scheduling on a laptop or desktop is downtime. Most people do not leave their laptops running indefinitely, and local workstations are often restarted for updates. These interruptions can cause scheduled runs to be missed.
While scheduling on a desktop is widely available, it requires significant work to monitor schedules, ensure the correct software and packages are available, and capture success or failure logs.
- Widely available and easy to get started
- Laptops/desktops are frequently off or offline, interrupting schedules
- All environment setup and logging must be built out manually
cron on a Server
A step up from scheduling tasks on a local machine is scheduling them on a server. For Linux servers, the cron utility is widely available and very flexible. Schedules are defined in a crontab file, and typically the schedule will instruct the server to execute a shell script. These shell scripts provide total flexibility, allowing you to set up an environment, execute code, log side effects, and more. cron also allows you to specify where log output and errors should go.
    # run the script run.sh every day at 11am
    00 11 * * * run.sh

    # sample run.sh
    /opt/R/3.6.3/bin/R -f 'update.R'
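Because cron leaves logging and error handling to the script, a slightly fuller wrapper is often worth writing. The following is a minimal sketch, not a recommended standard: the function name `run_logged`, the log format, and the paths in the commented example are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command, append its output to a log file,
# and record a timestamped START / OK / FAILED line around it.
run_logged() {
    log_file="$1"
    shift
    printf '%s START %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$log_file"
    # Capture stdout and stderr; remember the exit status of the command.
    if "$@" >> "$log_file" 2>&1; then
        status=0
    else
        status=$?
    fi
    if [ "$status" -eq 0 ]; then
        printf '%s OK\n' "$(date '+%Y-%m-%dT%H:%M:%S')" >> "$log_file"
    else
        printf '%s FAILED (exit %s)\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" >> "$log_file"
    fi
    return "$status"
}

# Example call (placeholder paths, adjust to your server):
# run_logged /path/to/update.log /opt/R/3.6.3/bin/Rscript /path/to/update.R
```

cron runs jobs with a minimal environment, so absolute paths are usually safer in both the crontab entry and the script. Output that the script itself does not capture can be redirected in the crontab line, for example `00 11 * * * /path/to/run.sh >> /path/to/cron.log 2>&1`.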
A common gotcha for running R jobs on a server is package management. If you set up a script to run, it is important that the script have access to the correct packages. It can be very easy to forget about these scripts when updating packages for other projects, leading to unexpected errors. Or, you can find yourself in a situation where one script requires specific versions of a package that differ from those required by another script. While there are workarounds for these problems, such as the renv package, it is critical to consider package dependencies for long-term stability.
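One way to apply the renv workaround in a scheduled job is to restore the project's locked package versions before each run. The sketch below assumes the project directory already contains an `renv.lock` file created with `renv::snapshot()`; the function name `run_in_project`, the `RSCRIPT` override variable, and the example paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run an R script inside an renv-managed project.
# RSCRIPT can point at a specific Rscript binary; it defaults to the one on PATH.
run_in_project() {
    project_dir="$1"
    script="$2"
    rscript="${RSCRIPT:-Rscript}"
    (
        # Work inside the project so renv's project library is used.
        cd "$project_dir" || exit 1
        if [ ! -f renv.lock ]; then
            echo "renv.lock not found in $project_dir" >&2
            exit 1
        fi
        # Install the package versions recorded in renv.lock, then run the script.
        "$rscript" -e 'renv::restore(prompt = FALSE)' && "$rscript" "$script"
    )
}

# Example call (placeholder paths):
# RSCRIPT=/opt/R/3.6.3/bin/Rscript run_in_project /path/to/project update.R
```

Keeping one lockfile per project means that upgrading packages for another project does not change what this scheduled script sees.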
- Widely available on most Linux servers
- Very flexible
- Servers, unlike local workstations, tend to have higher up-time and more robust processes for down-time
- Requires the user to handle everything: logging, environment setup, error handling
Scheduling on RStudio Connect
RStudio Connect sits in a sweet spot for data science scheduling. It provides flexibility and robustness, while remaining easy to use. Data scientists are able to write code in either R Markdown documents or Jupyter notebooks, and then publish and schedule those on RStudio Connect.
During the publication process, RStudio Connect automatically handles package dependencies. Once content is scheduled, RStudio Connect ensures it is run, captures logs, sends emails on errors, and maintains a versioned history of prior runs. Additionally, because the scheduled code lives in a notebook, it is easy to document its purpose in place. Data scientists can even customize email notifications if they would like to receive updates, such as a summary of newly processed data.
View the sample projects and code for more details.
- Automatically handles package dependencies, logs, custom email notifications, error emails, and versioned history
- Highly accessible to data scientists through one-click publishing or Git-backed deployment
- Scheduled notebooks are all independent; it is not possible to define a sequence of scheduled tasks
- Execution is limited to the servers where RStudio Connect is installed
Using an External Scheduler
On the far end of the spectrum is the category of dedicated scheduling software. Examples include Luigi, Airflow, Oozie, Jenkins, and many others. These tools vary in their features and intent.
Most of these options require a dedicated application support team to ensure they are correctly configured and regularly updated, though cloud vendors often offer these tools as hosted services.
Most of these tools have robust support for scheduling directed acyclic graphs (DAGs), which allow users to specify dependencies or sequences of tasks. Often these tools take advantage of caching and allow for seamless re-runs or backfills.
Many of these tools offer support for flexible scaling, coordinating execution across multiple servers or in environments like Hadoop or Kubernetes.
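To make the idea of a task graph concrete, the following sketch uses the standard `tsort` utility to turn a list of dependencies into a valid run order. It is only an illustration of dependency ordering, not how any of the tools above are configured, and the task names are hypothetical. Dedicated schedulers add much more on top, such as parallel execution, retries, and backfills.

```shell
#!/bin/sh
# Each line is "upstream downstream": the task on the right depends on
# the task on the left. Task names are made up for this example.
dag_edges() {
    cat <<'EOF'
extract clean
clean train
clean report
train notify
report notify
EOF
}

# tsort prints every task once, in an order that respects all dependencies.
run_order() {
    dag_edges | tsort
}

# Run the tasks one at a time in dependency order.
run_order | while read -r task; do
    echo "running $task"   # placeholder for the real work, e.g. an R script per task
done
```

In this graph `extract` always runs first and `notify` always runs last; the relative order of `train` and `report` is not fixed, because neither depends on the other.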
In addition to learning these tools, data scientists will often need to account for and manage package dependencies themselves, especially for R workflows.
- Most flexible and complete feature set, including support for DAGs, multiple execution backends, re-runs, backfills, and more
- Typically require dedicated application support
- Steep learning curve
- Manual package management strategy for R workflows