# Auto-Scaling with RStudio Workbench and Kubernetes
RStudio Workbench includes a Kubernetes plugin for running sessions and background jobs in Docker containers on Kubernetes. A common implementation uses a fixed-size architecture, where administrators manually scale the cluster as needed. Some administrators may instead want to automatically scale the number of nodes in their clusters.
Auto-scaling with Kubernetes can help optimize resource usage and costs by automatically scaling your Kubernetes cluster used for RStudio Workbench sessions up and down in line with demand.
However, auto-scaling is an advanced configuration option and is not required for most uses of RStudio Workbench with Kubernetes. It is difficult to properly implement auto-scaling, and improper configuration can result in service disruptions to end users. Before attempting auto-scaling, you should consider the engineering effort to implement auto-scaling, the maintenance burden to monitor and ensure that the cluster is behaving as expected, the cost of failures, and the cost of user frustration. An alternative approach is using a fixed-size architecture.
The following sections describe a fixed-size architecture and an auto-scaled architecture.
## RStudio Workbench - Fixed-Size Architecture (No Auto-Scaling)
In a fixed-size architecture, the number of nodes in the Kubernetes cluster is set to meet anticipated demand, with a buffer to provide additional capacity. The administrator monitors cluster usage and manually increases the number of nodes as needed. Benefits of this architecture pattern include:

- Saves on operational and engineering costs
- Carries fewer user-experience risks
- Is easier to maintain because of its simplicity
The downside of this deployment pattern is that you may have to pay for additional infrastructure that goes unused.
## Risks of Auto-Scaling
Users expect a high level of reliability when connecting to an RStudio Workbench session. For auto-scaling to be successful, it must balance resource usage against operational demand and user experience. With this in mind, there are a few risks you will need to consider:
| Risk | Impact | Cause |
| --- | --- | --- |
| Cluster scales up too slowly | User wait times can be long if session requests frequently trigger resource provisioning. | Provisioning new nodes takes a long time relative to starting sessions on existing nodes. |
| Cluster scales down too quickly | Users lose their stateful sessions and job results; jobs with business value may be killed and need to be restarted. | Nodes are killed before user tasks are completed. |
| Cluster scale-back fails | Excess expense of running idle instances. | Autoscaler failure. |
To set up an auto-scaling architecture successfully, you need the following:
- Configure timeouts in these two sections of the RStudio configuration files:
    - `rsession`: This section provides the configuration for the `rsession.conf` file, which controls the behavior of the RStudio IDE processes and allows you to tune various R session parameters.
    - `jupyter`: This section provides the configuration for the `jupyter.conf` file, which configures Jupyter sessions.
- Add an annotation to the Launcher Kubernetes configuration to prevent automatic eviction of session pods.
    - `launcher.kubernetes.profiles.conf`: This section provides the configuration for the Kubernetes Job Launcher plugin.
- Set up the Cluster Autoscaler.
    - This component automatically adjusts the size of a Kubernetes cluster so that all pods have a place to run. It is third-party software maintained by the Kubernetes project.
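As a concrete sketch of the timeout step above, session timeouts live in the `rsession` and `jupyter` configuration files. The parameter names and values below are illustrative assumptions and should be verified against the admin guide for your Workbench version:

```ini
# /etc/rstudio/rsession.conf -- RStudio IDE sessions
# Suspend idle R sessions after 60 minutes so their pods can be reclaimed
# (parameter name assumed from the Workbench admin guide; verify for your version)
session-timeout-minutes=60

# /etc/rstudio/jupyter.conf -- Jupyter sessions
# Shut down idle Jupyter sessions after a comparable idle period
# (parameter name assumed; verify for your version)
session-cull-minutes=60
```

Shorter timeouts free nodes for scale-down sooner, but suspend users' idle sessions more aggressively; pick values that match how your users actually work.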
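For the eviction step, the Cluster Autoscaler honors a standard `safe-to-evict` annotation on pods. How you attach it to session pods (for example, via the `job-json-overrides` setting in `launcher.kubernetes.profiles.conf`) is described in the Launcher documentation; the fragment below shows only the resulting pod metadata:

```yaml
# Pod metadata for a Workbench session pod; the annotation tells the
# Cluster Autoscaler not to evict this pod when draining an underutilized node.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

Without this annotation, a scale-down event can kill a node that still hosts live sessions, which is exactly the "cluster scales down too quickly" risk described above.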
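When deploying the Cluster Autoscaler itself, scale-down timing deserves the most attention, since it determines how long a node must sit idle before it is removed. The flags below are standard Cluster Autoscaler options, but the cloud provider, node-group name, and values are placeholders for illustration:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec (container command only)
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws             # match your cloud provider
  - --nodes=2:10:workbench-nodes     # min:max:node-group (placeholder name)
  - --scale-down-unneeded-time=20m   # how long a node must be idle before removal
  - --scale-down-delay-after-add=10m # wait after scale-up before considering scale-down
```

Longer scale-down windows cost more in idle capacity but reduce the chance of churning nodes while users are still active.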
More details on the specific components needed for configuration are outlined in the following GitHub repo.