Data persistence
Workstations have a persistent home directory (/home/bench-user/) that enables data, code, and installed applications to persist between stop/start cycles of the workstation. If you save data in the home directory, or install packages with conda, pip, R, Julia, or apt, you can pick up your work right where you left off. Read on to learn about the implementation and implications of this feature.
Persistent home directory design
When you create a workstation, the storage size you select corresponds to the size of the persistent local storage volume (currently an AWS Elastic Block Store volume) that is provisioned for your workstation. This volume stores the home directory (/home/bench-user/) of your workstation, and is also the location of all user-facing applications, conda environments, and configuration files. The exact same volume will be attached to your workstation every time it starts up.
All data, code, and other files saved in the home directory are retained when a workstation stops and starts. Files saved anywhere else are removed.
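As a rough illustration of this rule, the shell sketch below checks whether a path resolves to a location inside the home directory. The function name is_persistent and the PERSISTENT_HOME variable are hypothetical names introduced here, not part of the workstation, and the check assumes GNU realpath, which ships with Ubuntu.

```shell
# Hypothetical helper, not a workstation feature: succeed if a path
# lies inside the persistent home directory.
PERSISTENT_HOME="${PERSISTENT_HOME:-/home/bench-user}"

is_persistent() {
    local target
    # realpath -m normalizes the path (resolving ".." and symlinks)
    # without requiring the path to exist yet.
    target="$(realpath -m -- "$1")"
    case "$target" in
        "$PERSISTENT_HOME"|"$PERSISTENT_HOME"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_persistent /home/bench-user/.local/bin/mytool succeeds, while is_persistent /opt/mytool fails, which suggests the second file would not survive a stop/start cycle.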
Installing packages with conda, Python, R and Julia
All user-facing applications, including conda, Python, R, and Julia, are installed within the persistent home directory. This means that you can install new packages for these applications using the standard methods and they will be persistent.
- conda/mamba: mamba install ...
- Python/pip: pip install ...
- R: install.packages() or BiocManager::install()
- Julia: Pkg.add()
Installing packages from source
Packages installed from source will be persistent as long as the resulting files remain within the home directory. The ~/.local folder, with subfolders bin, lib, and share, is a persistent place to put binaries, libraries, and user data, respectively. You should avoid commands like make install, which may copy files to locations that will not be saved, such as /bin.
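One way to keep a locally built executable inside the home directory is to copy it into ~/.local/bin yourself. The sketch below does this with the coreutils install command; the function name install_local is a hypothetical name introduced for illustration. Many build systems also accept an install prefix (for example, autotools-style ./configure --prefix="$HOME/.local"), but whether a given project honors it is something to check per project.

```shell
# Hypothetical helper: copy a built executable into ~/.local/bin,
# which lives inside the persistent home directory.
install_local() {
    local src dest_dir
    src="$1"
    dest_dir="${HOME}/.local/bin"
    mkdir -p "$dest_dir"
    # install(1) copies the file and sets its permissions in one step.
    install -m 755 "$src" "$dest_dir/"
    # Print the final location so the caller can confirm it.
    printf '%s\n' "$dest_dir/$(basename "$src")"
}
```

After building, a call such as install_local ./build/mytool (path hypothetical) would print the destination under ~/.local/bin. You may also need ~/.local/bin on your PATH for the command to be found.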
Installing packages with the system package manager
The Deep Origin software blueprints are based on Ubuntu, which uses the apt command to install system packages. Unfortunately, packages installed with apt cannot be placed in the home directory. To make these packages persistent, we developed a service that runs in the background on a workstation and saves a log of packages installed with apt install .... Then, when a workstation is stopped and started again, the log is used to restore the configuration of system packages. New packages you installed will be re-installed on startup, and packages you removed will be removed again.
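The internals of this service are not described here. Purely to illustrate the log-and-replay idea, the sketch below assumes a hypothetical log file with one "install <package>" or "remove <package>" entry per line, oldest first, and reduces it to the final action per package. The log format and the function name replay_plan are assumptions made for this example, not the actual implementation.

```shell
# Illustrative only: the log format ("<action> <package>" per line,
# oldest first) is an assumption, not the real service's format.
replay_plan() {
    local log_file
    log_file="$1"
    # Keep only the most recent action recorded for each package,
    # then sort so the output order is stable.
    awk '
        $1 == "install" || $1 == "remove" { last[$2] = $1 }
        END { for (pkg in last) print last[pkg], pkg }
    ' "$log_file" | LC_ALL=C sort
}
```

Each output line names one package and the action that would need to be repeated on startup, which mirrors the behavior described above: installed packages are installed again and removed packages are removed again.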
Packages installed from custom apt repositories and Personal Package Archives (PPAs) are also logged and restored in the same manner. In cases where a key is used to sign an apt repository, you need to ensure the configuration is saved in the home directory. See the example below to learn how you can modify the signing keys for an apt repository.
There are a few considerations to be aware of:
- Additional system packages that you install will be automatically updated to the latest version when they are re-installed during workstation startup. This happens because older versions of packages are not accessible via apt.
- Installing many apt packages will result in longer workstation startup times, as these packages have to be downloaded and installed before a workstation is available for connection.
Packages installed from downloaded .deb files, such as with the command sudo dpkg -i my_package.deb, will NOT be persistent.
Verifying packages installed from third-party package repositories
To validate downloaded packages, Ubuntu can use cryptographic keys. We will use the example of installing Redis to show how to install this package in a persistent manner. Here, we simply place the gpg key in the ~/.local/share folder, and update the references to the key in the following commands.
Corrected commands (signing key stored in the home directory):
sudo apt install --yes --no-install-recommends gpg
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /home/bench-user/.local/share/redis-archive-keyring.gpg
echo "deb [signed-by=/home/bench-user/.local/share/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install --yes --no-install-recommends redis
Original commands (signing key stored outside the home directory):
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/redis-archive-keyring.gpg
echo "deb [signed-by=/etc/apt/trusted.gpg.d/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install --yes --no-install-recommends redis
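To double-check that a repository entry points at a key inside the home directory, something like the following could be used. The function names signed_by_path and key_is_persistent are hypothetical, and the sed expression assumes a file containing a single one-line deb entry with a signed-by option, as in the Redis commands above.

```shell
# Print the signed-by= value from a one-line "deb [...] ..." entry.
signed_by_path() {
    sed -n 's/.*signed-by=\([^] ]*\).*/\1/p' "$1"
}

# Hypothetical check: succeed if the key referenced by the entry sits
# under the home directory (second argument, default /home/bench-user).
key_is_persistent() {
    local key home_dir
    home_dir="${2:-/home/bench-user}"
    key="$(signed_by_path "$1")"
    case "$key" in
        "$home_dir"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, key_is_persistent /etc/apt/sources.list.d/redis.list should succeed after running the corrected commands and fail after running the original ones.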
Additional features and support
Persistence is a rapidly evolving feature. If you encounter unexpected behavior or have a feature request, please submit a support request from the Deep Origin OS or by contacting customer support.
Technical details
Workstations are based on Docker containers. This allows the Deep Origin team to rapidly develop new software blueprints, ensures that software blueprints can run on virtually any infrastructure, and allows external developers to contribute new Docker images to run on workstations. However, Docker containers are typically used when the state of the container does not need to be persisted. Running a web app, where user data is stored to a database, or executing a bioinformatics pipeline, where results are saved to cloud storage, are two such examples.
In the case of a workstation, the state of the container contains a lot of information. User-installed packages, intermediate analysis results, and code changes are a few such examples. These modifications to the state of the container are typically ephemeral, which is why you have to install the same packages each time you start an environment on most other data science platforms. Typically, persistent package installations and data storage are associated with virtual machines, such as AWS EC2 instances.
At Deep Origin, we developed a solution that allows a workstation to retain the benefits of a Docker container and the persistence of a virtual machine. This makes a workstation feel more like a machine that you own - any configuration and settings changes you make to the workstation will remain after stop/start cycles of the workstation.