Loading remote files onto a workstation
A scientist often needs to interface with data that resides in locations outside of the Deep Origin OS. Some common use cases are receiving sequencing datasets from contract research organizations (CROs), receiving datasets from collaborators, and downloading data from public databases. Here, we will cover several common ways to get data from external systems onto a Deep Origin workstation.
Data stored in S3-compatible buckets
Data stored in Amazon S3
To copy data to your workstation from an AWS S3 bucket, you need an "AWS Access Key ID" and "AWS Secret Access Key" from the external party. You should also know the uniform resource identifier (URI) of the file or directory you want to copy, such as s3://demo-data/sequencing/run_1.fastq.gz.
Configure AWS credentials
- To set up credentials to access the external S3 bucket, run aws configure from a terminal within your workstation. If you already have AWS credentials configured and want to store the new credentials under a separate profile, run aws configure --profile {NEW_PROFILE}, replacing {NEW_PROFILE} with your desired profile name.
- Following the prompts, paste in the "AWS Access Key ID" and "AWS Secret Access Key" given to you by the external party.
- You can press "enter" to skip the last two questions (default region name and output format). A sample session is shown below.
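For example, a typical session looks like this, assuming no credentials were previously configured (the key values shown are placeholders):

aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]:
Default output format [None]: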
The credentials you add will be reflected in the file ~/.aws/credentials.
Copy the data to your workstation
We recommend using s5cmd, a file copy tool that is faster than the AWS CLI. Run s5cmd --help or s5cmd cp --help to see usage help for the tool as a whole or for a specific subcommand. Below are a few examples using s5cmd. You can run any command with the --dry-run flag to print the operations that would be executed without modifying any files.
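For example, to preview a copy without transferring any files:

s5cmd --dry-run cp s3://demo-data/sequencing/*.gz /home/bench-user/sequencing/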
To copy a single file from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd cp s3://demo-data/sequencing/run_1.fastq.gz /home/bench-user/sequencing/
To copy all .gz files from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd cp s3://demo-data/sequencing/*.gz /home/bench-user/sequencing/
The sync command synchronizes items between the source and destination: it copies files that are missing from the destination and overwrites files that already exist in the destination if their size or modification time differs. For more information, please see the s5cmd documentation. To sync the directory "sequencing" from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd sync s3://demo-data/sequencing/* /home/bench-user/sequencing/
Using multiple sets of AWS credentials
You can store multiple sets of credentials, or profiles, for accessing files in different buckets. Each profile must have a unique name. To use the "sequencing-CRO" profile in the file copy command above:
s5cmd --profile sequencing-CRO cp s3://demo-data/sequencing/run_1.fastq.gz /home/bench-user/sequencing/
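Each profile corresponds to a named section in ~/.aws/credentials. For example, after running aws configure --profile sequencing-CRO, the file will contain a section like this (the key values are placeholders):

[sequencing-CRO]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx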
Using tools in Python, R, and Julia
Many programming languages provide packages for interacting with AWS S3. These packages are installed in software blueprints that include the respective programming language. Please see the following pages for documentation on how to interface with S3 using these packages.
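As a minimal sketch, assuming the blueprint includes Python with the boto3 package, downloading the example file from above looks like this:

import boto3

# Use a named profile from ~/.aws/credentials; omit profile_name to use the default profile
session = boto3.Session(profile_name="sequencing-CRO")
s3 = session.client("s3")

# Download s3://demo-data/sequencing/run_1.fastq.gz to the workstation
s3.download_file(
    Bucket="demo-data",
    Key="sequencing/run_1.fastq.gz",
    Filename="/home/bench-user/sequencing/run_1.fastq.gz",
)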
Using workstation endpoints
Some workstation endpoints also provide a convenient way to interface with AWS S3. For example, the jupyterlab-s3-browser is installed in all workstations and provides a graphical interface for managing files. VS Code users can install the VS Code AWS Toolkit to interact with S3 and other AWS services.
Data stored in Google Cloud Storage, Wasabi, and other S3-compatible sources
Other cloud providers, such as Google Cloud Storage and Wasabi, provide an S3-compatible service that can be accessed with similar commands. You need to add credentials and pass the proper --endpoint-url argument to your s5cmd command.
For example, with Wasabi, the first command in the above section becomes:
s5cmd --endpoint-url https://s3.wasabisys.com cp \
s3://wasabi-demo-data/sequencing/run_1.fastq.gz \
/home/bench-user/sequencing/
For Google Cloud Storage, add --endpoint-url https://storage.googleapis.com to the command, and see the s5cmd documentation for information on configuring credentials.
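For example, with a hypothetical Google Cloud Storage bucket named gcs-demo-data, the single-file copy becomes:

s5cmd --endpoint-url https://storage.googleapis.com cp \
s3://gcs-demo-data/sequencing/run_1.fastq.gz \
/home/bench-user/sequencing/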