Loading remote files onto a workstation
A scientist often needs to interface with data that resides in locations outside of the Deep Origin OS. Some common use cases are receiving sequencing datasets from contract research organizations (CROs), receiving datasets from collaborators, and downloading data from public databases. Here, we will cover several common ways to get data from external systems onto a Deep Origin workstation.
Data stored in S3-compatible buckets
Data stored in Amazon S3
To copy data to your workstation from an AWS S3 bucket, you need an "AWS Access Key ID" and "AWS Secret Access Key" from the external party. You should also know the uniform resource identifier (URI) of the file or directory you want to copy, such as s3://demo-data/sequencing/run_1.fastq.gz.
Configure AWS credentials
- To set up credentials to access the external S3 bucket, run aws configure from a terminal within your workstation. If you already have AWS credentials configured and want to store the new credentials under a separate profile, run aws configure --profile {NEW_PROFILE}, replacing {NEW_PROFILE} with your desired profile name.
- Following the prompts, paste in the "AWS Access Key ID" and "AWS Secret Access Key" given to you by the external party.
- You can press "enter" to skip the last two questions (default region name and output format). A sample session is shown below.
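For example, a typical session looks like this, assuming no credentials were previously configured (the key values shown are placeholders):

aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]:
Default output format [None]: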
The credentials you add will be reflected in the file ~/.aws/credentials.
Copy the data to your workstation
We recommend using s5cmd, a file copy tool that is faster than the AWS CLI. Run s5cmd --help or s5cmd cp --help to see usage help for the tool as a whole or for a specific subcommand. Below are a few examples using s5cmd. You can run any command with the --dry-run flag to print the operations that would be executed without modifying any files.
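For example, to preview a copy without transferring any files:

s5cmd --dry-run cp s3://demo-data/sequencing/*.gz /home/bench-user/sequencing/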
To copy a single file from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd cp s3://demo-data/sequencing/run_1.fastq.gz /home/bench-user/sequencing/
To copy all .gz files from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd cp s3://demo-data/sequencing/*.gz /home/bench-user/sequencing/
The sync command synchronizes items between the source and destination: it copies files that are missing from the destination and overwrites files that already exist in the destination if their size or modification time differs. For more information, please see the s5cmd documentation. To sync the directory "sequencing" from the external bucket to a folder called "sequencing" in the home directory of your workstation:
s5cmd sync s3://demo-data/sequencing/* /home/bench-user/sequencing/
Using multiple sets of AWS credentials
You can store multiple sets of credentials, or profiles, for accessing files in different buckets. Each profile must have a unique name. To use the "sequencing-CRO" profile in the file copy command above:
s5cmd --profile sequencing-CRO cp s3://demo-data/sequencing/run_1.fastq.gz /home/bench-user/sequencing/
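Each profile corresponds to a named section in ~/.aws/credentials. For example, after running aws configure --profile sequencing-CRO, the file will contain a section like this (the key values are placeholders):

[sequencing-CRO]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx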
Using tools in Python, R, and Julia
Many programming languages provide packages for interacting with AWS S3. These packages are installed in software blueprints that include the respective programming language. Please see the following pages for documentation on how to interface with S3 using these packages.
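As a minimal sketch, assuming the blueprint includes Python with the boto3 package, downloading the example file from above looks like this:

import boto3

# Use a named profile from ~/.aws/credentials; omit profile_name to use the default profile
session = boto3.Session(profile_name="sequencing-CRO")
s3 = session.client("s3")

# Download s3://demo-data/sequencing/run_1.fastq.gz to the workstation
s3.download_file(
    Bucket="demo-data",
    Key="sequencing/run_1.fastq.gz",
    Filename="/home/bench-user/sequencing/run_1.fastq.gz",
)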
Using workstation endpoints
Some workstation endpoints also provide a convenient way to interface with AWS S3. For example, the jupyterlab-s3-browser is installed in all workstations and provides a graphical interface for managing files. VS Code users can install the VS Code AWS Toolkit to interact with S3 and other AWS services.
Data stored in Google Cloud Storage, Wasabi, and other S3-compatible sources
Other cloud providers, such as Google Cloud Storage and Wasabi, provide an S3-compatible service that can be accessed with similar commands. You need to add credentials and pass the proper --endpoint-url argument to your s5cmd command.
For example, with Wasabi, the first command in the above section becomes:
s5cmd --endpoint-url https://s3.wasabisys.com cp \
s3://wasabi-demo-data/sequencing/run_1.fastq.gz \
/home/bench-user/sequencing/
For Google Cloud Storage, add --endpoint-url https://storage.googleapis.com to the command, and see the s5cmd documentation for information on configuring credentials.
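For example, with a hypothetical Google Cloud Storage bucket named gcs-demo-data, the single-file copy becomes:

s5cmd --endpoint-url https://storage.googleapis.com cp \
s3://gcs-demo-data/sequencing/run_1.fastq.gz \
/home/bench-user/sequencing/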