sda-dive to the Rocket cluster

User manual for downloading data from FEGA to the UTHPC cluster.

Data download consists of two separate tasks:

  1. Apply for data access.
  2. File download.

Apply for data access

  • Log in to REMS.
  • Browse the Catalogue for datasets.
  • When you find the dataset you want, press the Add to cart button at the end of the dataset entry. You can add several datasets and request access to all of them at the same time.
  • The cart is shown at the top of the page.
  • When you have finished adding datasets, press the Apply button at the bottom of the cart.
  • The data request form opens. Review it and, if everything is correct, click Send application.
  • Contact Vlad or Tommy, who can review the application and approve data access.

Generate a C4GH key pair

  • Log in to the UTHPC cluster with your account.
  • All terminal commands must be run in the terminal window of the UTHPC cluster.
  • To activate the crypt4gh command-line tool, type in the terminal window:
    module load crypt4gh/1.8.5
    
  • Run the following command to generate the key pair:

    crypt4gh generate -n <c4gh_name>
    
    Where:

    • <c4gh_name> is the name of your crypt4gh key file.
  • You will be asked to enter a passphrase. Setting one is highly recommended; save it for later use, because you will be asked for it again when decrypting files.

  • The command produces two files: <c4gh_name>.sec.pem, the crypt4gh private key, and <c4gh_name>.pub.pem, the crypt4gh public key.
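
Before moving on, it can be worth confirming that both key files exist and that the private key is readable only by you. The following is a minimal sketch, assuming the key pair was written to the current directory; check_c4gh_keys is a hypothetical helper name and is not part of crypt4gh.

```shell
# Hypothetical helper (not part of crypt4gh): confirm that both key files
# exist and restrict the private key to its owner.
# Usage: check_c4gh_keys <c4gh_name>
check_c4gh_keys() {
    name="$1"
    for f in "$name.sec.pem" "$name.pub.pem"; do
        if [ ! -f "$f" ]; then
            echo "missing key file: $f" >&2
            return 1
        fi
    done
    # Owner read/write only; nobody else on the cluster can read the key.
    chmod 600 "$name.sec.pem"
    echo "key pair OK: $name"
}
```

Calling check_c4gh_keys with the same name that was given to crypt4gh generate prints a confirmation, or names the missing file and returns a non-zero status.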

File download

Download is possible only if you have access to at least one dataset that contains at least one file.

Get Elixir AAI token

Go to https://fega.etais.ee/login to get an Elixir AAI token.

The page shows the access token with a Copy to clipboard button below it. Click the button.

Make the environment token

Copy the token and export it as an environment variable. In the terminal, run:

export token=<access token>
Where:

  • <access token> is the Elixir AAI token from the previous step.
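
An empty or unset variable is an easy mistake to miss, for example when the clipboard was empty at paste time. A minimal sketch of a check to run before the sda-dive commands follows; require_token is a hypothetical helper name, and the only thing taken from this manual is the variable name token.

```shell
# Hypothetical helper: stop early if the token variable is empty or unset.
require_token() {
    if [ -z "${token:-}" ]; then
        echo "token is not set; export it first" >&2
        return 1
    fi
    # Print only the length, never the token itself.
    echo "token is set (${#token} characters)"
}
```

The helper reports only the length of the token, so the value does not end up in terminal scrollback or logs.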

Datasets

To activate the sda-dive command-line tool, type in the terminal window:

module load crypt4gh/1.8.5

Check which datasets have been made available to you. In the terminal, run:

sda-dive datasets

The list of datasets available to you will be displayed.

Example output
Listing datasets...
EGAD50000000191
EGAD50000000199
...

Important

The access token is valid for 1 hour only. To continue after that, retrieve a new token and export it again by repeating the two steps above (Get Elixir AAI token and Make the environment token). If you have been granted access to new datasets but do not see them, clear your browser cache and start again from getting a new token.
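
One way to keep track of the 1 hour limit is to note the time at which the token was exported and compare it with the current time. The sketch below assumes you record that time yourself, for instance with issued=$(date +%s) right after the export; token_fresh and issued are hypothetical names and are not read by sda-dive. Because the token is issued slightly before you export it, the real remaining lifetime is a little shorter than this estimate.

```shell
# Hypothetical helper: estimate whether a token exported at time <issued>
# (seconds since the epoch) is still inside the 1 hour validity window.
# Usage: token_fresh <issued> [now]
token_fresh() {
    issued="$1"
    now="${2:-$(date +%s)}"
    age=$(( now - issued ))
    # 3600 seconds = the 1 hour validity stated in this manual.
    if [ "$age" -lt 3600 ]; then
        echo "token age ${age}s: still valid"
    else
        echo "token age ${age}s: fetch a new token" >&2
        return 1
    fi
}
```

The optional second argument exists only so that the check can be tried with fixed timestamps; in normal use it is left out and the current time is used.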

Files in the dataset

Check the files in the dataset. In the terminal, run:

sda-dive files <datasetID>
Where:

  • <datasetID> is a dataset ID from the Datasets section (for example, EGAD50000000199).
Example output
Listing files for dataset: EGAD50000000261
fileId: EGAF50000105565
datasetId: EGAD50000000261
displayFileName: 2023-10-24 sample 1.vcf.c4gh
filePath: tommy.tomson_ut.ee/2023-10-24 sample 1.vcf.c4gh
fileName: b611a9bc-2a4a-4388-866c-fc46ea107e3d
fileSize: 66951814
decryptedFileSize: 66923198
decryptedFileChecksum: 4a42ec4702556754776aa2b861733a84cb150c13d1f243cd06cb1ac73a1f7cbd
decryptedFileChecksumType: SHA256
fileStatus: ready
createdAt: 2024-01-12T13:57:16.845924Z
lastModified: 2024-01-15T10:21:01.009706Z
fileId: EGAF50000105566
datasetId: EGAD50000000261
displayFileName: 2023-10-24 sample 2.vcf.c4gh
filePath: tommy.tomson_ut.ee/2023-10-24 sample 2.vcf.c4gh
fileName: ca80e640-729c-47df-88b7-1463dc991495
fileSize: 68584979
decryptedFileSize: 68555663
decryptedFileChecksum: 8e7c6d5f1d67bdd9b89c48f94aa60481e4e99d63712007fa5ef0e6bdc0f7788a
decryptedFileChecksumType: SHA256
fileStatus: ready
createdAt: 2024-01-12T14:03:30.790821Z
lastModified: 2024-01-15T10:21:01.046876Z
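
For datasets with many files, the listing can be condensed to one line per file. The following is a minimal sketch that assumes the output keeps the "key: value" layout shown in the example above; list_file_ids is a hypothetical helper name, not an sda-dive subcommand.

```shell
# Hypothetical helper: read the file listing on stdin and print
# "fileId<TAB>displayFileName" for every file in it.
list_file_ids() {
    awk '
        /^fileId: /          { sub(/^fileId: /, ""); id = $0 }
        /^displayFileName: / { sub(/^displayFileName: /, ""); print id "\t" $0 }
    '
}
```

It would be used by piping the listing into it: sda-dive files <datasetID> | list_file_ids. File names that contain spaces are kept intact because everything after the key is taken as the value.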

Encryption

For data protection, all files are downloaded in encrypted form and can be decrypted later only by you. Therefore, first make sure you have the crypt4gh public key (<c4gh_name>.pub.pem) from the Generate a C4GH key pair section; you need it for the download step.

Download

To download a specific file with a manual destination filename, use the following command syntax:

sda-dive -p <c4gh_name>.pub.pem download <fileID> -o <destinationFileName>.c4gh
Where:

  • <c4gh_name>.pub.pem is the name of your c4gh public key file.
  • <fileID> is a file ID from the Files in the dataset section (EGAF50000105565 in our example).
  • <destinationFileName> is the name the downloaded file gets on your system. NB! The downloaded file is encrypted, so it needs the .c4gh extension.

The file will be saved in the current working directory of your terminal.

For downloading files while preserving their original names and extensions, execute:

sda-dive -p <c4gh_name>.pub.pem download <datasetID> <fileID>
Where:

  • <c4gh_name>.pub.pem is the name of your c4gh public key file.
  • <datasetID> is the dataset ID from the Files in the dataset section (EGAD50000000261 in our example).
  • <fileID> is a file ID from the Files in the dataset section (EGAF50000105565 in our example).

This command automatically retrieves and uses the original filename and extension from the archive metadata for the saved file, placing it in the terminal's active directory.

To download all files within a dataset:

sda-dive -p <c4gh_name>.pub.pem download <datasetID> --all-files
Where:

  • <c4gh_name>.pub.pem is the name of your c4gh public key file.
  • <datasetID> is the dataset ID from the Files in the dataset section (EGAD50000000261 in our example).

This command downloads all files associated with the specified dataset into the current working directory.

Decrypt

Activate the required software module in your HPC cluster terminal window:

module load crypt4gh/1.8.5

Decrypt the downloaded file by running the following command in the terminal:

crypt4gh decrypt --file=<my_data_file> -s <c4gh_name>.sec.pem

Where:

  • <my_data_file> is the name of the file that you want to decrypt.
  • <c4gh_name>.sec.pem is the crypt4gh private key file name.
  • If you set a passphrase when generating the key pair, you will be asked to enter it.
  • Afterwards you should have a new file without the .c4gh ending. This is your decrypted file.
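
The file listing reports a decryptedFileChecksum of type SHA256 for each file, so the decrypted file can be compared against it. Below is a minimal sketch, assuming the standard sha256sum utility is available on the cluster; verify_sha256 is a hypothetical helper name.

```shell
# Hypothetical helper: compare the SHA-256 digest of a file with an
# expected hex digest (for example the decryptedFileChecksum value).
# Usage: verify_sha256 <decrypted_file> <expected_checksum>
verify_sha256() {
    # Reading from stdin keeps unusual file names out of sha256sum's output.
    actual=$(sha256sum < "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "checksum OK: $1"
    else
        echo "checksum MISMATCH: $1" >&2
        return 1
    fi
}
```

A mismatch usually points to an incomplete download or a decryption problem, in which case repeating the download is a reasonable first step.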