sda-dive to the Rocket cluster¶
User manual for downloading data from FEGA to the UTHPC cluster.
Data download consists of two separate tasks:
- Apply for data access.
- File download.
Apply for data access¶
- Log in to REMS.
- Browse the Catalogue for datasets.
- When you find the dataset you need, press the Add to cart button at the end of its row. You can add several datasets and request access to all of them at the same time.
- The cart is shown at the top of the web page.
- When you have finished adding datasets, press the Apply button at the bottom of the cart.
- The data request form opens. Review it and, if everything is correct, click Send application.
- Contact Vlad or Tommy, who can review the application and approve data access.
Generate a C4GH key pair¶
- Log in to the UTHPC cluster with your account.
- All terminal commands must be run in the terminal window of the UTHPC cluster.
- To activate the crypt4gh command-line tool, run in the terminal window:
module load crypt4gh/1.8.5
- Run the following command to generate the key pair:
crypt4gh generate -n <c4gh_name>
Where <c4gh_name> is the name of your crypt4gh key file.
- You will be asked to enter a passphrase. It is highly recommended to set one and save it for later use.
- The command produces two files, <c4gh_name>.sec.pem and <c4gh_name>.pub.pem: the crypt4gh private and public key, respectively.
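Before moving on, it can help to confirm that both key files exist. The following is a minimal sketch, not part of crypt4gh: `check_keypair` is a hypothetical helper name, and tightening the private key's permissions with `chmod 600` is a general precaution assumed here rather than a requirement stated in this manual.

```shell
# check_keypair NAME: succeed only if NAME.sec.pem and NAME.pub.pem both exist.
# Side effect (an assumption, not a crypt4gh requirement): the private key is
# made readable and writable by its owner only.
check_keypair() {
    name=$1
    if [ ! -f "$name.sec.pem" ] || [ ! -f "$name.pub.pem" ]; then
        echo "missing key file for '$name'" >&2
        return 1
    fi
    chmod 600 "$name.sec.pem"
    echo "key pair '$name' is complete"
}
```

Run it as `check_keypair <c4gh_name>` in the directory where the keys were generated.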
File download¶
Download is possible only when you have access to at least one dataset that contains at least one file.
Get Elixir AAI token¶
Go to https://fega.etais.ee/login to get an Elixir AAI token.
You should see the access token with a Copy to clipboard button below it. Click the button.
Make the environment token¶
Copy the token and store it in an environment variable. In the terminal, run:
export token=<access token>
Where <access token> is the Elixir AAI token from the previous step.
Datasets¶
To activate the sda-dive command-line tool, run in the terminal window:
module load crypt4gh/1.8.5
Check which datasets are available to you. In the terminal, run:
sda-dive datasets
The list of datasets available to you will be displayed.
Example output
Listing datasets...
EGAD50000000191
EGAD50000000199
...
Important
The access token is valid for 1 hour only. To continue after 1 hour, retrieve and export a new token by repeating steps 1 and 2 (Get Elixir AAI token and Make the environment token). If you have been granted access to new datasets and do not see them, clear your browser cache and start again from step 1.
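One way to avoid starting a long download with a stale token is to note when it was exported. The sketch below is an approximation under stated assumptions: `token_seconds_left` and `token_exported_at` are hypothetical names, and the hour is counted from the moment of export, whereas the real lifetime starts when the token is issued on the login page.

```shell
# token_seconds_left EXPORTED_AT [NOW]: print how many seconds remain of the
# one-hour window, given the export time and the current time in epoch seconds.
token_seconds_left() {
    exported_at=$1
    now=${2:-$(date +%s)}
    echo $(( exported_at + 3600 - now ))
}

# Record the time right after running: export token=<access token>
token_exported_at=$(date +%s)

# Later, before starting a download:
if [ "$(token_seconds_left "$token_exported_at")" -le 0 ]; then
    echo "token expired: fetch and export a new one" >&2
fi
```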
Files in the dataset¶
Check the files in the dataset. In the terminal, run:
sda-dive files <datasetID>
Where <datasetID> is the dataset ID from step 3 (EGAD50000000199 in our example).
Example output
Listing files for dataset: EGAD50000000261
fileId: EGAF50000105565
datasetId: EGAD50000000261
displayFileName: 2023-10-24 sample 1.vcf.c4gh
filePath: tommy.tomson_ut.ee/2023-10-24 sample 1.vcf.c4gh
fileName: b611a9bc-2a4a-4388-866c-fc46ea107e3d
fileSize: 66951814
decryptedFileSize: 66923198
decryptedFileChecksum: 4a42ec4702556754776aa2b861733a84cb150c13d1f243cd06cb1ac73a1f7cbd
decryptedFileChecksumType: SHA256
fileStatus: ready
createdAt: 2024-01-12T13:57:16.845924Z
lastModified: 2024-01-15T10:21:01.009706Z
fileId: EGAF50000105566
datasetId: EGAD50000000261
displayFileName: 2023-10-24 sample 2.vcf.c4gh
filePath: tommy.tomson_ut.ee/2023-10-24 sample 2.vcf.c4gh
fileName: ca80e640-729c-47df-88b7-1463dc991495
fileSize: 68584979
decryptedFileSize: 68555663
decryptedFileChecksum: 8e7c6d5f1d67bdd9b89c48f94aa60481e4e99d63712007fa5ef0e6bdc0f7788a
decryptedFileChecksumType: SHA256
fileStatus: ready
createdAt: 2024-01-12T14:03:30.790821Z
lastModified: 2024-01-15T10:21:01.046876Z
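For datasets with many files, the listing can be condensed to one line per file. This is a minimal sketch that assumes the listing is written to standard output in the `key: value` layout shown in the example above; `list_files` is a hypothetical helper, not an sda-dive subcommand.

```shell
# list_files: read a file listing on stdin and print one
# "fileId<TAB>displayFileName" line per file.
list_files() {
    awk '
        /^[ \t]*fileId:/          { sub(/^[ \t]*fileId:[ \t]*/, ""); id = $0 }
        /^[ \t]*displayFileName:/ { sub(/^[ \t]*displayFileName:[ \t]*/, ""); print id "\t" $0 }
    '
}
```

Under that assumption it would be used as `sda-dive files <datasetID> | list_files`.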
Encryption¶
For data protection purposes, all files are downloaded in encrypted form and can be decrypted later only by you. Therefore, first make sure you have the crypt4gh public key (<c4gh public key>) from Generate a C4GH key pair. You need it for the download step.
Download¶
To download a specific file under a destination filename of your choice, use the following command syntax:
sda-dive -p <c4gh_name>.pub.pem download <fileID> -o <destinationFileName>.c4gh
Where:
- <c4gh_name>.pub.pem is the name of your c4gh public key.
- <fileID> is the file ID from step 4 (EGAF50000105565 in our example).
- <destinationFileName> is the name of the downloaded file on your system. NB! The downloaded file is encrypted, so it needs the .c4gh extension.
The file will be saved in the current working directory of your terminal.
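Display names in the archive may contain spaces (as in the example listing), which are easy to mishandle on the command line. The sketch below shows one way to build a destination filename; `dest_name` is a hypothetical helper, and replacing spaces with underscores is a convenience assumed here, not something sda-dive requires.

```shell
# dest_name NAME: print NAME with spaces replaced by underscores and
# a .c4gh extension appended if it is not already there.
dest_name() {
    name=$(printf '%s' "$1" | tr ' ' '_')
    case $name in
        *.c4gh) printf '%s\n' "$name" ;;
        *)      printf '%s\n' "$name.c4gh" ;;
    esac
}
```

The result can be passed to the `-o` option in double quotes, for example `-o "$(dest_name "2023-10-24 sample 1.vcf")"`.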
To download a file while preserving its original name and extension, run:
sda-dive -p <c4gh_name>.pub.pem download <datasetID> <fileID>
Where:
- <c4gh_name>.pub.pem is the name of your c4gh public key.
- <datasetID> is the dataset ID from step 4 (EGAD50000000261 in our example).
- <fileID> is the file ID from step 4 (EGAF50000105565 in our example).
This command takes the original filename and extension from the archive metadata and saves the file under that name in the current working directory.
To download all files within a dataset:
sda-dive -p <c4gh_name>.pub.pem download <datasetID> --all-files
Where:
- <c4gh_name>.pub.pem is the name of your c4gh public key.
- <datasetID> is the dataset ID from step 4 (EGAD50000000261 in our example).
This command downloads all files associated with the specified dataset into the current working directory.
Decrypt¶
Activate the required software module in your HPC cluster terminal window:
module load crypt4gh/1.8.5
Decrypt the downloaded file by running the following command in the terminal:
crypt4gh decrypt --file=<my_data_file> -s <c4gh_name>.sec.pem
Where:
- <my_data_file> is the name of the file that you want to decrypt.
- <c4gh_name>.sec.pem is the name of your crypt4gh private key file.
You will be asked to enter the passphrase if you set one. You should then have a new file without the .c4gh ending: this is your decrypted file.
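The file listing in step 4 reports a decryptedFileChecksum of type SHA256 for every file, so the decrypted file can be compared against it. This is a minimal sketch assuming GNU `sha256sum` is available on the cluster; `verify_sha256` is a hypothetical helper name.

```shell
# verify_sha256 FILE EXPECTED: succeed only if the SHA-256 digest of FILE
# equals EXPECTED (the decryptedFileChecksum value from the file listing).
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}
```

Call it with the decrypted file name and the checksum from the listing, for example `verify_sha256 <decrypted_file> <decryptedFileChecksum>`. A mismatch suggests an incomplete download or a failed decryption.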