GATK4 & Cromwell workflow¶
The GATK4 & Cromwell workflow is available only for the Institute of Genomics.
Important
This guide is for existing Cromwell users. If you want to start using Cromwell, please write to support@hpc.ut.ee .
Introduction to WDL workflows¶
Workflow components¶
<workflow_name>.inputs.json
- an input file that contains the paths to all input files and references needed to execute the workflow. Here is an inputs.json example for a test sample .
<workflow_name>.option.json
- an options file; its main purpose is to specify the output directory of the workflow. Here is an option.json example for a test sample .
Note
JavaScript Object Notation (JSON) is an open-standard file format and data-interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays. JSON objects are used for transferring data between server and client. See the JSON examples and read more on Wikipedia .
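As a minimal generic illustration (a sketch only; the values are placeholders, the keys are borrowed from the wgs.inputs.json example further down), a JSON object with attribute-value pairs and an array looks like this:
{
    "sample_name": "EXAMPLE",
    "unmapped_bam_suffix": ".bam",
    "flowcell_unmapped_bams": ["EXAMPLE_A.bam", "EXAMPLE_B.bam"]
}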
Workflow tools¶
- Cromwell server - a workflow execution engine that runs WDL workflows on the UTHPC cluster. The Cromwell server talks to Slurm and handles jobs on its own. See the Cromwell github page and docs .
Note
For the Institute of Genomics, the Cromwell server is run under UTHPC management.
- cromshell - a command-line tool to talk to the Cromwell server.
Note
Automation tools are specific to a particular workflow. Make sure to use the correct ones.
- WholeGenomeGermlineSingleSample (WGS):
  - createInputWGS.py - an automation script that creates 'inputs.json' and 'option.json' for all samples in a given dir. For details, see createInputWGS.py .
  - submit-WGS-batch.sh - an automation script that uses cromshell for easy submission of workflows. For details, see submit-WGS-batch.sh .
- VariantCalling (VC):
  - createVCinput.py - an automation script that creates a vc_output dir with a sample-name dir and 'inputs.json' and 'option.json' for all specified samples. For details, see createVCinput.py .
  - submitVC-batch - an automation script that uses cromshell for easy submission of workflows. For details, see submitVC-batch .
Basic steps to run a workflow¶
- Create inputs.json and option.json files for each sample.
- Submit the workflow for one or more samples. The submission lists all samples to run and a workflow ID for each sample. A workflow ID looks like this: {"id":"xxxxxxxx-xxxx-xxxx-xxxx","status":"Submitted"}.
- Check whether the workflow has finished with cromshell list -u.
A command-level sketch of these steps follows below.
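Put together, a typical run of the WGS workflow looks roughly like the sketch below. This is only a summary of the commands documented in the manual sections further down; the sample names are placeholders.
module load any/cromwell-tools

# 1. create inputs.json and option.json for all samples
cd <path/to/catalogue/of/samples>
createInputWGS.py -i ./

# 2. submit one or more samples, keeping a log of the returned workflow IDs
submit-WGS-batch.sh <sample_name1> <sample_name2> | tee -a cromshell_logs

# 3. check the completion status of all unfinished workflows
cromshell list -u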
GATK4 & Cromwell module¶
Available GATK4 and Cromwell module versions:
- GATK4: 4.2.0.0, module any/gatk4.
- Cromwell: 53.1, 63.1, module any/cromwell/63.1.
Important
There is no need to load the gatk4/cromwell modules to interact with the Cromwell server. You only need to load the cromwell-tools module. Read more below.
module load any/cromwell-tools
Cromwell¶
Cromwell server¶
UTHPC runs a Singularity Cromwell server and an ordinary (native) Cromwell server. They have different CROMWELL_URLs, so be careful when choosing the CROMWELL_URL. However, you can always change the CROMWELL_URL later .
- The native Cromwell server has CROMWELL_URL 172.17.63.1:15000 and runs the WGS pipeline and the VC pipeline .
- The Singularity Cromwell server has CROMWELL_URL 172.17.63.3:14000 and runs the MoChA pipeline .
Configure cromshell¶
To interact with the Cromwell server, use the cromshell command. Before first use, you have to configure it. You need to do the configuration only once.
module load any/cromwell-tools
Depending on what pipeline you intend to run, choose your Cromwell server. For running the WholeGenomeGermlineSingleSample and VariantCalling pipelines, select 'Native Cromwell'. For the MoChA pipeline, choose 'Singularity Cromwell'. Insert the appropriate CROMWELL_URL when cromshell prompts for it.
| | Native Cromwell | Singularity Cromwell |
|---|---|---|
| CROMWELL_URL | 172.17.63.1:15000 | 172.17.63.3:14000 |
| Executable pipelines | WGS & VariantCalling | MoChA |
To connect your cromshell to the GI Cromwell server, enter the details as stated below:
- Run the cromshell command.
- Enter the following info at the prompt:
  - Cromwell URL: <insert the required CROMWELL_URL>.
  - Confirmation: yes.
Here is an example of setting up for Native Cromwell:
cromshell
Welcome to Cromshell
What is the URL of the Cromwell server you use to execute
your jobs?
Cromwell URL, please: 172.17.63.1:15000
Oh my! It looks like your server is: 172.17.63.1:15000
Is this correct [Yes/No] : yes
OK. Im now setting your Cromwell server to be: 172.17.63.1:15000
Dont worry - this wont hurt a bit...DONE
OK, you should be ALL SET!
Interacting with Cromwell server¶
To interact with the Cromwell server, use the cromshell command. Make sure cromshell has been configured as described in the section above, then load the module:
module load any/cromwell-tools
Cromshell usage examples¶
Note
To learn how to submit a workflow, please read the 'Running your own workflow' section of the pipeline description: WGS , VC and MoChA .
For further interaction with Cromwell, here are useful cromshell examples:
# abort a workflow
cromshell abort <workflow id>
# check the last workflow status
cromshell status
# Display a list of jobs submitted through cromshell
cromshell list -c
# Check completion status of all unfinished jobs
cromshell list -u
# Check metadata
cromshell -t 20 metadata <workflow id>
Change CROMWELL_URL¶
To change an already set CROMWELL_URL, edit the cromshell configuration file located in your $HOME. You can comment out the existing CROMWELL_URL and insert the new one on the second line.
vim ~/.cromshell/cromwell_server.config
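For example, after switching from the native server to the Singularity server, the file could look like the two lines below. This is only an illustrative sketch, assuming that commented-out lines start with '#'; check the actual contents of your own configuration file.
#172.17.63.1:15000
172.17.63.3:14000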
Workflows¶
Tested and optimised workflows on UTHPC cluster:
- WholeGenomeGermlineSingleSample (WGS)
- VariantCalling (reduced WGS version)
- MoChA WDL pipeline
WholeGenomeGermlineSingleSample¶
All workflow parameters and procedures have been pre-optimised for running on UTHPC cluster.
You can view an NA12878 example output produced by the pipeline at the following path:
/gpfs/space/software/gatk4_pipeline/cromwell/workflows/test_catalogue_wgs/wgs_output
Read more:
- The GitHub release used: WholeGenomeGermlineSingleSample v2.3.2
- WGS Broad institute docs: WGS Overview
- WGS Methods link
Variant calling¶
Variant Calling pipeline is a reduced version of WholeGenomeGermlineSingleSample pipeline.
The pipeline takes an aligned CRAM with its index as input. The pipeline's output includes a GVCF containing variant calls with a corresponding index.
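As an illustration of the expected input and output layout (a sketch only: the output file names below are assumptions, not taken from the pipeline documentation), a sample catalogue before and after a VC run could look like this:
samples-catalogue/
├── V00000.cram                  # input: aligned CRAM
├── V00000.cram.crai             # input: CRAM index
└── vc_output/
    └── V00000/
        ├── V00000.g.vcf.gz      # output: GVCF with variant calls (name assumed)
        └── V00000.g.vcf.gz.tbi  # output: GVCF index (name assumed)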
MoChA WDL pipeline¶
The pipeline runs under Singularity on UTHPC cluster.
Warning
The MoChA pipeline runs under Singularity Cromwell; make sure your cromshell has been set up correctly. The instructions on how to set up cromshell or change the CROMWELL_URL are here .
Read more about: MoChA WDL pipeline
WholeGenomeGermlineSingleSample (WGS) manual¶
Running a test WGS workflow¶
This test workflow is based on the public GATK4 test sample NA12878_20k. First, load the Cromwell-tools module:
module load any/cromwell-tools
To run a test workflow, you have to copy the test directory to your project dir:
cp -R /gpfs/software/soft/manual/any/cromwell/workflows/test_catalogue_wgs /gpfs/hpc/projects/egv_hg38/wgs_output/
Warning
Make sure to configure cromshell before continuing with the next steps.
Create 'inputs' and 'option' files for workflow submission. From inside your sample directory, run the createInputWGS.py script. It creates a wgs_output dir inside the sample dir and two files: WGGSS_NA12878_20k.inputs.json and WGGSS_NA12878_20k.option.json.
Here is how to do it:
#Go inside the sample catalogue
cd /gpfs/hpc/projects/egv_hg38/wgs_output/test_catalogue_wgs
#From inside test_catalogue_wgs
createInputWGS.py -i ./
Output:
CATALOGUE PATH: /path/to/sample/catalogue/
WGS Input created: /path/to/sample/catalogue/wgs_output/test_catalogue_wgs/NA12878_20k/WGGSS_NA12878_20k.inputs.json
WGS Options created: /path/to/sample/catalogue/wgs_output/test_catalogue_wgs/NA12878_20k/WGGSS_NA12878_20k.option.json
Submit the workflow for sample ’NA12878_20k’
submit-WGS-batch.sh NA12878_20k | tee -a cromshell_logs
Output:
Workflow will run only for samples: NA12878_20k
WGGSS workflow is submitted for: NA12878_20k
sample_dir: /gpfs/hpc/projects/egv_hg38/wgs_output/test_catalogue_wgs/NA12878_20k
Sub-Command: submit
Submitting job to server: 172.17.63.1:15000
{"id":"xxxxxxxx-xxxx-xxxx-xxxx","status":"Submitted"}
# check the last workflow status
cromshell status
# Display a list of jobs submitted through cromshell
cromshell list -c
# Check completion status of all unfinished jobs
cromshell list -u
Running your own workflow¶
Below is an example of how to run your own workflow.
- Go inside your sample directory (the directory with all samples).
  Warning
  If you don't have write permission inside the sample directory, check whether a wgs_output dir already exists and whether you have write permission to it. If you don't, ask the owner to create the wgs_output dir for you and to give you permissions for it.
- Run createInputWGS.py -i ./ from inside your sample directory. It creates wgs_output in the sample directory, if it doesn't exist already, and sample directories with inputs.json & option.json files for all samples inside wgs_output. <sample-catalogue>/wgs_output/<sample-name> is the location of all output files after the workflow finishes.
  Warning
  Make sure to configure cromshell before continuing with the next steps.
- Submit a batch of samples with the submit-WGS-batch.sh script. Please ensure that sample_name has no / character at the end. submit-WGS-batch.sh takes 'sample_names', not directories.
  Submission example:
  submit-WGS-batch.sh <sample_name1> <sample_name2> <sample_name3> <sample_name4> | tee -a cromshell_logs
  Important
  cromshell_logs is an important file for identifying the sample name and workflow ID later, in case you have several samples to submit (see the lookup sketch after this list).
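Because cromshell_logs keeps the submission output for every sample, you can later look up which workflow ID belongs to which sample. A minimal sketch, assuming the log contains the submission output shown in the test run above (the sample name is a placeholder):
# print the submission block, including the workflow ID, for one sample
grep -A 4 "submitted for: <sample_name>" cromshell_logs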
List of commands to see the submission status:
# check the last workflow status
cromshell status
# Display a list of jobs submitted through cromshell
cromshell list -c
# Check completion status of all unfinished jobs
cromshell list -u
Automation tools¶
Create inputs.json & option.json¶
createInputWGS.py is a Python script that looks for the wgs_output dir in the sample directory and creates a sample dir inside wgs_output, which contains inputs.json and option.json for a given sample. The tool assumes that samples are in a directory of samples as shown below. Please note that submit-WGS-batch.sh also relies on the following structure.
Here is a structure example of a sample directory:
samples-catalogue/
├── V00000
│ ├── H7G2M.1.bam
│ ├── H7G2M.2.bam
│ ├── H7G2M.3.bam
│ ├── H7G2M.4.bam
│ ├── H7G2M.5.bam
│ ├── H7G2M.6.bam
│ └── H7G2M.7.bam
└── V00001
├── H1G1M.1.bam
├── H1G1M.2.bam
├── H1G1M.3.bam
├── H1G1M.4.bam
├── H1G1M.5.bam
├── H1G1M.6.bam
└── H1G1M.7.bam
Usage example:
#run the tool
createInputWGS.py -i <path/to/catalogue/of/samples>
# help
createInputWGS.py -h
Sample directory after running createInputWGS.py:
samples-catalogue/
├── V00000
│ ├── H7G2M.1.bam
│ ├── H7G2M.2.bam
│ ├── H7G2M.3.bam
│ ├── H7G2M.4.bam
│ ├── H7G2M.5.bam
│ ├── H7G2M.6.bam
│ └── H7G2M.7.bam
├── V00001
│ ├── H1G1M.1.bam
│ ├── H1G1M.2.bam
│ ├── H1G1M.3.bam
│ ├── H1G1M.4.bam
│ ├── H1G1M.5.bam
│ ├── H1G1M.6.bam
│ └── H1G1M.7.bam
└── wgs_output
├── V00000
│ ├── WGGSS_V00000.inputs.json
│ └── WGGSS_V00000.option.json
└── V00001
├── WGGSS_V00001.inputs.json
└── WGGSS_V00001.option.json
Submit several samples¶
submit-WGS-batch.sh is a bash script that submits several samples to the Cromwell server. It must be run from the root of the sample directory.
Note
It's recommended to use | tee -a cromshell_logs as seen in the example below, as it appends the submission info to cromshell_logs, making it possible to match a 'workflowID' with a sample name.
cd <path/to/catalogue/of/samples>
submit-WGS-batch.sh <V00000> <V00001> ... | tee -a cromshell_logs
Examples wgs.inputs.json and wgs.option.json¶
wgs.inputs.json¶
{
"WholeGenomeGermlineSingleSample.sample_and_unmapped_bams": {
"sample_name": "NA12878_20k",
"base_file_name": "NA12878_20k",
"flowcell_unmapped_bams": [
"/gpfs/space/home/<user>/test_catalogue_wgs/NA12878_20k/NA12878_A.bam",
"/gpfs/space/home/<user>/test_catalogue_wgs/NA12878_20k/NA12878_B.bam",
"/gpfs/space/home/<user>/test_catalogue_wgs/NA12878_20k/NA12878_C.bam"
],
"final_gvcf_base_name": "NA12878_20k",
"unmapped_bam_suffix": ".bam"
},
"WholeGenomeGermlineSingleSample.references": {
"contamination_sites_ud": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.UD",
"contamination_sites_bed": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.bed",
"contamination_sites_mu": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.mu",
"calling_interval_list": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list",
"reference_fasta": {
"ref_dict": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dict",
"ref_fasta": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta",
"ref_fasta_index": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"ref_alt": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt",
"ref_sa": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa",
"ref_amb": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb",
"ref_bwt": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt",
"ref_ann": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann",
"ref_pac": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac"
},
"known_indels_sites_vcfs": [
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
],
"known_indels_sites_indices": [
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
],
"dbsnp_vcf": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
"dbsnp_vcf_index": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
"evaluation_interval_list": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
"haplotype_database_file": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.haplotype_database.txt"
},
"WholeGenomeGermlineSingleSample.scatter_settings": {
"haplotype_scatter_count": 10,
"break_bands_at_multiples_of": 100000
},
"WholeGenomeGermlineSingleSample.wgs_coverage_interval_list": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/wgs_coverage_regions.hg38.interval_list",
"WholeGenomeGermlineSingleSample.papi_settings": {
"preemptible_tries": 3,
"agg_preemptible_tries": 3
}
}
wgs.option.json¶
{
"final_workflow_outputs_dir": "/gpfs/space/home/<user>/test_catalogue_wgs/wgs_output/NA12878_20k",
"use_relative_output_paths": "true"
}
VariantCalling manual¶
Running a test VariantCalling workflow¶
This test workflow is based on the CRAMs of the public GATK4 test sample NA12878_20k.
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.1:15000 by issuing the cromshell -h command.
First, load the Cromwell-tools module:
module load any/cromwell-tools
To run a test workflow, you have to copy the test directory to your project dir (make sure that the gvaramu group has write permission to the dir):
cp -R /gpfs/space/software/gatk4_pipeline/cromwell/workflows/test_catalogue_wgs/NA12878_20k_cram </path/to/project/dir>
Create 'inputs' and 'option' files for workflow submission. From inside your sample directory containing the CRAMs, run the createVCinput.py script. It creates a vc_output dir with an NA12878_20k sample dir and two files: vc_NA12878_20k.inputs.json and vc_NA12878_20k.option.json.
Here is how to do it:
#Go inside the sample catalogue
cd </path/to/project/dir/>NA12878_20k_cram
#From inside NA12878_20k_cram
createVCinput.py NA12878_20k
SAMPLES ['NA12878_20k']
CATALOGUE PATH: </path/to/project/dir/>/NA12878_20k_cram/
vc_output dir created in CATALOGUE PATH
VC Input created: </path/to/project/dir/>/NA12878_20k_cram/vc_output/NA12878_20k/vc_NA12878_20k.inputs.json
VC Options created: </path/to/project/dir/>/NA12878_20k_cram/vc_output/NA12878_20k/vc_NA12878_20k.option.json
Submit the workflow for sample NA12878_20k
submitVC-batch NA12878_20k | tee -a cromshell_logs
Workflow will run only for samples: NA12878_20k
WGGSS workflow is submitted for: NA12878_20k
sample_dir: </path/to/project/dir/>/NA12878_20k_cram
Sub-Command: submit
Submitting job to server: 172.17.63.1:15000
{"id":"xxxxxxxx-xxxx-xxxx-xxxx","status":"Submitted"}
Here is a list of commands to see the submission status:
# check the last workflow status
cromshell status
# Display a list of jobs submitted through cromshell
cromshell list -c
# Check completion status of all unfinished jobs
cromshell list -u
Running your own VC workflow¶
Please read the 'Running a test VariantCalling workflow' section, as it provides more exhaustive instructions.
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.1:15000 by issuing the cromshell -h command.
- Go inside your sample directory, which is the directory with all samples.
  Warning
  If you don't have write permission inside the sample directory, check whether a vc_output dir already exists and whether you have write permission to it. If you don't, ask the owner to create the vc_output dir for you and to give you permissions for it.
- Run createVCinput.py <sample name> from inside your sample directory. It creates vc_output in the sample directory, if it doesn't already exist, and sample directories with inputs.json & option.json files for all samples inside vc_output. <sample-catalogue>/vc_output/<sample-name> is the location of all output files after the workflow finishes.
- Submit a batch of samples with the submitVC-batch command. Please ensure that 'sample_names' have no extension like .cram. submitVC-batch takes 'sample_names', not directories.
  Submission example:
  submitVC-batch <sample_name1> <sample_name2> <sample_name3> <sample_name4> | tee -a cromshell_logs
  Important
  cromshell_logs is an important file for identifying the sample name and workflow ID later, in case you have several samples to submit.
Here is a list of commands to see the submission status:
# check the last workflow status
cromshell status
# Display a list of jobs submitted through cromshell
cromshell list -c
# Check completion status of all unfinished jobs
cromshell list -u
VC automation tools¶
Create vc.inputs.json and vc.option.json¶
createVCinput.py is a Python script that creates a sample dir containing vc.inputs.json and vc.option.json for a given sample inside vc_output. The tool assumes that the sample CRAMs are in the directory of samples as shown below.
Here is a structure example of a sample directory:
samples-catalogue/
├── V00000.cram
├── V00000.cram.crai
├── V00001.cram
└── V00001.cram.crai
# create inputs for one or more samples
createVCinput.py <sample1> <sample2> ... <samplen>
# help
createVCinput.py -h
Sample directory after running createVCinput.py:
samples-catalogue/
├── V00000.cram
├── V00000.cram.crai
├── V00001.cram
├── V00001.cram.crai
└── vc_output
├── V00000
│ ├── vc_V00000.inputs.json
│ └── vc_V00000.option.json
└── V00001
├── vc_V00001.inputs.json
└── vc_V00001.option.json
Submit samples to VC workflow¶
submitVC-batch is a bash script that submits several samples to the Cromwell server. It takes care of defining all the variables required for running the VC workflow, so you only have to provide the sample name. It must be run from the root of the sample directory.
Note
It's recommended to use | tee -a cromshell_logs as seen in the example below, as it appends the submission info to cromshell_logs, making it possible to match a workflowID with a sample name.
Usage example:
cd <path/to/catalogue/of/samples>
submitVC-batch <V00000> <V00001> ... | tee -a cromshell_logs
Examples vc.inputs.json and vc.option.json¶
vc.inputs.json¶
{
"WholeGenomeGermlineSingleSample.sample_and_mapped_crams": {
"sample_name": "NA12878_20k",
"base_file_name": "NA12878_20k",
"final_gvcf_base_name": "NA12878_20k",
"input_cram": "<path/to/test/catalogue>/NA12878_20k_cram/NA12878_20k.cram",
"input_cram_index": "<path/to/test/catalogue>/NA12878_20k_cram/NA12878_20k.cram.crai"
},
"WholeGenomeGermlineSingleSample.references": {
"contamination_sites_ud": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.UD",
"contamination_sites_bed": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.bed",
"contamination_sites_mu": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/contamination-resources/1000g/1000g.phase3.100k.b38.vcf.gz.dat.mu",
"calling_interval_list": "/gpfs/hpc/databases/broadinstitute/gpc-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list",
"reference_fasta": {
"ref_dict": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dict",
"ref_fasta": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta",
"ref_fasta_index": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"ref_alt": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt",
"ref_sa": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa",
"ref_amb": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb",
"ref_bwt": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt",
"ref_ann": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann",
"ref_pac": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac"
},
"known_indels_sites_vcfs": [
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
],
"known_indels_sites_indices": [
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
"/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
],
"dbsnp_vcf": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
"dbsnp_vcf_index": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
"evaluation_interval_list": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
"haplotype_database_file": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/Homo_sapiens_assembly38.haplotype_database.txt"
},
"WholeGenomeGermlineSingleSample.scatter_settings": {
"haplotype_scatter_count": 10,
"break_bands_at_multiples_of": 100000
},
"WholeGenomeGermlineSingleSample.wgs_coverage_interval_list": "/gpfs/hpc/databases/broadinstitute/references/hg38/v0/wgs_coverage_regions.hg38.interval_list",
"WholeGenomeGermlineSingleSample.papi_settings": {
"preemptible_tries": 3,
"agg_preemptible_tries": 3
}
}
vc.option.json¶
{
"final_workflow_outputs_dir": "<path/to/test/catalogue>/NA12878_20k_cram/vc_output/NA12878_20k",
"use_relative_output_paths": "true"
}
MoChA manual¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
Running a test Illumina example¶
Please read the GitHub page for the Illumina example and download all the files necessary for running the Illumina example: GitHub Illumina example
- Create/change illumina_example.json and illumina.option.json according to the examples below. Substitute the paths.
  illumina_example.json example:
  {
      "mocha.sample_set_id": "hapmap370k",
      "mocha.mode": "idat",
      "mocha.realign": true,
      "mocha.max_win_size_cm": 300.0,
      "mocha.overlap_size_cm": 5.0,
      "mocha.ref_name": "GRCh38",
      "mocha.ref_path": "</path/to/downloaded/>GRCh38",
      "mocha.manifest_path": "</path/to/downloaded/>manifests",
      "mocha.data_path": "</path/to/downloaded/>idats",
      "mocha.batch_tsv_file": "</path/to/downloaded/>tsvs/hapmap370k.batch.tsv",
      "mocha.sample_tsv_file": "</path/to/downloaded/>tsvs/hapmap370k.sample.tsv",
      "mocha.ped_file": "</path/to/downloaded/>hapmap370k.ped",
      "mocha.docker_registry": "/tmp/singularity_img/",
      "mocha.do_not_check_bpm": true
  }
  Note
  Make sure that the gvaramu group has write permission to the project dir and to the workflow output dir specified in option.json.
  chmod 775 <dir>
  illumina.option.json example:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module:
  module load any/cromwell-tools
- Submit the workflow to Cromwell. The $MOCHA variable is loaded with the Cromwell-tools module, and it defines the path to the mocha.wdl tested on the UTHPC cluster.
  cromshell submit $MOCHA illumina_example.json options.json
Running your own MoChA workflow¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
- Create inputs.json similar to illumina_example.json from the 'Running a test Illumina example' section.
- Create options.json from the example below and replace the output dir path:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module and submit the workflow:
  module load any/cromwell-tools
  cromshell submit $MOCHA <your/inputs>.json <your/options>.json
Imputation pipeline¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
Running a test imputation example¶
Please read the GitHub page for the Imputation example and download all the files necessary for running the Imputation example: GitHub Imputation example .
- Create/change impute.inputs.json and impute.options.json according to the examples below. Substitute the paths.
  impute.inputs.json example:
  {
      "impute.sample_set_id": "hapmap370k",
      "impute.mode": "pgt",
      "impute.target": "ext",
      "impute.batch_tsv_file": "</path/to/>tsvs/impute.hapmap370k.batch.tsv",
      "impute.max_win_size_cm": 50.0,
      "impute.overlap_size_cm": 5.0,
      "impute.target_chrs": ["chr12", "chrX"],
      "impute.ref_name": "GRCh38",
      "impute.ref_path": "</path/to/>GRCh38",
      "impute.data_path": "</path/to/>output",
      "impute.beagle": false
  }
  Note
  Make sure that the gvaramu group has write permission to the project dir and to the workflow output dir specified in 'option.json'.
  chmod 775 <dir>
  impute.options.json example:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module:
  module load any/cromwell-tools
- Submit the workflow to Cromwell. The $IMPUTE variable is loaded with the Cromwell-tools module, and it defines the path to the impute.wdl tested on the UTHPC cluster.
  cromshell submit $IMPUTE impute.inputs.json impute.options.json
Running your own impute workflow¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
- Create inputs.json similar to impute.inputs.json from the 'Running a test imputation example' section.
- Create options.json from the example below and replace the output dir path:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module and submit the workflow:
  module load any/cromwell-tools
  cromshell submit $IMPUTE <your/inputs>.json <your/options>.json
Allelic Shift pipeline¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
Running a test Allelic Shift example¶
Please read the GitHub page for the Allelic Shift pipeline and download all the files necessary for running the Allelic Shift pipeline example: GitHub Allelic Shift Pipeline
- Create/change shift.json and shift.options.json according to the examples below. Substitute the paths.
  shift.json example:
  {
      "shift.sample_set_id": "hapmap370k",
      "shift.region": "chrX",
      "shift.samples_file": "</path/to/>tsvs/hapmap370k.mLOX.lines",
      "shift.batch_tsv_file": "</path/to/>tsvs/shift.hapmap370k.batch.tsv",
      "shift.ref_path": "</path/to/>GRCh38",
      "shift.data_path": "</path/to/>impute_output"
  }
  Note
  Make sure that the gvaramu group has write permission to the project dir and to the workflow output dir specified in 'option.json'.
  chmod 775 <dir>
  shift.options.json example:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module:
  module load any/cromwell-tools
- Submit the workflow to Cromwell. The $SHIFT variable is loaded with the Cromwell-tools module, and it defines the path to the shift.wdl tested on the UTHPC cluster.
  cromshell submit $SHIFT shift.json shift.options.json
Running your own Shift workflow¶
Warning
Make sure to correctly configure cromshell . Check that CROMWELL_URL=172.17.63.3:14000 by issuing the cromshell -h command.
- Create inputs.json similar to shift.json from the 'Running a test Allelic Shift example' section.
- Create shift.options.json from the example below and replace the output dir path:
  {
      "final_workflow_outputs_dir": "</path/to/upload/the/output>",
      "use_relative_output_paths": "true"
  }
- Load the Cromwell-tools module and submit the workflow:
  module load any/cromwell-tools
  cromshell submit $SHIFT <your/inputs>.json <your/options>.json
Troubleshooting¶
If there are any issues or questions regarding workflows, Cromwell, and/or GATK4, please contact support@hpc.ut.ee for support.