Lab 4 - Running jobs on the cluster

Introduction

Welcome to the jobs lab. In this session we're going to take the training wheels off and let you write and optimize your own job scripts (a.k.a. batch scripts).

We'll also acquaint ourselves with different ways of getting info about the cluster, cover monitoring a job's resource usage, and practice debugging in an interactive srun session.

Getting to know the cluster

In this section you'll learn about the structure of the cluster and the commands that give you information about it.

A good primer on the subject can be found on our documentation page here.

Complete

You should familiarize yourself with the concepts on the page linked above.

Additionally, try running the commands listed there and make sense of the output.

Hint: to get less of a wall of text, you can run the scontrol show partition command individually for each partition by appending the partition name, such as scontrol show partition intel.
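
If you are unsure where to start, the sketch below lists a few standard Slurm commands that cover the same ground (the partition name intel is only an example, substitute one that exists on the cluster):

# List all partitions and the state of their nodes at a glance
sinfo

# Condensed per-partition summary of node counts by state
sinfo -s

# Full configuration of a single partition
scontrol show partition intel

# Detailed information about a single node
scontrol show node <node name>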

Now that you've seen the partitions, let's take a look at the nodes that comprise them. A useful tool for getting a quick overview of the state of the cluster is the slstat command. It prints out information about each node in the cluster, by default with the following fields:

Field          | Content
Name           | Name of the node
State          | State of the node. Can be idle, mixed, allocated, reserved, draining, drained or down
CPU info       | Total number of CPU cores on the node and how many of them are available
Load5          | System load averaged over 5 minutes. A load higher than the number of CPUs on the machine means processes are waiting in the queue and the system is overloaded
Memory free    | Total and available RAM on the node
Username:JobId | The jobs currently running on the node

Info

Further info on the different states nodes can be in:

  • Idle means the node is up and has no jobs allocated to it.
  • Mixed means the node has jobs allocated, but still has resources available for additional ones.
  • Allocated means that all of the node's resources are allocated to jobs.
  • Reserved means the node is part of some reservation and not currently accessible to the general queue.
  • Draining/Drained means the node is being pulled out of the queue and is no longer accepting new jobs. This is usually a sign that maintenance is about to be done on it.
  • Down means that the node is taken out of the queue for repairs.

You can explore further options of slstat with

slstat --help

Array jobs

Have you ever had to submit hundreds of jobs that only differ by one parameter? Instead of looping over an sbatch submitter script, array jobs are the simpler solution.

Array jobs allow you to pass an extra parameter to sbatch, --array, which accepts an integer range; each value in that range is exposed as $SLURM_ARRAY_TASK_ID inside its own instance of the batch script. This allows for quick loops over large datasets, potentially submitting thousands of jobs in a single step. The array job has one main job ID, and the sub-tasks are identified as <main job id>_<array task id>. Further reading can be found at the official SLURM documentation site.

We will now run through a quick demonstration of using arrays

First, we create a simple sbatch task under lab4/array_job/

#!/bin/bash

#SBATCH --partition=amd
#SBATCH --time=10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=0-15%4
#SBATCH --account=<your etais allocation>

echo $SLURM_ARRAY_TASK_ID
sleep 5

The idea of the script is quite simple: we just echo the value of the array task ID. Our scoring will pick up the generated output files and check whether all 16 tasks ran successfully.

One thing to note here is the % modifier in the array specification: it controls parallelisation at the task level, so the 4 here means that at most 4 tasks are allowed to run concurrently. This parameter can be tuned to your liking if you are submitting non-test jobs.
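
For reference, the --array option is not limited to a plain range with a throttle; comma-separated lists and step sizes are also valid, and the %N suffix is optional on any of them:

#SBATCH --array=0-15          # tasks 0 through 15, no concurrency limit
#SBATCH --array=0-15%4        # same range, at most 4 tasks running at once
#SBATCH --array=1,3,5,7       # an explicit list of task IDs
#SBATCH --array=0-20:2        # every second value from 0 to 20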

We submit the job with sbatch, and then we can see some additional information for the array job

Firstly, scontrol show job <id> will now specify that it is an array job. We can see a similar change in the output of squeue

[alk@login1 lab4]$ scontrol show job 48501886 | grep -i arr
JobId=48501886 ArrayJobId=48501886 ArrayTaskId=0-15%4 ArrayTaskThrottle=4 JobName=array_submit.sh

[alk@login1 lab4]$ squeue -u alk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 48501886_[0-31%4]       amd array_su      alk PD       0:00      1 (None)

Once the job finishes the configuration step, you might notice that squeue reports that it has hit the JobArrayTaskLimit. This means that 4 tasks are running concurrently and the limit specified by %4 has been reached.

[alk@login1 lab4]$ squeue -u alk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 48501886_[8-31%4]       amd array_su      alk PD       0:00      1 (JobArrayTaskLimit)

Should we wish to change this limit during runtime, it is entirely possible with

scontrol update jobid=<job id> arraytaskthrottle=<new value>

We can also get a more detailed overview of the array status by listing all of the sub-tasks in the array using squeue -r

[alk@login1 lab4]$ squeue -u alk -r
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        48501886_8       amd array_su      alk CG       0:21      1 ares1
       48501886_29       amd array_su      alk PD       0:00      1 (JobArrayTaskLimit)
       48501886_30       amd array_su      alk PD       0:00      1 (JobArrayTaskLimit)
...
       48501886_31       amd array_su      alk PD       0:00      1 (JobArrayTaskLimit)
        48501886_9       amd array_su      alk  R       0:21      1 ares1
       48501886_10       amd array_su      alk  R       0:21      1 ares1
       48501886_11       amd array_su      alk  R       0:21      1 ares1

Once the job has finished, we will see 16 Slurm output files under lab4/array_job/ and our scoring check should turn green.

The more keen-eyed of you might have noticed that, sadly, array jobs only iterate over integers. The solution is to write a small mapping that converts the integer task ID into whatever non-integer value you need. Since this task is trivial, it is left as a voluntary exercise for the reader; a possible starting point is sketched below.
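
As a hint, one common pattern is to keep the non-integer values in a plain text file, one per line, and use the task ID to pick a line from it. A minimal sketch (the file name samples.txt is made up for illustration):

#!/bin/bash
#SBATCH --array=1-16%4

# samples.txt is a hypothetical file with one parameter value per line;
# the task ID selects the matching line (1-based, hence --array=1-16)
PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "Task $SLURM_ARRAY_TASK_ID running with parameter: $PARAM"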

Failed compression and fast(er) SSD space demonstration

While there are many aspects to running jobs on an HPC cluster, disk bandwidth may be a limiting factor in some cases. A single CPU core can churn through data at around 20 Gbit/s, so running many cores at the same time will usually start hitting hardware limits on the filesystem. Or at least it should.

To demonstrate this, I wrote a small script that compares compressing a single randomly generated text file, about 1 GB in size, on /gpfs/space and on a node's local /tmp:

#!/bin/bash

#SBATCH --partition=amd
#SBATCH --time=10
#SBATCH --ntasks=1
#SBATCH --job-name=compression_test
#SBATCH --cpus-per-task=24
#SBATCH --mem=24g
#SBATCH --account=ealloc_f362e_tommy6

#Regular compression on /gpfs/space
rsync -av /gpfs/space/projects/hpc-course/random_data.txt .
echo Compression on /gpfs/space
time xz --threads=24 random_data.txt

#Transfer to /tmp and compress
JOB_DIR=/tmp/$SLURM_JOB_ID
mkdir $JOB_DIR
chmod 700 $JOB_DIR
rsync -av /gpfs/space/projects/hpc-course/random_data.txt $JOB_DIR/
#Compute using $JOB_DIR for path values
echo Compression on /tmp
time xz --threads=24 $JOB_DIR/random_data.txt
#We would normally rsync the output back, but not necessary here
rm -rf $JOB_DIR

xz was used as it is one of the few compression tools capable of multi-threading. This was tested in 4, 8, 16 and 24 core configurations with varying memory limits, since the algorithm gets more memory-hungry when running on multiple threads.
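
If you want to repeat the experiment with a different core count, a small tweak (sketch only) saves you from keeping the --threads value in sync with the Slurm request by hand, since Slurm exports the per-task CPU count as an environment variable:

#SBATCH --cpus-per-task=16
#SBATCH --mem=16g

# Let xz use exactly as many threads as Slurm allocated to this task
time xz --threads="$SLURM_CPUS_PER_TASK" random_data.txt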

After submitting and waiting for the job to finish (it took around 2 minutes), we can examine the Slurm log file for output.

sending incremental file list

sent 67 bytes  received 12 bytes  158.00 bytes/sec
total size is 1,073,792,788  speedup is 13,592,313.77
Compression on /gpfs/space
real    0m21.861s
user    7m19.629s
sys     0m2.652s

sending incremental file list
random_data.txt

sent 1,074,055,050 bytes  received 35 bytes  429,622,034.00 bytes/sec
total size is 1,073,792,788  speedup is 1.00
Compression on /tmp
real    0m22.139s
user    7m27.629s
sys     0m1.686s

You are reading it correctly: compression took longer when reading from the fast local SSD space. This is not an apples-to-apples comparison, and there are a few probable reasons why we could not reproduce any lag on the HPC filesystem, which is also why this section of the lab is intended more as reading material.

It is highly likely that, given the algorithm used (xz) and the filesystem tuning parameters, our storage controllers saw the test file as quite popular and decided to keep it in their RAM for the time being. Combined with our InfiniBand network and its RDMA capabilities (which allow a compute node to read the memory of the controller directly), this probably meant that our filesystem was effectively serving the file from RAM.

Similar tests failed with gzip, and the only way to achieve any kind of notable difference was to run the test on a filesystem with severely restricted limits in place (10 MB/s total bandwidth). But then the case became hard to demonstrate in a meaningful context for this course. Another thing to consider is the relatively short runtime and small file: larger files will behave differently, and there is also a small penalty for transferring the data in and out of /tmp.

The key lesson here is that for single jobs, the HPC filesystem is fast enough to handle anything thrown at it. This is not always the case, though: the cluster runs around 1000 concurrent jobs at any given time, with peaks and lows, and running hundreds of compression jobs at once has been known to create considerable load on /gpfs/space. It is therefore still recommended to run larger compression jobs on /tmp.

Running a single task on multiple CPUs

You've dealt with parallelisation a little bit already in the array-jobs section. Here we'll cover the key points of another way of parallelising your workflows.

There are two main points to running a task on multiple CPU cores in parallel.

  1. The tool you are using must be written in a way that it can take advantage of multiple cores, i.e. it is able to parallelise. The tool's documentation is the best source for this kind of information.
  2. If you request multiple CPUs from Slurm in your batch script, you must also make sure your tool knows that they are available. Some tools detect this automatically; for others it has to be stated explicitly.

We'll take a look at a tool of the latter kind.
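
As a minimal sketch of that second point (not the exact contents of the lab script), the idea is to keep the CPU request and the tool's thread option in agreement; reading $SLURM_CPUS_PER_TASK avoids hard-coding the number twice:

#SBATCH --cpus-per-task=8

# bwa mem does not detect the allocation on its own: its -t option
# must be told how many threads to use (file names here are placeholders)
bwa mem -t "$SLURM_CPUS_PER_TASK" reference.fa reads.fq > aln.sam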

Complete

  • Create a directory named parallel-job in your lab4 directory
  • Copy into it the batch script named aln_synth_human.sh from /gpfs/space/projects/hpc-course/scripts/lab4/
  • Look it over, fill out your ETAIS allocation and queue it with the sbatch command

Keeping an eye on your job while it is running

After you've submitted your job, you might want to keep an eye on it and its resource utilisation. The most basic info can be obtained from the squeue command, as you've seen in the array-jobs section...

...and more verbose output can be obtained with the scontrol show job command.

Try it out for yourself:

scontrol show job your_job's_ID
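
Once a job has finished, it disappears from squeue and scontrol. Assuming the sacct accounting tool is available on the cluster, a query along these lines shows how much CPU time and memory a completed job actually used:

sacct -j <your job ID> --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State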

However, if you want to view your job's CPU and RAM utilisation on a graph, you need to look elsewhere, namely HPC's Elastic dashboard at elk.hpc.ut.ee. Open the website in your browser. You can use either University of Tartu or MyAccessId authentication to log in.

A quickstart guide is available in our docs.

As fine-tuning your job's resource allocation is almost always a case of trial and error, having a graph of actual resource utilisation plotted against time can be extremely useful. Below is an example of the same job being run with various resource allocations. Notice that in all cases the job makes full use of its CPU allocation, but the RAM allocation has been overshot by nearly a factor of 5.

NB: Both the CPU utilisation and RAM utilisation graphs have a time step of 5 minutes.

You can experiment with the resource allocations in the aln_synth_human.sh script to get it to run as efficiently as possible. There is no check for this. Note the -t option in the script: this is the option that tells the bwa mem tool how many CPU cores it should use.