GPU computing¶
Introduction¶
Important
Please make sure to read the Modules and Submitting Jobs pages first.
Hardware¶
To view the available hardware, follow the link Resources → GPUs.
Using GPU nodes¶
GPU partition¶
There is one partition governing the 'gpu' nodes; the time limit for a job is 8 days.
To view the latest details:
sinfo | grep gpu
The available GPU types are:
gpu:tesla
gpu:a100-40g
gpu:a100-80g
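One way to check yourself which GPU types are attached to which nodes is sinfo's output formatting; the format string below is just one possible choice (%N prints the node names and %G the generic resources):
sinfo -p gpu -o "%N %G"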
GPU jobs¶
To use the GPUs in the GPU partition available on rocket, you should specify:
- The GPU partition.
- The number of GPUs you wish to use.
- All other resources you need; request them as usual (see Specifying Resources).
#SBATCH --partition=gpu
#SBATCH --gres=gpu:<type>:<number_of_GPUs_you_want_to_use>
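As an illustration, a minimal batch script requesting one GPU might look like the sketch below; the job name, GPU type, memory, time limit and script name are placeholders that you should adapt to your own job:
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --cpus-per-task=3
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Placeholder workload; replace with your own commands.
python my_gpu_script.py
The --cpus-per-task value of 3 matches the recommendation for the falcon1-2 nodes in the allocation table further below.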
Note
GPU allocation is exclusive: once Slurm allocates a GPU to your job, and your job only, no one else can use that GPU until your job has finished.
You can also start an interactive session with a graphics card, under similar conditions as described in Interactive jobs. For instance:
srun -p gpu --gres gpu:tesla:1 --pty bash
Important
Interactive sessions are only for quick testing and setup, not for running actual jobs. When you run a job interactively, your session has to stay open for the full length of the job, so if the machine you are connecting from is disrupted, your job is affected as well. With an sbatch script you do not need to keep a session open.
Correct GPU job allocation¶
When scheduling jobs with GPUs, some CPUs (cores) are also allocated to the job. For good GPU-to-CPU communication, you want the allocated cores to be close to the GPU that you get. Slurm does this to some extent, but there are parameters you can set to help it along.
You can bind a GPU specifically to a task, in which case Slurm automatically tries to allocate the correct predefined cores to that task. This is done with the Slurm flag --gpus-per-task.
During scheduling you can allocate a certain number of cores to each task with the flag --cpus-per-task. When using 1 GPU per task, this should be set to the number of cores on the node divided by the number of GPUs on the node. For multiple GPUs per task the formula is (cores_on_node / GPUs_on_node) * GPUs_per_task. The recommended values are listed in the table below, followed by a worked example.
| Node | Cores | GPUs | Recommended --cpus-per-task value |
| --- | --- | --- | --- |
| falcon1-2 | 24 | 7 | 3 |
| falcon3 | 48 | 8 | 6 |
| falcon4-6 | 32 | 8 | 4 |
| pegasus | 96 | 4 | 24 |
| pegasus2 | 128 | 8 | 16 |
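For example, on pegasus2 (128 cores, 8 GPUs) a task using 1 GPU should request 128 / 8 = 16 cores, and a task using 2 GPUs should request (128 / 8) * 2 = 32 cores.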
Due to the differing values in the table above, you should ideally specify the node you wish to execute on when scheduling, since --gpus-per-task is not compatible with specifying GPUs via --gres. For instance, to run a job on pegasus2 with 1 GPU, you would schedule it like this:
srun -p gpu -w pegasus2 -n 1 -N 1 --cpus-per-task 16 --gpus-per-task 1 --hint=nomultithread python main.py
The -N flag is useful for jobs using multiple GPUs, so that you do not get allocated GPUs from different nodes. The --hint=nomultithread flag disables multithreading and allocates full cores to the job. The node is specified with -w, which also accepts a comma-separated list of nodes (useful for falcon1-2 and falcon4-6).
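For a batch job, the same pegasus2 request could be written as an sbatch script along these lines; this is only a sketch, and main.py is a placeholder for your own program:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH -w pegasus2
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=1
#SBATCH --hint=nomultithread

srun python main.py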
GPU tools¶
The GPU nodes already have the CUDA and cuDNN tools pre-installed. If your application needs different versions, the module system has alternatives available.
module avail <software_name>
Please remember that not all software you might want to use has GPU support, and even if it does, it might not run in GPU mode automatically. If you need help with installing additional GPU-capable software not yet available on the cluster, please contact support@hpc.ut.ee.
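For example, to check which CUDA versions the module system offers and load one of them (the module name cuda is an assumption here; replace <version> with one of the versions listed by module avail):
module avail cuda
module load cuda/<version>
nvcc --version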
Warning
Running processes on the GPUs without going through the queue system is strictly forbidden. This prevents people from accidentally destroying other people's jobs. Such processes are killed immediately, and doing this may result in the revocation of your UTHPC permissions.
Info
To monitor your GPU usage in real time, you can start an interactive session inside an already running job as described in Interactive jobs.
GPU monitoring¶
To interactively monitor GPU usage while the job is running, you first need to open an interactive session (or "get a shell") inside the job. The process is described in Interactive jobs.
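One common way to get such a shell into a running job is srun with the --jobid flag; note that the --overlap option requires a reasonably recent Slurm version, so check the Interactive jobs page for the recommended approach on the cluster:
srun --jobid=<your_job_id> --overlap --pty bash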
The nvidia-smi utility gives a brief overview of how much of the GPU your job is using. Pairing it with the watch command, as shown below, reruns nvidia-smi at a one-second interval and provides a close-to-real-time view of GPU utilisation.
watch -n 1 nvidia-smi
Below is a sample output of nvidia-smi. The second table shows the running process and its PID. The first table lists the GPUs by their ID, and the 'Volatile GPU-Util' column shows the GPU utilisation. If your code uses less than 50% of the GPU, try to improve the data loading / CPU part of your code, since you are not using the full potential of the GPU.
[user@falcon1 ~]$ nvidia-smi
Tue Jan 2 17:12:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 2 Tesla V100-PCIE-32GB Off| 00000000:09:00.0 Off | 0 |
| N/A 76C P0 89W / 250W| 26198MiB / 32768MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 2 N/A N/A 31678 C python 26194MiB |
+---------------------------------------------------------------------------------------+
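If you prefer a compact, log-friendly view instead of the full table, nvidia-smi can also print selected fields as CSV; the query fields below are standard nvidia-smi options, and -l 1 repeats the query every second:
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 1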