
GPU computing

Introduction

Important

Please make sure to read the Module and Submitting Jobs pages first.

Hardware

To view the available hardware, follow the link Resources → GPUs.

Using GPU nodes

GPU partition

There is one queue governing the 'gpu' nodes; the time limit for a job is 8 days.

To view the latest details:

sinfo | grep gpu

And for a list of available GPU types:

gpu:tesla
gpu:a100-40g
gpu:a100-80g

GPU jobs

To use the GPUs in the GPU partition available on rocket, you should specify:

  • The GPU partition.
  • The number of GPUs you wish to use.
  • All other resources you need. You can request them as usual (see Specifying Resources). A complete example job script is given after this list.
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:<type>:<number_of_GPUs_you_want_to_use>
    
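Putting it all together, a minimal GPU batch script could look like the sketch below. The job name, resource amounts, module name and script name are placeholders for illustration; adjust them to your workload and check module avail for the exact module names available on the cluster.

#!/bin/bash
#SBATCH --job-name=gpu-example          # placeholder job name
#SBATCH --partition=gpu                 # the GPU partition
#SBATCH --gres=gpu:tesla:1              # one Tesla GPU; change type/count as needed
#SBATCH --cpus-per-task=4               # CPU cores, e.g. for data loading
#SBATCH --mem=16G                       # host memory
#SBATCH --time=1-00:00:00               # 1 day; the partition maximum is 8 days

# Load a CUDA module if your application needs one (name/version is an example).
module load cuda

# Replace with your actual GPU application.
nvidia-smi
python my_gpu_script.py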

Note

GPU allocation is exclusive: once Slurm allocates a GPU to your job, it belongs to your job only, and no one else can use that GPU until your job has finished.

You can also start an interactive session with a graphics card under conditions similar to those described under Interactive jobs. For instance:

srun -p gpu --gres gpu:tesla:1 --pty bash

Important

Interactive sessions are intended for quick testing, not for running actual jobs. While interactive jobs are convenient for setting things up, you have to keep your session open for the full length of the job, and your job will be affected if the session you started it from is interrupted. With an sbatch script there is no session you need to keep open.

GPU tools

The GPU nodes come with CUDA and cuDNN pre-installed. If your application needs a different version, the module system has alternative versions available.

module avail <software_name>
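For example, to see which CUDA modules exist and load one (the module name below is illustrative; the exact names and versions come from the module avail output):

module avail cuda
module load cuda        # loads the default CUDA module, if one is defined
nvcc --version          # confirm which CUDA toolkit is now on your PATH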

Please remember that not all software you might want to use has GPU support, and even when it does, it might not run in GPU mode automatically. If you need help installing additional GPU-capable software that is not yet available on the cluster, please contact support@hpc.ut.ee .
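A quick way to check, from inside an allocated GPU job (for example in the interactive srun session shown above), whether the job actually sees a GPU:

nvidia-smi -L                   # lists the GPUs visible to this job
echo $CUDA_VISIBLE_DEVICES      # Slurm normally sets this to the assigned device index(es)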

Warning

Running processes on the GPUs without going through the queue system is strictly forbidden. This is to prevent people from accidentally destroying other people's jobs. Such processes are killed immediately, and running them may result in revocation of your UTHPC permissions.

Info

To monitor your GPU usage in real time, you can start an interactive session in an already running job as described in Interactive jobs.

GPU monitoring

To interactively monitor GPU usage while the job is running, you first need to open an interactive session (or "get a shell") inside the job. The process is described in Interactive jobs.
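In practice this usually looks like the following sketch, where <job_id> is a placeholder for your job's ID; on newer Slurm versions the --overlap flag is needed so the extra shell can share the job's already-allocated resources:

squeue -u $USER                              # find the ID of your running job
srun --jobid=<job_id> --overlap --pty bash   # open a shell inside that job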

The nvidia-smi utility gives a brief overview of "how much of the GPU" your job is using. Pairing that with the watch command as below will rerun nvidia-smi on a one-second interval and provide a close-to-real-time view of GPU utilisation.

watch -n 1 nvidia-smi
CTRL + C to quit.
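If you prefer output that can be logged to a file instead of the full-screen watch display, nvidia-smi can also print selected fields in a loop; the field list below is just an example:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 5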

Below is a sample output of nvidia-smi. You can see the running process and its PID in the second table. The first table lists the GPUs by their ID, and the 'Volatile GPU-Util' column shows the GPU utilisation. If your code uses less than 50% of the GPU, you should try to improve the data-loading / CPU part of your code, as you are not using the full potential of the GPU.

[user@falcon1 ~]$ nvidia-smi 
Tue Jan  2 17:12:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   2  Tesla V100-PCIE-32GB            Off| 00000000:09:00.0 Off |                    0 |
| N/A   76C    P0               89W / 250W|  26198MiB / 32768MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    2   N/A  N/A     31678      C   python                                    26194MiB |
+---------------------------------------------------------------------------------------+