GPU computing
Introduction
Important
Please make sure to read the Module and Submitting Jobs pages first.
Hardware
To view the available hardware, follow the Resources → GPUs link.
Using GPU nodes
GPU partition
There is one queue (partition) governing the ’gpu’ nodes; the job time limit is 8 days.
To view the latest details:
sinfo | grep gpu
And for a list of available GPU types:
gpu:tesla
gpu:a100-40g
gpu:a100-80g
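If you want to check which GRES types each node in the partition advertises, sinfo's format options can show them directly; one possible invocation (the format string is just an example) is:
sinfo -p gpu -o "%N %G"
Here %N prints the node names and %G the generic resources (GPUs) configured on them.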
GPU jobs
To use the GPUs in the GPU partition available on rocket, you should specify:
- The GPU partition.
- The number of GPUs you wish to use.
- All other resources you need. You can request them as usual (see Specifying Resources ).
#SBATCH --partition=gpu
#SBATCH --gres=gpu:<type>:<number_of_GPUs_you_want_to_use>
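For example, a minimal batch script could look like the following sketch (the job name, CPU, memory, and time values and the module line are placeholders; replace them with what your workload actually needs):
#!/bin/bash
#SBATCH --job-name=gpu-example     # placeholder job name
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1         # one Tesla GPU; any of the types listed above works
#SBATCH --cpus-per-task=4          # example CPU request
#SBATCH --mem=16G                  # example memory request
#SBATCH --time=1-00:00:00          # 1 day; must stay within the 8-day limit

module load cuda                   # module name is an example; check 'module avail'
nvidia-smi                         # replace with your own GPU application
Submit it as usual with sbatch <script_name>.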
Note
Asking for a GPU is exclusive: once Slurm allocates a GPU to your job, no-one else can use that GPU until your job has finished.
You can also start an interactive session with a graphics card under the same conditions as described under Interactive jobs. For instance:
srun -p gpu --gres gpu:tesla:1 --pty bash
Important
This is only for quick testing, not for running actual jobs. While interactive sessions are convenient for setting things up, you have to keep the session open for the full length of the job; if the machine or connection you are running from is interrupted, your job is affected too. With an sbatch script you don't need to keep a session open.
GPU tools
The GPU nodes already have CUDA and cuDNN pre-installed. If your application needs different versions, the module system has alternative versions available.
module avail <software_name>
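For instance, to see which CUDA toolkit versions the module system offers and load one of them (the version below is only illustrative; go by what module avail reports on the cluster):
module avail cuda
module load cuda/11.7.0    # example version; pick one listed by 'module avail cuda'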
Please remember that not all software you might want to use has GPU support, and if it does, it might not run in GPU-mode automatically. If you need help with installing additional GPU-compliant software not yet available on the cluster, please contact support@hpc.ut.ee .
Warning
Running processes on the GPUs outside the queue system is strictly forbidden. This is to prevent people from accidentally disrupting other people's jobs. Such processes are killed immediately, and doing so may result in revocation of UTHPC permissions.
GPU monitoring
To interactively monitor GPU usage while a job is running, ssh into the GPU node in question and run the following command:
watch -n 1 nvidia-smi
watch updates the GPU usage at a 1-second interval. The second table shows the running processes with their PIDs. The first table lists the GPUs by their ID; the ’Volatile GPU-Util’ column shows the GPU utilisation. If your code uses less than 50% of the GPU, try to improve the data loading / CPU part of your code, as you aren't using the full potential of the GPU.
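If you would rather record utilisation from inside the job itself instead of watching interactively, nvidia-smi can write periodic snapshots to a file; a minimal sketch (the field list, 5-second interval, and log file name are arbitrary choices):
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_usage.log &
The trailing & puts the logger in the background so your actual workload can run afterwards; it stops when the job ends.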
Created: 2022-04-28