
Job limits and queueing

Introduction

Every job submitted to the scheduling system is subject to resource limits. This is necessary to prevent, for example, one user from taking up all the resources of the entire cluster for an indefinite time period. In general, UTHPC tries to specify the limits based on the following guidelines:

  • The number of long-running jobs should be minimal - long-running jobs that don't have checkpoints and can't resume work in the event of an interruption, such as a system failure, are generally inefficient and error-prone, and can easily result in a significant amount of wasted CPU time.
  • Shorter jobs should have greater priority – short jobs allow the scheduling system to place the work of different users on available resources more optimally, therefore increasing the overall throughput of the cluster.
  • Parallel jobs have greater priority – since parallel jobs are essentially what compute clusters are for, they get prioritized as much as possible. They can also be difficult to test and debug, and may need multiple restarts, for example after changing a parameter, so this prioritization also shortens their turnaround time.
  • Fair sharing of resources – one of the main goals of the scheduling system is to ensure that all users get an equal share of the available resources. For example, when someone has already run a lot of jobs, the scheduling system pushes their still-pending jobs a little further down the queue to allow others to get their fair share of the available resources (you can inspect your own standing with the commands shown below).
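
If you want to see how your past usage currently affects your scheduling priority, you can query Slurm's accounting and priority tools. The sketch below is a minimal example; the exact columns and factors shown depend on how accounting is configured on the cluster:

# Show your own fair-share standing (normalized usage and the resulting FairShare factor)
sshare -U

# Show the priority factors (age, fair-share, partition, ...) of your pending jobs
sprio -u $USER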

In an effort to adhere to these principles, the limits below have been set on the different scheduling queues.

Note

All of the limits can be changed. If you have a good reason to change some of these limits, please write to support@hpc.ut.ee.

Rocket cluster

  • Unless specified otherwise, all jobs get allocated 1 node with 1 CPU core and 2 GB of memory.
  • Rocket cluster has the partitions listed in the table below. To run a job you need to specify the desired partition and a time limit (see the example job script after the table).
Partition | MaxWalltime           | Nodes             | CPU cores | Comments
testing   | 2:00:00               | stage1-stage2     | 40        | Only for short testing jobs.
Intel     | 8-00:00:00 (8 days)   | bfr1-4, sfr1-12   | 1600      | Main job queue.
long      | 30-00:00:00 (30 days) | stage75-stage131  | 1140      | Long-running jobs.
GPU       | 8-00:00:00 (8 days)   | falcon1-falcon2   | 96        | Jobs utilizing GPUs.
AMD       | 8-00:00:00 (8 days)   | ares1-ares20      | 5128      | Nodes using AMD EPYC 7702 processors.
main      | 8-00:00:00 (8 days)   | ares1-ares20      | 5128      | Nodes using AMD EPYC 7702 processors.
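
As an illustration only, a minimal batch script that selects a partition and a time limit, and overrides the default allocation of 1 CPU core and 2 GB of memory, could look like the following sketch (the job name, resource values and myprogram command are placeholders, not UTHPC recommendations):

#!/bin/bash
#SBATCH --job-name=example          # placeholder job name
#SBATCH --partition=main            # one of the partitions from the table above
#SBATCH --time=02:00:00             # wall time, must fit within the partition's MaxWalltime
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4           # override the 1-core default
#SBATCH --mem=8G                    # override the 2 GB default
#SBATCH --output=slurm-%j.out       # %j expands to the job ID

srun myprogram                      # placeholder for your actual program

The script would then be submitted with sbatch <scriptname>.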

As you can see, there is some overlap between the partitions. The CPU cores column shows the maximum number of cores available in total to jobs running in that particular partition. You can also see the partitions on the head node by issuing the command:

scontrol show partition
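
If you are only interested in one partition, scontrol also accepts a partition name; for example, for the main partition from the table above:

scontrol show partition main
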
By default, the limit on the number of CPU cores per user is 1000. You can also see the individual limits for a user by issuing the command:
sacctmgr show association where user=<username>
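
The default output of sacctmgr is quite wide; if needed, you can limit it to a few relevant columns with the format option (the column selection below is just one possible choice, and where exactly the per-user CPU limit shows up depends on how it has been configured):

sacctmgr show association where user=<username> format=User,Account,Partition,GrpTRES,MaxJobs,MaxWall,QOS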


Last update: 2023-08-14
Created: 2022-04-28