Job limits and queueing

Introduction

Every job submitted to the scheduling system is subject to resource limits. This is necessary to prevent, for example, one user from taking up all the resources of the entire cluster for an indefinite time period. In general, UTHPC tries to specify the limits based on the following guidelines:

  • The number of long-running jobs should be minimal – long-running jobs that don't have checkpoints and can't resume work in the event of an interruption, such as a system failure, are generally inefficient and error-prone, and can easily result in a significant amount of wasted CPU time.
  • Shorter jobs should have greater priority – short jobs allow the scheduling system to place the work of different users on available resources more optimally, thereby increasing the overall throughput of the cluster.
  • Parallel jobs have greater priority – since parallel jobs are essentially what compute clusters are for, they get prioritized as much as possible. Since they can also be difficult to test and debug, and may need multiple restarts, for example due to a changed parameter, this prioritization also shortens their turnaround time.
  • Fair sharing of resources – one of the main goals of the scheduling system is to ensure that all users get allocated an equal share of the available resources. For example, when someone has already run a lot of jobs, the scheduling system pushes their still-pending jobs slightly down the queue so that others get their fair share of the available resources.

In an effort to adhere to these principles, the limits below have been set on the different scheduling queues.

Note

All of the limits can be changed. If you have a good reason to change some of these limits, please write to support@hpc.ut.ee.

Rocket cluster

  • Unless specified otherwise, all jobs get allocated 1 node from the main partition, along with 1 CPU core and 2 GB of memory.
  • The Rocket cluster has five distinct partitions (main, listed below, is an alias for the AMD partition). To run jobs, you need to specify the desired partition and a time limit, as shown in the example batch script after the table.
| Partition | MaxWalltime          | Nodes                        | CPU cores | Comments                                     |
|-----------|----------------------|------------------------------|-----------|----------------------------------------------|
| testing   | 2:00:00 (2 hours)    | ares20                       | 128       | Only for short testing jobs.                 |
| Intel     | 8-00:00:00 (8 days)  | bfr1-4, sfr1-12              | 640       | Nodes using Intel Xeon Gold 6138 processors. |
| long      | 30-00:00:00 (30 days)| bfr3-4, sfr9-12              | 240       | Long-running jobs.                           |
| GPU       | 8-00:00:00 (8 days)  | falcon1-6, pegasus, pegasus2 | 832       | Jobs utilizing GPUs.                         |
| AMD       | 8-00:00:00 (8 days)  | ares1-20, artemis1-20        | 5120      | Nodes using AMD EPYC 7702 processors.        |
| main      | 8-00:00:00 (8 days)  | ares1-20, artemis1-20        | 5120      | Alias for the AMD partition.                 |
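As an example, here is a minimal batch script that requests a partition and a time limit explicitly and overrides the default allocation of 1 CPU core and 2 GB of memory. The job name, resource values, and program name are only illustrative; adjust them to your own workload.

#!/bin/bash
#SBATCH --job-name=example-job      # illustrative job name
#SBATCH --partition=testing         # one of the partitions from the table above
#SBATCH --time=01:30:00             # must stay below the partition's MaxWalltime
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4           # overrides the default of 1 CPU core
#SBATCH --mem=8G                    # overrides the default of 2 GB of memory

srun ./my_program                   # illustrative executable

Save the script, for example as example-job.sh, submit it with sbatch example-job.sh, and follow its state in the queue with squeue -u <username>.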

As the table shows, there is some overlap between the partitions. The CPU cores column shows the maximum number of cores available to all jobs using that particular partition. You can also see information about the partitions by issuing the command:

scontrol show partition
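To look at a single partition, you can also pass its name to the same command, or use sinfo to list the nodes in it and their current state (the partition name main here is just an example):

scontrol show partition main
sinfo -p main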
By default, the maximum number of CPU cores per user is limited to 600. You can also see the individual limits for a user by issuing the command:
sacctmgr show association where user=<username>
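If you want to check how many cores your currently running jobs occupy against that limit, one way is to sum the per-job core counts reported by squeue (a sketch; replace <username> with your own user name):

squeue -u <username> -t RUNNING -h -o "%C" | awk '{sum += $1} END {print sum}'

Here -o "%C" prints the number of CPUs allocated to each running job and -h suppresses the header line.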