Job limits and queueing

Introduction

Every job submitted to the scheduling system is subject to resource limits. This is necessary to prevent, for example, one user from taking up all the resources of the entire cluster for an indefinite time period. In general, UTHPC tries to specify the limits based on the following guidelines:

  • The number of long-running jobs should be minimal – long-running jobs that don't have checkpoints and can't resume work in the event of an interruption, such as a system failure, are generally inefficient and error-prone, and can easily result in a significant amount of wasted CPU time.
  • Shorter jobs should have greater priority – short jobs allow the scheduling system to place the work of different users on available resources more optimally, thereby increasing the overall throughput of the cluster.
  • Parallel jobs have greater priority – since parallel jobs are essentially what compute clusters are for, they get prioritized as much as possible. Since they can also be difficult to test and debug, and may need multiple restarts, for example due to a changed parameter, this prioritization also shortens their turnaround time.
  • Fair sharing of resources – one of the main goals of the scheduling system is to ensure that all users get allocated an equal share of the available resources. For example, when someone has already run a lot of jobs, the scheduling system pushes their still-pending jobs slightly down the queue so that others get their fair share of the available resources.

In an effort to adhere to these principles, the limits below have been set on the different scheduling queues.

Note

All of the limits can be changed. If you have a good reason to change some of these limits, please write to support@hpc.ut.ee.

Rocket cluster

  • Unless specified otherwise, all jobs get allocated 1 node from the main partition, along with 1 CPU core and 2 GB of memory.
  • The Rocket cluster has five distinct partitions (main, listed below, is an alias for the AMD partition). To run jobs, you need to specify the desired partition and a time limit, as shown in the example batch script after the table.
| Partition | MaxWalltime          | Nodes                        | CPU cores | Comments                                     |
|-----------|----------------------|------------------------------|-----------|----------------------------------------------|
| testing   | 2:00:00 (2 hours)    | ares20                       | 128       | Only for short testing jobs.                 |
| Intel     | 8-00:00:00 (8 days)  | bfr1-4, sfr1-12              | 640       | Nodes using Intel Xeon Gold 6138 processors. |
| long      | 30-00:00:00 (30 days)| bfr3-4, sfr9-12              | 240       | Long-running jobs.                           |
| GPU       | 8-00:00:00 (8 days)  | falcon1-6, pegasus, pegasus2 | 832       | Jobs utilizing GPUs.                         |
| AMD       | 8-00:00:00 (8 days)  | ares1-20, artemis1-20        | 5120      | Nodes using AMD EPYC 7702 processors.        |
| main      | 8-00:00:00 (8 days)  | ares1-20, artemis1-20        | 5120      | Alias for the AMD partition.                 |
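As an example, here is a minimal batch script that requests a partition and a time limit explicitly and overrides the default allocation of 1 CPU core and 2 GB of memory. The job name, resource values, and program name are only illustrative; adjust them to your own workload.

#!/bin/bash
#SBATCH --job-name=example-job      # illustrative job name
#SBATCH --partition=testing         # one of the partitions from the table above
#SBATCH --time=01:30:00             # must stay below the partition's MaxWalltime
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4           # overrides the default of 1 CPU core
#SBATCH --mem=8G                    # overrides the default of 2 GB of memory

srun ./my_program                   # illustrative executable

Save the script, for example as example-job.sh, submit it with sbatch example-job.sh, and follow its state in the queue with squeue -u <username>.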

As the table shows, there is some overlap between the partitions. The CPU cores column shows the maximum number of cores available to all jobs using that particular partition. You can also see information about the partitions by issuing the command:

scontrol show partition
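To look at a single partition, you can also pass its name to the same command, or use sinfo to list the nodes in it and their current state (the partition name main here is just an example):

scontrol show partition main
sinfo -p main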
By default, the maximum number of CPU cores per user is limited to 600. You can also see the individual limits for a user by issuing the command:
sacctmgr show association where user=<username>
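If you want to check how many cores your currently running jobs occupy against that limit, one way is to sum the per-job core counts reported by squeue (a sketch; replace <username> with your own user name):

squeue -u <username> -t RUNNING -h -o "%C" | awk '{sum += $1} END {print sum}'

Here -o "%C" prints the number of CPUs allocated to each running job and -h suppresses the header line.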