Cluster best practices¶
If in doubt, contact support¶
If in doubt, contact firstname.lastname@example.org for info.
The UTHPC team is here for a reason: to help you and your projects along and to provide as seamless a cluster experience as possible.
Profile and test your jobs¶
Profile and test your jobs by running a small sample set before sending thousands of jobs into the queue.
There might be issues with your initial jobs, such as small typos or script errors. These problems reveal themselves with a small sample set, and you won't congest the cluster or sacrifice your own fair-share priority.
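With Slurm job arrays this pattern is a one-line change. A minimal sketch, assuming a hypothetical batch script called my_job.sh and illustrative array sizes:

```shell
# Submit a small sample of array tasks first, so typos and script
# errors surface cheaply (my_job.sh is a hypothetical batch script).
sbatch --array=1-10 my_job.sh

# Once the sample output looks correct, submit the remaining tasks.
sbatch --array=11-5000 my_job.sh
```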
Don't use the head node for calculations¶
The head node is for facilitating interactions with the queue, and for data movement. Please don't run any kind of processing on the head node, as this negatively impacts other users.
This also applies if you, for example, need to compile software: use
srun for that.
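An interactive session on a compute node can be requested along these lines; the resource values here are illustrative assumptions, not UTHPC defaults:

```shell
# Ask Slurm for an interactive shell on a compute node,
# then compile there instead of on the head node.
srun --cpus-per-task=4 --mem=4G --time=00:30:00 --pty bash

# Once the shell opens on the compute node:
make -j 4    # hypothetical build command
```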
For persistent jobs don't use screen/srun¶
Both of these tools depend on the node you started the job from. This means that if the node you ran screen/srun on, for example
rocket.hpc.ut.ee, gets restarted, the job fails. And UTHPC nodes do get restarted quite often.
Instead, try to run as much as possible with
sbatch. This ensures that the job depends only on the compute node it runs on, and even if something happens to that node, the job gets restarted on another one.
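A minimal sbatch job script might look like the following sketch; the job name, resource values and workload are illustrative assumptions:

```shell
#!/bin/bash
#SBATCH --job-name=my-analysis   # hypothetical job name
#SBATCH --time=02:00:00          # adjust to your workload
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

# Everything below runs on the allocated compute node,
# independent of the login session that submitted the job.
python analyse.py                # hypothetical workload
```

Submit it with sbatch job.sh; you can then log out of the head node and the job keeps running.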
Don't compress/uncompress data in parallel¶
Please don't compress or uncompress data in parallel, and especially not across multiple nodes.
Compression tools, for example zip, tar and 7z, have very specific I/O fingerprints that don't work well with parallel file systems, like the ones used between UTHPC nodes. Running more than one of these compression processes at once doesn't really speed up the work, but it does cause read or write queues on the node, which can impact the whole cluster.
Instead, if you are in a hurry or have a massive amount of data, first copy the data to the
/tmp folder on a compute node and do your compression-related tasks there. Then move the data back to the parallel file system.
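The staging pattern above can be sketched as a shell snippet; all paths and file names are illustrative assumptions:

```shell
set -eu

# Stage data on node-local /tmp, compress there, and move only the
# finished archive back to shared storage (here: the current directory).
scratch=$(mktemp -d /tmp/compress.XXXXXX)    # node-local scratch space
mkdir -p "$scratch/dataset"
printf 'sample data\n' > "$scratch/dataset/part-001.txt"   # stand-in for real data

tar -czf "$scratch/dataset.tar.gz" -C "$scratch" dataset   # compress on local disk
mv "$scratch/dataset.tar.gz" ./dataset.tar.gz
rm -rf "$scratch"                                          # clean up the scratch copy
```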
Queues are normal¶
The cluster runs a queue system. Queues are normal.
Please don't get disheartened when your job doesn't get allocated instantly. Queue times are usually under a few hours, so jobs should start rather quickly. Some high-performance computing centers have queue times measured in weeks.
If you're in a hurry due to deadlines or some other reason, one option is to ask the UTHPC team to bump up your priority in the queue. If queueing is persistently a problem, you can also use the https://minu.etais.ee cloud to buy a virtual machine for your own use. It's more expensive, because the allocated resources are reserved for you alone around the clock, but there's no queue.
Clean temporary files¶
Finding yourself out of disk space quota? Clean temporary files.
Often enough, users generate hundreds of gigabytes or even terabytes worth of temporary files that fill up their quota, without cleaning the files up afterwards. Temporary files also count towards the quota, and if you hit your quota's hard limit during job execution, you won't be able to write data, which can cause job failures.
A better option is to use the
/tmp folder of the nodes for temporary data when possible. With multi-node jobs this might not be feasible, but a single-node job should be able to use
/tmp for its temporary data, and then delete it before the job finishes.
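One way to make the cleanup automatic is a trap in the job script, so the scratch directory is removed even if the job fails partway through. A sketch, in which the resource values, input file and tool name are illustrative assumptions:

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Node-local scratch directory, removed on exit no matter how the job ends.
scratch=$(mktemp -d /tmp/"${SLURM_JOB_ID:-scratch}".XXXXXX)
trap 'rm -rf "$scratch"' EXIT

cp input.dat "$scratch"/          # hypothetical input on shared storage
my_tool "$scratch"/input.dat      # hypothetical workload writing to $scratch
cp "$scratch"/output.dat ./       # copy only the results back
```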
Of course, if you need to keep the temporary files, the UTHPC team is always happy to sell you more quota.