Investigating a job failure¶
There are several ways to investigate a job failure, depending on the type of failure and the information you're interested in.
It's important to collect error/output messages, either by writing such information to the default location or by specifying locations with the --error/--output
options in job files. Don't redirect the error/output streams to /dev/null unless you don't care about failures. Error and output messages are the starting point for investigating a job failure.
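For example, a minimal job-script header that writes the two streams to separate files (the job name, file names, and program below are only placeholders):

#!/bin/bash
#SBATCH --job-name=my-analysis        # placeholder job name
#SBATCH --output=slurm-%j.out         # standard output; %j expands to the job ID
#SBATCH --error=slurm-%j.err          # standard error, kept separate from output

my_program                            # placeholder for your actual command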
There are many reasons a job may fail:
- exceeding a resource allocation limit,
- a software error,
- less commonly, a node or node software failure.
Below is a guide on how to troubleshoot some of these causes of failure.
Exceeding resource limits¶
Each partition has a maximum allowed job runtime (see Cluster Partitions for more information). If you request more time than is allowed for the given partition, you won't be able to submit the job and will get a message similar to this:
sbatch: error: You have requested too much time or specified no time limit for your job to run on a partition.
Maximum for partition 'testing' is 120 minutes and you requested 19980 minutes
sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit)
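You can check each partition's maximum runtime before submitting. A minimal sketch using standard Slurm commands:

# Show the time limit (%l) for every partition (%P).
sinfo --format="%P %l"

# Request a runtime below that limit, for example 2 hours:
sbatch --time=02:00:00 job.sh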
If the job exceeded the requested --time (or -t) limit, the error or output file provides the appropriate information:
(...)
slurmstepd: error: *** JOB 15608257 ON stage68 CANCELLED AT 2021-03-05T10:33:01 DUE TO TIME LIMIT ***
(...)
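After the fact, you can compare the job's elapsed runtime against its limit in the Slurm accounting records. A minimal sketch, using the job ID from the message above:

# Elapsed runtime versus the requested time limit for a finished job.
sacct -j 15608257 --format=JobID,Elapsed,Timelimit,State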
If a job exceeded the requested memory limit, you are likely to see errors like the following:
(...)
slurmstepd: error: Job 15608257 exceeded memory limit (3940736 > 2068480), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 123456 ON stage1 CANCELLED AT 2021-03-02T10:21:37 ***
(...)
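In this case the fix is usually to request more memory. A minimal sketch: first check how much memory the job actually used, then resubmit with a higher limit (the 4G value below is just an example):

# Peak memory use (MaxRSS) versus the requested memory (ReqMem).
sacct -j 15608257 --format=JobID,MaxRSS,ReqMem,State

# Resubmit with a larger memory request, e.g. 4 GB:
sbatch --mem=4G job.sh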
Note
Note that the job limits aren't carved in stone. If you have a good reason to change some of these limits, please feel free to contact us.
Software errors¶
This section is under construction.
Software errors most commonly manifest as error messages from your runtime itself. There are hundreds of possibilities, ranging from kernel errors to input validation errors, and there is no simple trick for figuring out the type of error and how to fix it, especially if you aren't familiar with the specific software. Below is a small subset of the more common errors and how to approach them, but in most cases reading through the error message carefully is a good way to gain a general understanding.
Illegal instruction¶
Illegal instruction errors can manifest in different ways, most commonly in software compiled for a specific CPU architecture and run on a different one:
gcc -I"/storage/software/R/3.6.1/lib64/R/include" -DNDEBUG -I./lib/ -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c capture.c -o capture.o
gcc -I"/storage/software/R/3.6.1/lib64/R/include" -DNDEBUG -I./lib/ -I/usr/local/include -fvisibility=hidden -fpic -g -O2 -c export.c -o export.o
In file included from ./lib/rlang.h:81:0,
from export/exported.c:1,
from export.c:1:
./lib/c-utils.h: In function ‘r_ssize_as_double’:
./lib/c-utils.h:81:3: internal compiler error: Illegal instruction
"/var/spool/slurm/slurmd/job12345678/slurm_script: line 66: 1704 Illegal instruction (core dumped)
Shared library errors¶
Shared library errors usually involve one of the following error messages:
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found
rjcommon.h:11:21: fatal error: jpeglib.h: No such file or directory
mca: base: component_find: unable to open /storage/software/openmpi-1.7.3/lib/openmpi/mca_ess_pmi: libslurm.so.27: cannot open shared object file
error while loading shared libraries: libmpi_usempi.so.20: cannot open shared object file: No such file or directory
The missing files usually refer to system software libraries with faulty links, typically files with the .h or .so extension. It's advised to contact UTHPC support about these, as they usually point to a faulty software module.
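Before contacting support, you can list which libraries a binary fails to resolve. A minimal sketch, assuming your executable is called my_program:

# Unresolved shared libraries show up as 'not found'.
ldd ./my_program | grep 'not found'

# The directories currently searched for .so files:
echo "$LD_LIBRARY_PATH"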
No such file or directory¶
While this is a generic error message for missing files, it's useful to check whether the referenced file is part of your job's input dataset. Since filenames must match exactly, it's a good idea to validate that your script's input parameters are correct.
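A cheap way to catch this early is to validate the input path at the top of the batch script, before the main program starts. A minimal sketch with a hypothetical input file and program:

#!/bin/bash
#SBATCH --job-name=validate-input

INPUT="data/sample_001.txt"           # hypothetical input path

# Fail fast with a clear message if the input file is missing.
if [[ ! -f "$INPUT" ]]; then
    echo "ERROR: input file '$INPUT' not found" >&2
    exit 1
fi

my_program "$INPUT"                   # hypothetical program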
Disk quota exceeded¶
The general UTHPC home folder file system /gpfs/space enforces a 2 TB storage quota along with a maximum of 1 million files per user. To increase your quota, please contact UTHPC support. You can also view the status of your quota using the myquota tool on rocket.hpc.ut.ee:
[user@login1 ~] myquota
Current home quota for user user
Block(total filesize): 574.9GB/2.1TB
Inode(total filecount): 410.2K/1.0M
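If you hit the quota, the quickest fix is usually to remove or archive data. A minimal sketch for finding what uses the most space and files under your home folder, using only standard tools:

# Ten largest directories directly under your home folder.
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -h | tail -n 10

# File count per top-level directory (files count against the 1M inode limit).
for d in "$HOME"/*/; do printf '%8d  %s\n' "$(find "$d" -type f | wc -l)" "$d"; done | sort -n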