Monitoring jobs¶

Different Slurm commands provide information about jobs/job steps on different levels. The command squeue provides high-level information about jobs in the Slurm scheduling queue (state information, allocated resources, runtime, etc.). The command sstat provides detailed usage information about running jobs, and sacct provides accounting information about active and completed (past) jobs. The command scontrol provides even more detailed information about jobs and job steps.

The output format of most commands is highly configurable to your needs. Look for the --format or --Format options.

Most command options support a short form as well as a long form (for example -u <username> and --user=<username>). Since some options only support the long form, all instructions and examples in this documentation use the long form.

Sending email on job state change¶

Slurm can send an email to a given address when a job changes state, for example when it fails or completes. Configuring this requires two options in the Slurm job script:

--mail-user=<email address>
--mail-type=NONE,BEGIN,END,FAIL,REQUEUE,INVALID_DEPEND,TIME_LIMIT,TIME_LIMIT_80,TIME_LIMIT_90,TIME_LIMIT_50,ARRAY_TASKS,ALL

The mail type option accepts any number of the valid values, separated by commas. The values mean the following:

NONE - Don't send an email for any reason. Default value.
BEGIN - Send an email when the job starts.
END - Send an email when the job ends for any reason.
FAIL - Send an email when the job ends with a failure.
REQUEUE - Send an email when the job is ’requeued’.
INVALID_DEPEND - Send an email when the job depends on another, invalid job.
TIME_LIMIT - Send an email when the time limit is up.
TIME_LIMIT_80 - Send an email when 80% of the time limit is up.
TIME_LIMIT_90 - Send an email when 90% of the time limit is up.
TIME_LIMIT_50 - Send an email when 50% of the time limit is up.
ARRAY_TASKS - Send an email for each array task.
ALL - Shorthand for BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT. The TIME_LIMIT values and ARRAY_TASKS aren't included.
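
As a minimal sketch, the two options could appear in a job script as shown here. The job name, time limit, and email address are placeholders, not values from this cluster. Adjust the mail types to the events you care about.

```shell
#!/bin/bash
# Placeholder job name and time limit
#SBATCH --job-name=mail-demo
#SBATCH --time=00:10:00
# Placeholder address; notify on end, on failure, and at 80% of the time limit
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL,TIME_LIMIT_80

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "unknown" elsewhere.
job_label="job ${SLURM_JOB_ID:-unknown}"
echo "Starting ${job_label} on $(uname -n)"
```

Submit the script with sbatch as usual. The mail options take effect only when the script runs through Slurm.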

Info

Because every state change triggers a new email, the number of emails sent is very high, which hurts the global reputation of the sending server. Please either whitelist the sender address or keep in mind that the emails can end up in the ’Spam’ folder, especially when sent to a non-UT email address.

Command ’squeue’¶

Use the squeue command to get a high-level overview of all active jobs in the cluster. Active jobs are currently running or pending jobs.

Syntax:

squeue [options]

Common options:

--user=<user[,user[,...]]> - Request jobs from a comma separated list of users.
--jobs=<job_id[,job_id[,...]]> - Request to display specific jobs.
--partition=<part[,part[,...]]> - Request to display specific jobs from a comma separated list of partitions.
--states=<state[,state[,...]]> - Display jobs in specific states. Comma separated list or ’all’. Default is ’PD,R,CG’.

The default output format is as follows:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

Where:

JOBID - Job or step ID. For array jobs, the job ID format is in the form <job_id>_<index>.
PARTITION - Partition of the job/step.
NAME - Name of the job/step.
USER - Owner of the job/step.
ST - State of the job/step. See below for a description of the most common states.
TIME - Time used by the job/step. Format is days-hours:minutes:seconds. Days and hours are printed only if needed.
NODES - Number of nodes allocated to the job or the minimum number of nodes required by a pending job.
NODELIST(REASON) - For pending jobs: ’Reason why pending’. For failed jobs: ’Reason why failed’. For all other job states: ’List of allocated nodes’. See below for a list of the most common reason codes.
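
If you want to compare or sum the values from the TIME column in a script, it can help to convert them to seconds. The helper below is a sketch with a hypothetical name (slurm_time_to_seconds). It assumes the value follows the days-hours:minutes:seconds layout described above, with leading parts omitted when not needed.

```shell
# Convert a squeue TIME value such as 1-02:03:04, 02:03:04 or 5:08 to seconds.
# Fields are read from right to left: seconds, minutes, hours, days.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '{
        mult[0] = 1; mult[1] = 60; mult[2] = 3600; mult[3] = 86400
        total = 0
        for (i = NF; i >= 1; i--) {
            total += $i * mult[NF - i]
        }
        print total
    }'
}

slurm_time_to_seconds 1-02:03:04   # prints 93784
slurm_time_to_seconds 5:08         # prints 308
```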

You can easily tailor the output format of squeue to your own needs. Use the --format (-o) or --Format (-O) option to specify which job information to display. See the man page for more information: man squeue.
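
As an illustration, the following sketch keeps the default columns but widens the NAME column from 8 to 20 characters. The format string is based on the default documented in man squeue. The variable name squeue_format is only an example.

```shell
# Default columns, with NAME (%j) widened to 20 characters.
# %i job ID, %P partition, %j name, %u user, %t state, %M time used,
# %D number of nodes, %R node list or reason.
squeue_format="%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"

# Only call squeue where it is installed, so the snippet is harmless elsewhere.
if command -v squeue >/dev/null 2>&1; then
    squeue --user="$(id -un)" --format="$squeue_format"
fi
```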

Job states¶

During its lifetime, a job passes through several states. The most common states are ’PENDING’, ’RUNNING’, ’SUSPENDED’, ’COMPLETING’, and ’COMPLETED’.

PD - Pending. Job is waiting for resource allocation.
R - Running. Job has an allocation and is running.
S - Suspended. Execution has been suspended and resources have been released for other jobs.
CA - Cancelled. Job was explicitly cancelled by the user or the system administrator.
CG - Completing. Job is in the process of completing. Some processes on some nodes may still be active.
CD - Completed. Job has terminated all processes on all nodes with an exit code of zero.
F - Failed. Job has terminated with non-zero exit code or other failure condition.
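
To get a quick overview of how many of your jobs are in each state, you can print only the state column and count the codes. This is a sketch. The helper name count_states is hypothetical, and it relies on the --noheader option and the %t (compact state) format field of squeue.

```shell
# Count state codes read from standard input, one code per line.
# Output: one line per state, in the form "<state> <count>".
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Only call squeue where it is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue --user="$(id -un)" --noheader --format="%t" | count_states
fi
```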

Why is my job still pending?¶

The ’REASON’ column of the squeue output gives you a hint why your job is pending and not running.

’(Resources)’

The job is waiting for the resources it requested to become available.

’(Priority)’

The job isn't allowed to run yet because at least one higher-priority job is waiting for resources. This means the job is waiting in the queue.

’(Dependency)’

The job is waiting for another job to finish first, because it was submitted with the --dependency=... option.

’(DependencyNeverSatisfied)’

The job is waiting for a dependency that can never be satisfied. Such a job stays pending forever. Please cancel such jobs.

’(ReqNodeNotAvail, UnavailableNodes:...)’

Some node required by the job is currently not available. The node may currently be in use, reserved for another job, in an advance reservation, ’DOWN’, ’DRAINED’, or not responding. Most probably there is an active reservation for all nodes due to an upcoming maintenance downtime and your job isn't able to finish before the start of the downtime. In that case, your job starts after the downtime has finished. This is another reason to specify the duration of a job with --time as accurately as possible. You can list all active reservations using scontrol show reservation.
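
The effect of the --time value on a job blocked by a reservation can be sketched with simple arithmetic. The helper below is a simplified model, not the scheduler's actual logic: it only checks whether a job that started now would reach its time limit before the reservation begins. The function name and the timestamps are hypothetical. All values are in seconds.

```shell
# Succeed if a job starting at now_epoch with the given time limit
# would end at or before the start of the reservation.
fits_before_reservation() {
    now_epoch=$1
    limit_seconds=$2
    reservation_start_epoch=$3
    [ $(( now_epoch + limit_seconds )) -le "$reservation_start_epoch" ]
}

# Reservation starts 4000 seconds from "now" (1000).
if fits_before_reservation 1000 3600 5000; then
    echo "a 1-hour limit fits before the reservation"
fi
if ! fits_before_reservation 1000 7200 5000; then
    echo "a 2-hour limit does not fit; the job waits until after the downtime"
fi
```

In this model, lowering an overly generous time limit is what lets the job start before the downtime.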

Can't submit further jobs?¶

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

This error means that you have reached the maximum number of jobs allowed in a specific partition.

Examples¶

List all pending and running jobs of user ’foo’:

squeue --user=foo --states=PD,R

List all currently running jobs of user ’foo’ in partition ’bar’:

squeue --user=foo --partition=bar --states=R

Command ’scontrol’¶

Use the scontrol command to see more detailed information about a job.

Syntax:

scontrol [options] [command]

Examples¶

Show detailed information about job with ID ’500’:

scontrol show jobid 500

Show even more detailed information about job with ID ’500’. The more detailed output also includes the job script:

scontrol -dd show jobid 500
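
The output of scontrol show consists of Key=Value pairs, so a single field can be picked out in a script. The helper below is a sketch with a hypothetical name (scontrol_field). It assumes the value contains no spaces, which holds for fields such as JobState and Reason but not for every field.

```shell
# Print the value of one Key=Value pair read from standard input.
scontrol_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        $1 == key { print substr($0, length(key) + 2); exit }'
}

# Only call scontrol where it is installed; job ID 500 is the example ID used above.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show jobid 500 | scontrol_field JobState
fi
```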

Command ’sacct’¶

Use the sacct command to query information about past jobs.

Syntax:

sacct [options]

Common options:

--endtime=<end_time> - Select jobs in any state before the specified time.
--starttime=<start_time> - Select jobs in any state after the specified time.
--state=<state[,state[,...]]> - Select jobs based on their state during the given time period. By default, the start and end times are both set to the current time, so you see only currently running jobs.
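
As a sketch of how these options combine, the following example asks for jobs that failed or timed out within a time window and then separates the two states locally. The dates and the helper name jobs_in_state are hypothetical. The --parsable2 and --noheader options make sacct print plain fields separated by ’|’, which are easier to process in a script.

```shell
# Print the job IDs (first field) of all lines whose state (second field)
# matches the given state. Input is sacct --parsable2 output.
jobs_in_state() {
    awk -F'|' -v state="$1" '$2 == state { print $1 }'
}

# Only call sacct where it is installed; the dates are placeholders.
if command -v sacct >/dev/null 2>&1; then
    sacct --starttime=2024-01-01 --endtime=2024-01-31 \
          --state=FAILED,TIMEOUT \
          --format=JobID,State,Elapsed \
          --parsable2 --noheader | jobs_in_state TIMEOUT
fi
```

Because both --starttime and --endtime are given, the --state filter applies to that window instead of the current time.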