Slurm Workload Manager

Slurm is an open-source resource manager and job scheduler, originally developed at the Livermore Computing Center and now installed on many of the Top500 supercomputers, including XStream.

Submit a job

Slurm supports a variety of job submission techniques. By accurately requesting only the resources you actually need, your jobs spend less time waiting in the queue and the cluster is used more efficiently.

A job consists of two parts: resource requests and job steps. Resource requests specify the number of CPUs and GPUs, the expected run time, the amount of memory, and so on. Job steps describe the tasks to perform, that is, the software to run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose first comments, if they are prefixed with #SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch).

Slurm will ignore any #SBATCH parameters that appear after the first blank line, so always put your #SBATCH parameters at the top of your batch script.

The script itself is a job step. Other job steps are created with the srun command.

For instance, the following script, hypothetically named submit.sh,

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_%j.txt
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500
#SBATCH --gres gpu:1

srun hostname
srun sleep 60

would request one task with one CPU and one GPU for 10 minutes, along with 500 MB of RAM, in the default partition. When started, the job runs a first job step, srun hostname, which launches the hostname command on the node on which the requested CPU was allocated. Then a second job step starts the sleep command for 60 seconds.

Once the submission script is written properly, submit it to Slurm through the sbatch command, which, upon success, responds with the job ID assigned to the job.

$ sbatch submit.sh
Submitted batch job 4011

The job then enters the queue in the PENDING state. Once resources become available and the job has highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state, otherwise, it is set to the FAILED state.

Upon completion, the output file contains the result of the commands run in the script. In the above example, the %j in the --output option is replaced by the job ID, so the file is named res_4011.txt.
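
You can then display it like this (the hostname shown is only illustrative):

$ cat res_4011.txt
xs-0024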

Note that you can create an interactive job with the salloc command or by issuing an srun command directly.
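
As a minimal sketch (the options shown are illustrative), the following srun command requests one task and one GPU for 30 minutes and, thanks to --pty, attaches your terminal to a shell running on the allocated node:

$ srun --time=30:00 --ntasks=1 --gres gpu:1 --pty bash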

GPU IDs

When requesting GPUs with the --gres gpu:N option of srun or sbatch (not salloc), Slurm will set the CUDA_VISIBLE_DEVICES environment variable to the IDs of the GPUs allocated to the job. For instance, with --gres gpu:2 and depending on the current state of the node's GPUs, CUDA_VISIBLE_DEVICES could be set to "0,1", meaning that you will be able to use GPU 0 and GPU 1. Most applications automatically honor CUDA_VISIBLE_DEVICES and run on the allocated GPUs, but some don't and instead require GPU IDs to be set explicitly; in that case, pass them the IDs listed in CUDA_VISIBLE_DEVICES.
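
As an illustration (the actual IDs depend on which GPUs are free on the node), you can print the variable from within a job step:

$ srun --gres gpu:2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1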

GPU Compute Mode

The GPU compute mode on XStream is set to Exclusive Process by default. In this mode, only a single MPI rank or process can access the GPU context, although many threads within that process may still use it. In some cases, for example for debugging or profiling, the Default GPU compute mode is required instead: in Default mode, multiple MPI ranks or processes can access the same GPU. Unless you know what you are doing, the Default compute mode is not recommended on XStream.

To enable the Default GPU Compute Mode on XStream, please add the following Slurm constraint to your batch script:

#SBATCH -C gpu_shared

Note: The Default mode is currently required for Amber (PMEMD) when using peer-to-peer communication.
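
If you want to check which compute mode is in effect for your allocation, one possible way (shown here as a sketch, not XStream-specific guidance) is to query the driver from a job step with nvidia-smi; each GPU on the node is listed with its mode, and you should see Exclusive_Process unless the gpu_shared constraint was requested:

$ srun --gres gpu:1 nvidia-smi --query-gpu=index,compute_mode --format=csv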

Limits/QoS

The table below shows the current job limits per Slurm QoS:

Slurm QoS          Max CPUs         Max GPUs         Max Jobs        Max Nodes    Job time limits
normal (default)   320/user         256/user         512/user        16/user      Default: 2 hours
                   400/group        320/group                        20/group     Max: 2 days
long               20/user          16/user          4/user                       Default: 2 hours
                   80/group         64/group         64 max total                 Max: 7 days
                   200 max total    160 max total

To switch to the long QoS, please add --qos=long to your command or batch script.
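
For example, a minimal sketch of the relevant lines (the requested time is illustrative and must stay within the 7-day maximum of the long QoS):

#SBATCH --qos=long
#SBATCH --time=5-00:00:00

or, on the command line, sbatch --qos=long submit.sh.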

You may also retrieve the above scheduler limits with the following command:

$ sacctmgr show qos format=Name,Flags%30,GrpTRES%30,GrpJobs,MaxJobsPerUser,MaxSubmit,MaxTRESPerAccount%30,MaxTRESPerUser%30,MaxWall

Other job submission rules

To maximize GPU efficiency on XStream, the following rules apply:

  • A maximum CPU/GPU ratio of 5/4 (20/16) is enforced.
  • The default system memory per CPU is set to 12,000 MB.
  • The max system memory per CPU is set to 12,800 MB. If you request more memory, the CPU count will automatically be updated.
  • The max system memory per GPU is set to 16,000 MB. Unlike memory/CPU, the number of GPUs is not automatically updated when you request more memory.

If the above rules are not respected, you will likely get a message like the one in the following example:

$ srun -c 5 --gres gpu:1 command
srun: error: CPUs requested per node (5) not allowed with only 1 GPU(s); increase the number of GPUs to 4 or reduce the number of CPUs
srun: error: Unable to allocate resources: More processors requested than permitted

The following batch script will execute Torch on 5 CPUs and 4 GPUs on the same CPU socket, which is permitted on XStream:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_%j.txt
#
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --gres gpu:4
#SBATCH --gres-flags=enforce-binding

ml torch/20160805-4bfc2da protobuf/2.6.1  # load the Torch and protobuf modules (ml = module load)

th main.lua -resume ...

Cancel a job

Use the scancel JOBID command with the job ID of the job you want to cancel. If you want to cancel all of your jobs, type scancel -u USER. You can also, for instance, cancel all of your pending jobs with scancel -t PD.
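
For example (the job ID and USER are placeholders for your own values):

$ scancel 4011           # cancel a single job by its job ID
$ scancel -u USER        # cancel all of your jobs
$ scancel -t PD -u USER  # cancel only your pending jobs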

Gathering information

Job information

squeue

The squeue command shows the list of jobs which are currently running (in the RUNNING state, noted R) or waiting for resources (in the PENDING state, noted PD).

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 2252    normal     job2     mike PD       0:00      1 (Dependency)
 2251    normal     job1     mike  R 1-16:18:47      1 xs-0022

The above output shows one running job, named job1, with job ID 2251. The job ID is a unique identifier used by many Slurm commands when an action must be taken on one particular job; for instance, to cancel job1 you would use scancel 2251. TIME is how long the job has been running so far. NODES is the number of nodes allocated to the job, while NODELIST lists the nodes allocated to running jobs; for pending jobs, that column instead gives the reason why the job is pending.

As with the sinfo command (described below), you can choose what you want squeue to output with the --format parameter.
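
For instance, the following custom format roughly reproduces the default columns (the field selection is just an example; see the FORMAT section of man squeue for the full list of specifiers):

$ squeue --format="%.8i %.10P %.12j %.8u %.2t %.10M %.6D %R"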

scontrol show job

To get full details of a pending or running job, use scontrol show job JOBID [-dd].
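
For example, using the job ID from the squeue output above:

$ scontrol show job 2251
$ scontrol -dd show job 2251    # same, with extra detail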

sstat

You can get near-real-time information about your running job (memory consumption, etc.) with the sstat command (please see man sstat).
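
For example, the following sketch reports memory usage for the running job from the squeue output above (depending on the step you are interested in, you may need to name it explicitly, e.g. 2251.batch):

$ sstat -j 2251 --format=JobID,MaxRSS,MaxVMSize,AveCPU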

sacct

You can get the state of your finished jobs by running the following command:

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4011               test     normal       srcc          1  COMPLETED      0:0
4011.batch        batch                  srcc          1  COMPLETED      0:0
4011.0         hostname                  srcc          1  COMPLETED      0:0
4011.1            sleep                  srcc          1  COMPLETED      0:0

sacct is the command-line interface to the Slurm accounting database and has many options (please see man sacct). Here is an example that retrieves memory usage information for your recent jobs:

$ sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
       JobID    JobName   NTasks  NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ ---------- -------- --------- ---------- ---------- ---------- ----------
4011               test            xs-0024        16?        16?
4011.batch        batch        1   xs-0024      1496K    150360K      1496K    106072K
4011.0         hostname        1   xs-0024          0    292768K          0    292768K
4011.1            sleep        1   xs-0024       624K    292764K       624K    100912K

Resource information

sinfo

Slurm offers a few commands with many options you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post processing, or visualization.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 2-00:00:00      2  drain xs-[0054,0057]
normal*      up 2-00:00:00      4    mix xs-[0007,0051-53]
normal*      up 2-00:00:00     48  alloc xs-[0001-0006,0008-0009,0011-0050]
normal*      up 2-00:00:00     11   idle xs-[0010,0055-0056,0058-0065]

In the above example, we see a single partition, normal, which is the default partition as indicated by the asterisk. Here, 48 nodes of the normal partition are fully allocated, 4 are in the mix state (partially allocated), 11 are idle (available), and 2 are drained, which means a maintenance operation is taking place on them.

The sinfo command can also output the information in a node-oriented fashion with the -N argument. Combined with -l, it displays more information about the nodes: number of CPUs, memory, temporary disk (also called local scratch space), node features (such as processor type), and the reason, if any, for which a node is down.
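
For example, to get the detailed, node-oriented listing (optionally restricted to specific nodes with -n):

$ sinfo -N -l
$ sinfo -N -l -n xs-0022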

Node characteristics and Generic Resources (GRES)

Slurm associates with each node a set of Features and a set of Generic Resources (GRES). Features are immutable characteristics of the node (e.g. CPU model, CPU frequency), while generic resources are consumable: as users reserve them, they become unavailable to others (e.g. GPUs).

To list all node characteristics including GRES, you can use the following command:

$ scontrol show nodes

However, the output of this command is quite verbose. You can instead use sinfo to list the GRES of each node with specific output parameters, for example:

$ sinfo -o "%10P %8c %8m %11G %5D %N"
PARTITION  CPUS     MEMORY   GRES        NODES NODELIST
test       20       258374   gpu:k80:16  3     xs-[0007,0051,0058]
normal*    20       258374   gpu:k80:16  62    xs-[0001-0006,0008-0050,0052-0057,0059-0065]

On XStream, all compute nodes are identical, so no Features are set; only GRES matter for job allocation, as GPUs are handled there. GRES appear in the form resource:type:count. On XStream, resource is always gpu, type is always k80, and count is the number of logical K80 GPUs per node (16).
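
Since the type is known, you could also request GPUs by type explicitly in your batch script; on XStream this form (shown as an illustration) is equivalent to the plain gpu:N requests used earlier:

#SBATCH --gres gpu:k80:4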