Slurm Overview and Quick Reference

SLURM is the “Simple Linux Utility for Resource Management”, an open-source software suite used by ARC and many other research computing centers. It provides several critical functions:

  • Resource management - monitoring cluster resources, allocating them to user workloads, and tracking their state

  • Scheduling workloads - accepting resource requests from users, managing a queue with dynamic job prioritization, and launching jobs on compute nodes

  • Accounting - tracking user and group usage, enforcing defined resource usage limits, and providing reporting functionality

Cluster Terminology

  • Cluster - A set of computing resources with centralized management.

  • Login node - A publicly accessible computer which serves as an entry point to access cluster resources. As a shared resource for many users, login nodes are not suitable for intensive workloads.

  • Compute node - A physical computer which is one of several or many identically configured computers in a cluster. Access to run workloads on compute nodes is usually controlled by a resource manager. Discrete sets of resources on compute nodes (e.g. CPUs, memory, GPUs) are usually made available exclusively to one job at a time.

  • Partition - A set of nodes which are grouped together to define a resource pool which is a target for jobs.

  • Job - A request for resources, usually to run a workload.

  • Queue - The list of jobs waiting for resources to become available.

Cluster Inspection and Status

  • sinfo - “View information about Slurm nodes and partitions.”
  • squeue - “View information about jobs located in the Slurm scheduling queue.”
  • scontrol - “View or modify Slurm configuration and state.”

sinfo has many options to provide different information. The -s option provides a concise summary of cluster partitions and their status:

[user@owl1 ~]$ sinfo -s
PARTITION     AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
normal_q*        up   infinite        81/0/1/82 owl[001-082]
dev_q            up   infinite        81/2/1/84 owl[001-084]
preemptable_q    up   infinite        81/2/1/84 owl[001-084]
largemem_q       up   infinite          2/0/0/2 owl-hm[001-002]
hugemem_q        up   infinite          1/0/0/1 owl-hm003
test_q           up   infinite          0/1/0/1 owltest01
interactive_q    up   infinite          0/4/0/4 owlmln[001-004]

This can help identify the cluster partitions and node statuses, where A/I/O/T stands for “Allocated/Idle/Other/Total”.
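
For more detail on a specific partition or node than sinfo’s summary, scontrol can print the full configuration and state; for example, using names from the listing above:

[user@owl1 ~]$ scontrol show partition normal_q   # full settings for one partition
[user@owl1 ~]$ scontrol show node owl001          # state, load, and features of one node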

More information about each cluster, the node types, partitions, and usage limits can be found on cluster resource pages.

Jobs: Requesting resources

Batch

  • sbatch - submit a job script to the queue; the job runs unattended when the requested resources become available
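
As a sketch, a minimal batch script might look like the following. The account name (myaccount) and script name are placeholders, the partition is taken from the sinfo listing above, and the job ID in the response is illustrative:

#!/bin/bash
#SBATCH --account=myaccount        # placeholder: your Slurm billing account (the only mandatory option)
#SBATCH --partition=normal_q
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0-01:00:00          # 1 hour, format D-HH:MM:SS

echo "Job $SLURM_JOB_ID running on $(hostname)"

[user@owl1 ~]$ sbatch myjob.sh
Submitted batch job 123456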

Interactive

  • salloc - request a resource allocation and run a shell or command interactively within it
  • srun - run a command or parallel task within a new or existing allocation
  • interact - ARC convenience wrapper for requesting an interactive job
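
A rough interactive equivalent, again with a placeholder account name (check the cluster resource pages for the partitions intended for interactive work; the job ID and hostname shown are illustrative):

[user@owl1 ~]$ salloc --account=myaccount --partition=interactive_q --ntasks=1 --cpus-per-task=4 --time=1:00:00
salloc: Granted job allocation 123457
[user@owl1 ~]$ srun hostname
owlmln001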

Most commonly used job configuration options

The three resource request commands (sbatch, salloc, and srun) share a common set of options which provide a plethora of ways to set up and configure jobs. The manuals provide exhaustive information, but here are the most commonly used options with brief explanations:

  • -A <name> or --account=<name> - name of the Slurm billing account. No default; this is the only mandatory option.
  • -N <#> or --nodes=<#> - how many nodes to allocate (default: 1). Extending jobs to multiple nodes requires software orchestration.
  • -p <name> or --partition=<name> - select the partition to use (default: normal_q).
  • -n <#> or --ntasks=<#> - how many concurrent tasks you want (no default). Not recommended for multi-node jobs.
  • --ntasks-per-node=<#> - number of concurrent tasks to expect on each node (default: 1). Provides better control than -n.
  • --cpus-per-task=<#> - number of cores to allocate to each task (default: 1). Affects task-to-CPU binding.
  • -t <spec> or --time=<spec> - maximum run time for the job, in the format D-HH:MM:SS (default: usually 30 min., but can vary by partition).
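
To illustrate how these options combine (a sketch with placeholder account and script names), the request below asks for 2 nodes with 8 tasks per node and 4 cores per task, i.e. 64 cores in total, for two hours:

[user@owl1 ~]$ sbatch -A myaccount -p normal_q -N 2 --ntasks-per-node=8 --cpus-per-task=4 -t 0-02:00:00 myjob.sh

The same options can be given on the command line, as here, or as #SBATCH directives inside the script; command-line values take precedence over directives in the script.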

Open OnDemand

When you use Interactive Apps in Open OnDemand, you are triggering precomposed batch jobs that are submitted to Slurm on your behalf.

Job status and control

  • squeue -u $USER - display your jobs which are currently pending or running
  • sacct - display accounting data for jobs in all states, but by default only today’s jobs
  • seff <jobid> - display job efficiency information for completed jobs
  • jobload <jobid> - display node-level resource usage information for a running job
  • sstat <jobid> - display job resource status for running job steps (advanced)
  • scontrol show job --detail <jobid> - show full job information for a pending or running job
  • scancel <jobid> - request that Slurm immediately terminate a running job
  • ssh <nodename> - make a direct SSH connection to a node where you have a running job
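
For example, checking on your queued work and cancelling a job (the job ID and output shown are illustrative):

[user@owl1 ~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456  normal_q myjob.sh     user  R       5:23      1 owl042
[user@owl1 ~]$ scancel 123456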

Accounting

  • quota - print summary information about all your active Slurm accounts and storage allocations
  • sacct -A <account> --start=YYYY-MM-DD -X - show all jobs run in the specified account since the specified date
  • showusage - print a summary of compute usage for your Slurm accounts
  • sshare - display fair-share and usage information for your Slurm accounts
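
A concrete sacct query might look like this; the account name and start date are placeholders, and -X limits the output to one line per job rather than one per job step:

[user@owl1 ~]$ sacct -A myaccount -X --start=2025-01-01 --format=JobID,User,Partition,Elapsed,State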

For reference, the full suite of Slurm commands:

sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, scrun, sdiag, seff, sgather, sh5util, sinfo, sjobexitmod, sjstat, smail, sprio, squeue, sreport, srun, sshare, sstat, strigger, sview