Slurm Overview and Quick Reference
Slurm is the “Simple Linux Utility for Resource Management” and is an open-source bundle of software used by ARC and many other research computing centers. It provides several critical functions:
Resource management - monitoring cluster resources, allocating them to user workloads, and keeping track of their states
Scheduling workloads - accepting resource requests from users, managing a queue with dynamic job prioritization, and launching jobs on compute nodes
Accounting - tracking user and group usage, enforcing defined resource usage limits, and providing reporting functionality
Cluster Terminology
Cluster - A set of computing resources with centralized management.
Login node - A publicly accessible computer which serves as an entry point to access cluster resources. As a shared resource for many users, login nodes are not suitable for intensive workloads.
Compute Node - A physical computer which is one of several or many identically configured computers in a cluster. Access to run workloads on compute nodes is usually controlled by a resource manager. Discrete sets of resources on compute nodes (e.g. CPUs, memory, GPUs) are usually made available exclusively to one job at a time.
Partition - A set of nodes which are grouped together to define a resource pool which is a target for jobs.
Job - A request for resources, usually to run a workload.
Queue - The list of jobs waiting for resources to become available.
Cluster Inspection and Status
command | scope
---|---
`sinfo` | View information about Slurm nodes and partitions
`squeue` | View information about jobs located in the Slurm scheduling queue
`scontrol` | View or modify Slurm configuration and state
`sinfo` has many options to provide different information. The `-s` option provides a concise list of cluster partitions and their status:
[user@owl1 ~]$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
normal_q* up infinite 81/0/1/82 owl[001-082]
dev_q up infinite 81/2/1/84 owl[001-084]
preemptable_q up infinite 81/2/1/84 owl[001-084]
largemem_q up infinite 2/0/0/2 owl-hm[001-002]
hugemem_q up infinite 1/0/0/1 owl-hm003
test_q up infinite 0/1/0/1 owltest01
interactive_q up infinite 0/4/0/4 owlmln[001-004]
This can help identify the cluster partitions and node statuses, where A/I/O/T stands for “Allocated/Idle/Other/Total”.
More information about each cluster, the node types, partitions, and usage limits can be found on cluster resource pages.
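For example, `scontrol` can display the full configuration of any partition or node reported by `sinfo`, and `squeue` can be restricted to a single partition (the partition and node names below are taken from the sinfo output above):
[user@owl1 ~]$ scontrol show partition normal_q
[user@owl1 ~]$ scontrol show node owl001
[user@owl1 ~]$ squeue -p normal_q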
Jobs: Requesting resources
Batch
command | scope
---|---
`sbatch` | Submit a batch script to Slurm for later execution
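A typical submission looks like the following, where myjob.sh is a placeholder for your own batch script and myaccount for your billing account; sbatch replies with the ID of the newly queued job:
[user@owl1 ~]$ sbatch --account=myaccount myjob.sh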
Interactive
command | scope
---|---
`salloc` | Obtain a Slurm job allocation interactively, then release it when the session ends
`srun` | Run a parallel job on the cluster; with `--pty`, start an interactive shell on a compute node
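For example, a short interactive session on a compute node might be requested like this (the account name, partition, core count, and time limit are illustrative; whether salloc places your shell on the compute node or on the login node depends on the cluster's configuration):
[user@owl1 ~]$ salloc --account=myaccount --partition=interactive_q --ntasks=1 --cpus-per-task=4 --time=1:00:00
[user@owl1 ~]$ srun --account=myaccount --partition=interactive_q --ntasks=1 --cpus-per-task=4 --time=1:00:00 --pty bash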
Most commonly used job configuration options
The three resource request commands (`sbatch`, `salloc`, and `srun`) share a common set of options which provide a plethora of ways to set up and configure jobs. The manuals provide exhaustive information, but here are the most commonly used options with brief explanations:
option | default | function | notes
---|---|---|---
`--account` | n/a | name of Slurm billing account | this is the only mandatory option
`--nodes` | 1 | how many nodes | extending jobs to multiple nodes requires software orchestration
`--partition` | the partition marked with `*` in `sinfo` output | select the partition to use |
`--ntasks` | n/a | how many concurrent tasks you want | not recommended for multi-node jobs
`--ntasks-per-node` | 1 | number of concurrent tasks to expect on each node | provides better control than `--ntasks` for multi-node jobs
`--cpus-per-task` | 1 | number of cores to allocate to each task | affects task-to-CPU binding
`--time` | usually 30 min., but can vary by partition | how long the job can run; format is `D-HH:MM:SS` |
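Put together in a batch script, these options look something like the sketch below; the account name, partition, resource amounts, and program are placeholders for illustration:
#!/bin/bash
#SBATCH --account=myaccount        # required: your Slurm billing account
#SBATCH --partition=normal_q       # partition to run in
#SBATCH --nodes=1                  # one node
#SBATCH --ntasks-per-node=1        # one task on that node
#SBATCH --cpus-per-task=8          # eight cores for the task
#SBATCH --time=2:00:00             # two hours of walltime

# commands to run inside the allocation go here
./my_program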
Open OnDemand
When you use Interactive Apps in Open OnDemand, you are triggering precomposed batch jobs: the same resource requests described above are assembled and submitted on your behalf.
Job status and control
command | scope
---|---
`squeue -u $USER` | display your jobs which are currently pending or running
`sacct` | display accounting data for jobs of all states, but by default only today's jobs
`seff <jobid>` | display job efficiency information for completed jobs
 | display node-level resource usage information for a running job
`sstat` | display job resource status for running job steps (advanced)
`scontrol show job <jobid>` | show full job information for a pending or running job
`scancel <jobid>` | request that Slurm immediately terminate a running job
`ssh <nodename>` | make a direct SSH connection to a node where you have a running job
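For example (the job ID 12345 and node name owl001 are placeholders):
[user@owl1 ~]$ squeue -u $USER
[user@owl1 ~]$ scontrol show job 12345
[user@owl1 ~]$ seff 12345
[user@owl1 ~]$ scancel 12345
[user@owl1 ~]$ ssh owl001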
Accounting
command | scope
---|---
 | print summary information about all your active Slurm accounts and storage allocations
`sacct -A <account> -S <date>` | show all jobs run in the specified account since the specified date
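For example, one way to list every job charged to an account since the start of the year is to use sacct's account and start-time filters (the account name and date are placeholders; -X collapses job steps into one line per job):
[user@owl1 ~]$ sacct -A myaccount -S 2025-01-01 -X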
All Slurm commands: sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, scrun, sdiag, seff, sgather, sh5util, sinfo, sjobexitmod, sjstat, smail, sprio, squeue, sreport, srun, sshare, sstat, strigger, sview