Falcon - Mid-Range NVIDIA GPU

Falcon came online in 2024 and has 52 nodes, 3,328 CPU cores, 26 TB RAM, and 208 GPUs (128 NVIDIA A30 GPUs and 80 NVIDIA L40S GPUs).

Overview

|                             | L40S Nodes                                                   | A30 Nodes                                                    | Totals   |
|-----------------------------|--------------------------------------------------------------|--------------------------------------------------------------|----------|
| Vendor                      | Dell PowerEdge R760XA                                        | Dell PowerEdge R760XA                                        |          |
| GPUs                        | 4x NVIDIA L40S                                               | 4x NVIDIA A30                                                | 208 GPUs |
| GPU interconnect            | not available                                                | available                                                    |          |
| NVIDIA Compute Capability   | 8.9                                                          | 8.0                                                          |          |
| CPU                         | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz   | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz   |          |
| Nodes                       | 20                                                           | 32                                                           | 52       |
| Cores/Node                  | 64                                                           | 64                                                           |          |
| Memory (GiB)/Node           | 512                                                          | 512                                                          | 26.6 TB  |
| GPU Memory/Node             | 192 GB (48 GB/GPU)                                           | 96 GB (24 GB/GPU)                                            | 6,912 GB |
| Local Disk                  | 1.7 TB NVMe drive for /localscratch                          | 1.7 TB NVMe drive for /localscratch                          |          |
| Node Interconnect           | 200 Gbps NDR InfiniBand                                      | 200 Gbps NDR InfiniBand                                      |          |
| Intra-node GPU Interconnect | none                                                         | pairwise NVLink 600 GB/s: 0-1 and 2-3                        |          |
| Total Memory                | 10,240 GB                                                    | 16,384 GB                                                    |          |
| Total Cores                 | 1,280                                                        | 2,048                                                        | 3,328    |
| Theoretical Peak            | 7,328 TFLOPS FP32 (no FP64 support)                          | 665.6 TFLOPS FP64, 1,331.2 TFLOPS FP32                       |          |

Get Started

Falcon can be accessed via one of the two login nodes:

  • falcon1.arc.vt.edu

  • falcon2.arc.vt.edu
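For example, a minimal SSH connection from a terminal might look like this (replace the username with your own VT PID):

```bash
# Log in to a Falcon login node with your VT PID
ssh yourpid@falcon1.arc.vt.edu
```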

For testing purposes, all users are allotted 240 core-hours each month in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT).

To do this, log in to the ARC allocation portal https://coldfront.arc.vt.edu,

  • select or create a project

  • click the “+ Request Resource Allocation” button

  • choose the “Compute (Free) (Cluster)” allocation type

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage:

Policies for Main Usage Queues/Partitions

The l40s_normal_q and a30_normal_q partitions (queues) handle the bulk of utilization on the Falcon cluster.

|                             | l40s_normal_q | a30_normal_q |
|-----------------------------|---------------|--------------|
| Node Type                   | L40S          | A30          |
| Number of Nodes             | 18            | 30           |
| MaxRunningJobs (User)       | 12            | 12           |
| MaxSubmitJobs (User)        | 24            | 24           |
| MaxRunningJobs (Allocation) | 24            | 24           |
| MaxSubmitJobs (Allocation)  | 48            | 48           |
| MaxGPUs (User)              | 40            | 40           |
| MaxGPUs (Allocation)        | 40            | 40           |
| MaxWallTime                 | 6 days        | 6 days       |
| Priority (QoS)              | 1,000         | 1,000        |
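As a sketch of a batch job that stays within these limits, the following script requests one L40S GPU in l40s_normal_q. The account name is illustrative; substitute the Slurm account tied to your own allocation.

```bash
#!/bin/bash
#SBATCH --partition=l40s_normal_q
#SBATCH --account=personal        # illustrative; use your allocation's Slurm account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1              # well under the 40-GPU per-user limit
#SBATCH --time=2-00:00:00         # must stay under the 6-day MaxWallTime

module reset
nvidia-smi                        # report the GPU assigned to this job
```

Submit the script with sbatch and check its status with squeue -u $USER.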

Policies for Development and Alternative Usage Queues/Partitions

The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.

|                             | l40s_dev_q | a30_dev_q |
|-----------------------------|------------|-----------|
| Node Type                   | L40S       | A30       |
| Number of Nodes             | 20         | 32        |
| MaxRunningJobs (User)       | 2          | 2         |
| MaxSubmitJobs (User)        | 4          | 4         |
| MaxRunningJobs (Allocation) | 4          | 4         |
| MaxSubmitJobs (Allocation)  | 8          | 8         |
| MaxGPUs (User)              | 40         | 40        |
| MaxGPUs (Allocation)        | 40         | 40        |
| MaxWallTime                 | 2 hours    | 2 hours   |
| Priority (QoS)              | 2,000      | 2,000     |
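For short testing sessions within the 2-hour dev limit, an interactive allocation is often the most convenient route; a minimal sketch (the account name is illustrative):

```bash
# Request one A30 GPU and 8 cores interactively for one hour in the dev queue
salloc --partition=a30_dev_q --account=personal \
       --ntasks=1 --cpus-per-task=8 \
       --gres=gpu:1 --time=1:00:00
```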

Changes compared to previous clusters

Use miniconda modules instead of Anaconda

Anaconda’s terms of service place restrictions on the use of their distribution and of their curated package set, which forms the default channel that conda searches when building software environments. However, the main research utility remains available under open licenses: this includes the conda package and environment manager itself and the conda-forge channel, which provides the vast majority of packages needed for academic and research applications.

Miniconda provides full functionality with existing environments and the same set of core tools, and in our tests it generally seems to perform faster as well. We therefore highly recommend switching from Anaconda to Miniconda; ARC will no longer provide a centrally installed module for Anaconda.
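A minimal sketch of the recommended workflow, assuming the module is named Miniconda3 (check module spider miniconda for the exact name and version available on Falcon):

```bash
# Load the Miniconda module (exact module name/version may differ)
module load Miniconda3

# Build an environment from conda-forge only, avoiding Anaconda's default channels
conda create --name myenv --override-channels --channel conda-forge python numpy

# Activate the environment (in batch scripts, `source activate myenv` may be needed instead)
conda activate myenv
```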

/scratch high-performance scratch storage

We are re-establishing naming conventions for some standard storage systems on clusters. The scratch file system on Falcon is mounted at /scratch. Earlier clusters used /globalscratch to distinguish this storage target from the /localscratch devices on individual compute nodes and to indicate that the filesystem is available from anywhere on the cluster. But /globalscratch is not “global” in the multi-cluster sense, so the prefix has been dropped.

On Falcon, use "/scratch" instead of "/globalscratch". 

On Falcon, /scratch is a high-performance, flash-based, shared scratch storage system accessible via the InfiniBand interconnect on all compute nodes. It has a 90-day aging policy, which means that files older than 90 days are subject to automatic deletion. This makes it an excellent location for staging data and running jobs.
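For example, a job might stage its inputs to /scratch, run there, and copy results back before the aging policy applies (all paths and the executable are illustrative):

```bash
# Stage input data from home/project storage onto high-performance scratch
mkdir -p /scratch/$USER/myjob
cp -r $HOME/myproject/inputs /scratch/$USER/myjob/

# Run the workload from scratch
cd /scratch/$USER/myjob
./run_analysis inputs/            # illustrative executable

# Copy results back; files older than 90 days on /scratch may be deleted
cp -r results $HOME/myproject/
```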

Single Software Stack for Two Node Types

When ARC provides software via modules on clusters, we usually end up with several separate software “stacks” for the various node types. This is done so that each stack can be optimized for the node architecture where it will be used. But both of the node types on Falcon share the same OS, GPU drivers, and CPU microarchitecture, so we have deployed a single application stack, which should enable a more seamless experience.
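In practice this means the same module commands behave identically on L40S and A30 nodes; for example (module names are illustrative, so use module spider to see what is actually installed):

```bash
module reset
module spider cuda      # list available versions of a package (name illustrative)
module load CUDA        # illustrative; the same stack is visible on both node types
```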


Slurm GPU to CPU bindings

Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:

| GPU device bus ID | GPU device      | NUMA node | CPU cores |
|-------------------|-----------------|-----------|-----------|
| 4a:00.0           | 0 - /dev/nvidia0 | 1        | 16-31     |
| 61:00.0           | 1 - /dev/nvidia1 | 0        | 0-15      |
| ca:00.0           | 2 - /dev/nvidia2 | 3        | 48-63     |
| e1:00.0           | 3 - /dev/nvidia3 | 2        | 32-47     |
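You can inspect this mapping yourself on a compute node; nvidia-smi’s topology report prints the CPU and NUMA affinity of each GPU:

```bash
# Show GPU <-> CPU/NUMA affinity (run on a Falcon compute node, e.g. inside a job)
nvidia-smi topo -m
```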

If we do not inform Slurm of this affinity, then nearly all jobs will have reduced performance due to misalignment of allocated cores and GPUs. By default, these cores will be preferred by Slurm for scheduling with the affiliated GPU device, but other arrangements are possible.

  • Use the option --gres-flags=enforce-binding to require Slurm to allocate affiliated CPU core(s) with the corresponding GPU device(s)

  • The option --gres-flags=disable-binding is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.

To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:

Do this: --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding, which allocates 1 GPU, the associated 16 CPU cores, and 128 GB of system memory.

Do not do this: --gres=gpu:1 --exclusive, which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.

Do not do this: --gres=gpu:1 --ntasks-per-node=32, which allocates 256 GB of system memory, one GPU device plus its 16 affiliated CPU cores, and 16 additional CPU cores that have affinity to a different GPU. That other GPU remains available to other jobs, but with some of its affiliated cores taken, it can only run with diminished performance.
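Putting the recommendations together, a complete batch script following the “do this” pattern might look like this sketch (the account name and executable are illustrative):

```bash
#!/bin/bash
#SBATCH --partition=a30_normal_q
#SBATCH --account=personal            # illustrative; use your allocation's Slurm account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16            # no more than the 16 cores affiliated with one GPU
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding  # keep the allocated cores aligned with the allocated GPU
#SBATCH --time=1-00:00:00

module reset
nvidia-smi                            # confirm which GPU and cores the job received
srun ./my_gpu_application             # illustrative executable
```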