Falcon - Mid-Range NVIDIA GPU
Falcon came online in 2024 and has 52 nodes, 3,328 CPU cores, 26 TB RAM, and 208 GPUs (128 NVIDIA A30 GPUs and 80 NVIDIA L40S GPUs).
Overview
|  | L40S Nodes | A30 Nodes | Totals |
|---|---|---|---|
| Vendor | Dell PowerEdge R760XA | Dell PowerEdge R760XA | |
| GPUs | 4x NVIDIA L40S | 4x NVIDIA A30 | 208 GPUs |
| GPU interconnect | not available | available | |
| NVIDIA Compute Capability | 8.9 | 8.0 | |
| CPU | Intel "Sapphire Rapids" Xeon Platinum 8462Y+ 2.80GHz | Intel "Sapphire Rapids" Xeon Platinum 8462Y+ 2.80GHz | |
| Nodes | 20 | 32 | 52 |
| Cores/Node | 64 | 64 | |
| Memory (GiB)/Node | 512 | 512 | 26.6 TB |
| GPU Memory/Node | 192 GB (48 GB/GPU) | 96 GB (24 GB/GPU) | 6,912 GB |
| Local Disk | 1.7 TB NVMe drive for local scratch | 1.7 TB NVMe drive for local scratch | |
| Node Interconnect | 200 Gbps NDR InfiniBand | 200 Gbps NDR InfiniBand | |
| Intra-node GPU Interconnect | none | pairwise NVLink 600 GB/s: GPUs 0-1 and 2-3 | |
| Total Memory | 10,240 GB | 16,384 GB | |
| Total Cores | 1,280 | 2,048 | 3,328 |
| Theoretical Peak | 7,328 TFLOPS FP32 (no FP64 support) | 665.6 TFLOPS FP64, 1,331.2 TFLOPS FP32 | |
Get Started
Falcon can be accessed via one of the two login nodes:
falcon1.arc.vt.edu
falcon2.arc.vt.edu
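For example, you can connect with ssh; `yourpid` below is a placeholder for your VT username:

```bash
# Connect to a Falcon login node (yourpid is a placeholder for your VT username)
ssh yourpid@falcon1.arc.vt.edu
```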
For testing purposes, all users will be allotted 240 core-hours each month in the "personal" allocation. Researchers at the PI level are able to request resource allocations in the "free" tier (usage fully subsidized by VT).
To do this:
1. Log in to the ARC allocation portal https://coldfront.arc.vt.edu
2. Select or create a project
3. Click the "+ Request Resource Allocation" button
4. Choose the "Compute (Free) (Cluster)" allocation type
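Once an allocation is approved it becomes a Slurm account you can charge jobs against. As a quick check of which accounts are available to you, a minimal sketch using standard Slurm tooling:

```bash
# List the Slurm accounts (allocations) and QOS available to your user
sacctmgr show associations user=$USER format=Account%25,QOS%40
```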
Policies
Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage:
Policies for Main Usage Queues/Partitions
The l40s_normal_q and a30_normal_q partitions (queues) handle the bulk of utilization on the Falcon cluster.
|  | l40s_normal_q | a30_normal_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 18 | 30 |
| MaxRunningJobs (User) | 12 | 12 |
| MaxSubmitJobs (User) | 24 | 24 |
| MaxRunningJobs (Allocation) | 24 | 24 |
| MaxSubmitJobs (Allocation) | 48 | 48 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 6 days | 6 days |
| Priority (QoS) | 1,000 | 1,000 |
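As an illustration, a batch job targeting one of these partitions might start with a header like the following sketch; the account name is a placeholder for your own allocation:

```bash
#!/bin/bash
#SBATCH --account=yourallocation     # placeholder: your Slurm account (allocation) name
#SBATCH --partition=l40s_normal_q    # or a30_normal_q
#SBATCH --gres=gpu:1                 # one GPU
#SBATCH --cpus-per-task=16           # stay within 1-16 cores per GPU (see bindings section below)
#SBATCH --time=2-00:00:00            # must be <= the 6-day MaxWallTime

module reset
nvidia-smi                           # confirm the allocated GPU is visible
```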
Policies for Development and Alternative Usage Queues/Partitions
The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.
|  | l40s_dev_q | a30_dev_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 20 | 32 |
| MaxRunningJobs (User) | 2 | 2 |
| MaxSubmitJobs (User) | 4 | 4 |
| MaxRunningJobs (Allocation) | 4 | 4 |
| MaxSubmitJobs (Allocation) | 8 | 8 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 2 hours | 2 hours |
| Priority (QoS) | 2,000 | 2,000 |
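For quick testing in the dev queues, an interactive session is often more convenient. A sketch using salloc (the account name is a placeholder):

```bash
# Request a 2-hour interactive session with one A30 GPU in the dev queue
salloc --partition=a30_dev_q --account=yourallocation \
       --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --time=2:00:00
```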
Recommended Uses
The nodes selected for this cluster are intended to provide broad utility for a wide range of GPU-enabled applications. While the CPUs on this cluster are excellent, CPU-only applications will probably be better served by our CPU-centric resources, such as the Owl cluster.
The L40S GPUs should provide excellent AI/ML inference and training capability for models which can either fit onto a single GPU’s 48GB of device memory or be spread across multiple GPUs. The lack of support for double-precision floating-point arithmetic (FP64) on the GPUs, however, makes them unsuitable for most traditional “HPC” applications.
The A30 nodes do support FP64, making them ideal for GPU-accelerated applications such as fluid dynamics, computational chemistry, and multiphysics simulations. They are also capable of AI/ML inference and training for smaller models; with 24GB of device memory and an updated architecture, they should provide performance for these applications comparable to V100 GPUs.
Changes compared to previous clusters
Use miniconda modules instead of Anaconda
Anaconda's terms of service restrict the use of their distribution and their curated set of packages, which form the default channel that conda searches when building software environments. However, the main research utility remains freely available under open licenses: this includes the conda package and environment manager itself and the conda-forge channel, which provides the vast majority of packages needed for academic and research applications.
miniconda provides full functionality with existing environments and the same set of core tools, and in our tests it generally performs faster as well. We therefore highly recommend switching from Anaconda to miniconda; ARC will no longer provide a centrally installed module for Anaconda.
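As a rough sketch of the workflow (the exact module name and version on Falcon may differ; check with `module spider miniconda`):

```bash
# Load the Miniconda module and build an environment from the conda-forge channel
module load Miniconda3            # module name/version is an assumption; verify with `module spider`
conda create --name myenv --channel conda-forge python=3.11 numpy scipy
conda activate myenv              # or `source activate myenv`, depending on your shell setup
```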
/scratch: high-performance scratch storage
We are re-establishing naming conventions for some standard storage systems on our clusters. The scratch file system on Falcon is mounted at /scratch. Earlier clusters used /globalscratch to distinguish this storage target from the /localscratch devices on individual compute nodes and to indicate that the filesystem is available anywhere on the cluster. But /globalscratch is not "global" in the multi-cluster sense, so the prefix has been dropped.
On Falcon, use "/scratch" instead of "/globalscratch".
On Falcon, /scratch is a high-performance, flash-based, shared scratch storage system accessible via the InfiniBand interconnect on all compute nodes. It has a 90-day aging policy, which means that files older than 90 days are subject to automatic deletion. This makes it an excellent location for staging jobs and their data.
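For example, a job might stage its inputs onto /scratch before running. The per-user directory layout below (/scratch/$USER) is an assumption; adjust to however your directory is provisioned:

```bash
# Stage input data onto the flash-based scratch file system before running
mkdir -p /scratch/$USER/myproject
cp -r $HOME/myproject/inputs /scratch/$USER/myproject/
cd /scratch/$USER/myproject
```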
Single Software Stack for Two Node Types
When ARC provides software via modules on clusters, we usually end up with several separate software "stacks" for the various node types, so that each stack can be optimized for the node architecture where it will be used. But both node types on Falcon share the same OS, GPU drivers, and CPU microarchitecture, so we have deployed a single application stack, which should enable a more seamless experience.
Slurm GPU to CPU bindings
Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
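You can inspect this mapping yourself on a compute node; nvidia-smi reports the bus IDs along with CPU core and NUMA node affinity for each device:

```bash
# Print the GPU topology matrix, including CPU core and NUMA node affinity
nvidia-smi topo -m
```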
If we do not inform Slurm of this affinity, then nearly all jobs will have reduced performance due to misalignment of allocated cores and GPUs. By default, these cores will be preferred by Slurm for scheduling with the affiliated GPU device, but other arrangements are possible.
Use the option --gres-flags=enforce-binding to require Slurm to allocate the affiliated CPU core(s) along with the corresponding GPU device(s).
The option --gres-flags=disable-binding is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:
Do this: --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding
which allocates 1 GPU, the associated 16 CPU cores, and 128GB of system memory.
Do not do this: --gres=gpu:1 --exclusive
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.
Do not do this: --gres=gpu:1 --ntasks-per-node=32
which allocates 256GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPU cores that have affinity to a different GPU. That other GPU is still available to other jobs, but it can only run with diminished performance because some of its affiliated cores are occupied.
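Putting this together, a minimal batch script following the recommended per-GPU request might look like the sketch below; the account name and application command are placeholders:

```bash
#!/bin/bash
#SBATCH --account=yourallocation        # placeholder: your Slurm account name
#SBATCH --partition=a30_normal_q        # or l40s_normal_q
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16              # the 16 cores affiliated with the requested GPU
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding    # keep the allocated cores aligned with the GPU
#SBATCH --time=12:00:00

module reset
srun python train.py                    # placeholder for your GPU application
```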