ARC System Changes: 2025-05

In mid-May, ARC systems will be offline for regular maintenance, during which a number of major changes will be implemented. This page presents a brief outline of the changes, explains the impact they may have on your use of ARC systems, and provides a FAQ regarding these changes.

If you have questions about these changes, you can get more information or request help by submitting a request via https://arc.vt.edu/help.

Partition Changes

Consolidation

ARC clusters host a wide variety of resource types because Virginia Tech researchers have a wide variety of computational needs. But small, disconnected pools of resources create isolated pockets that alternate between very low and very high demand. By grouping resources into larger pools, jobs will have access to more resources, which can help them start faster.

  • All CPU-only partitions within a cluster will be combined into a single partition.

  • GPU partitions will be combined when the GPU devices are of the same model.

  • The Infer cluster will be shut down. Its V100 and T4 GPUs will move to Falcon.

  • dev_q partitions will be removed (see the QOS options below for priority scheduling of short jobs).

  • Users can request specific hardware with features (--constraint).

  • New QOS options will let users select a QOS with different walltime, priority, and resource limits.

| Cluster | Node Types | Partitions | Notes | User-selectable features |
|---|---|---|---|---|
| Falcon | A30 GPU nodes | a30_normal_q, a30_preemptable_q | a30_dev_q removed | n/a - homogeneous partitions |
| Falcon | L40S GPU nodes | l40s_normal_q, l40s_preemptable_q | l40s_dev_q removed | n/a - homogeneous partitions |
| Falcon | V100 GPU nodes | v100_normal_q, v100_preemptable_q | formerly part of the Infer cluster | n/a - homogeneous partitions |
| Falcon | T4 GPU nodes | t4_normal_q, t4_preemptable_q | formerly part of the Infer cluster | n/a - homogeneous partitions |
| Owl | AMD Zen4 “Genoa” nodes | normal_q, preemptable_q | dev_q removed | --constraint=avx512 |
| Owl | AMD Zen3 “Milan” large-memory nodes | normal_q, preemptable_q | formerly in largemem_q and hugemem_q | --mem=<size> larger than 768G |
| Tinkercliffs | AMD Zen2 “Rome” nodes | normal_q, preemptable_q | dev_q removed | --constraint=amd |
| Tinkercliffs | AMD Zen2 “Rome” large-memory nodes | normal_q, preemptable_q | formerly in largemem_q | --constraint=amd and --mem=<size> larger than 256G |
| Tinkercliffs | Intel “CascadeLake-AP” nodes | normal_q, preemptable_q | formerly in intel_q | --constraint=intel and --constraint=avx512 |
| Tinkercliffs | HPE 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | no change | --constraint=hpe-A100 |
| Tinkercliffs | Nvidia DGX 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | formerly in dgx_normal_q | --constraint=dgx-A100 |
| Tinkercliffs | Dell 8x H200 GPU nodes | h200_normal_q, h200_preemptable_q | FY25 new acquisition | n/a - homogeneous partitions |
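As an illustrative sketch (the account name, resource amounts, and executable are placeholders), a job that must run on the Tinkercliffs Intel “CascadeLake-AP” nodes after the consolidation could request them via the consolidated partition and a feature constraint:

```bash
#!/bin/bash
#SBATCH --account=<your_account>   # placeholder: your Slurm allocation account
#SBATCH --partition=normal_q       # consolidated Tinkercliffs CPU partition
#SBATCH --constraint=intel         # restrict the job to the Intel nodes
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=2:00:00

./my_program                       # placeholder for your executable
```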

Decoupling CPU and Memory Requests to enable “right-sizing” of jobs

CPU and memory will continue to be allocated together by default as “slices” of a node, but users will now have full control over exactly how much of each resource their job needs. Decoupling CPU and memory requests, which were previously locked together, provides several advantages:

  • no surprise CPU core additions to compensate for memory requests.

  • previously, extra CPU cores meant that job billing could be higher than expected.

  • since unrequested CPU cores are not added to jobs to provide requested memory, more CPU cores will remain available for other jobs.

  • more accurate billing incentivizes research groups to monitor resource utilization and to “right-size” their jobs.

  • a more flexible model for users who need many CPUs with little memory, or few CPUs with a large amount of memory.

New Model

Default CPU/memory allocation behavior is unchanged to help minimize the impact of the changes and to provide a memory allocation scheme which avoids many accidental “out of memory” (OOM) situations. If you find that a job needs more memory, but you don’t need more cores, then simply request more memory.

Use the command seff <jobid> on a completed job to examine Slurm’s report of memory allocated versus used, and customize the memory allocation in future jobs with #SBATCH --mem=<size>[units], where <size> is an integer and [units] is one of M (the default), G, or T.
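For example, a minimal right-sizing workflow might look like the following sketch (the job ID and resource amounts are placeholders):

```bash
# Inspect a completed job's allocated versus used memory (job ID is a placeholder)
seff 1234567

# If the job used well under its allocation, request only what it needs next time,
# e.g. 4 cores with 32 GB of memory in the job script:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
```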

Old Model

Between 2018 and 2025, the standard ARC resource request model was to allocate CPU and memory resources in fixed “slices” of a node. Conceptually, when 8 cores were requested on a Tinkercliffs normal_q node with 128 CPU cores and 256GB of memory, the job would get the 8 requested cores and a proportional amount of node memory ((8/128)*256 = 16GB). Likewise, if a job requested 1 CPU core and 16GB of memory, Slurm was configured to allocate the proportional share of CPU cores to go with the memory, so the job would again get 8 CPU cores and 16GB of memory.

Job Billing

Free Tier Increases to 1M units per PI monthly

ARC is implementing a unified billing model along with the other changes, intended to make usage of the clusters more uniform and convenient. This could result in a job consuming more units than before due to the decoupling of CPU, memory, and GPU resources.

For example, a full node in the Tinkercliffs normal_q will cost about 143 units per hour because memory is now billed, where previously it was 128 units per hour, an increase of about 12%. However, the free-tier monthly allocation is also increasing, from 800,000 units per PI to 1,000,000 units (a 25% increase).

We have also decoupled billing for different GPU models and generations: older and slower GPUs will have a lower cost than newer and faster GPUs. Additionally, with the decoupling of memory and CPU, users are better able to reduce effective billing by examining the resource utilization of completed jobs and tuning future jobs to request precisely the CPUs and memory they need. You can access the billing calculator for more details.

Billing Begins for Owl and Falcon clusters

New cluster resources are sometimes released with no billing to encourage adoption of and migration to the new resources and to provide a grace period while adapting jobs. The Owl and Falcon clusters were both released for general use in the Fall 2024 term with zero billing. After the May maintenance, usage of all clusters will be billed consistently.

Billing Reflects All Resources Allocated

Job billing will now take into account all the requested resources: CPU, memory, and GPU. You can access the billing calculator for more details.

Adding QOS Options for Scheduling Flexibility

We are introducing optional, user-selectable Quality of Service (QOS) options to provide enhanced flexibility in balancing tradeoffs between resource scale and scheduling priority. Jobs are scheduled in order of priority, with higher-priority jobs generally starting sooner. While multiple factors affect a job’s total priority calculation, the QOS factor is perhaps the most impactful.

Tradeoff examples:

  • get higher priority for a job, but the maximum timelimit is reduced

  • get an extended duration for a job, but scheduling priority is reduced and the job is more limited than normal

  • get a larger job than is normally allowed, but maximum timelimit is reduced

Concept table for the Base, Short, Long, and Preemptable QOSs:

| QOS | Priority | Max Timelimit | Resource Scale |
|---|---|---|---|
| Base | 1000 | 7 days | 100% of partition limits |
| Short | 2000 | 1 day | 150% of partition limits |
| Long | 500 | 14 days | 25% of partition limits |
| Preemptable | 0 | 30 days | 12.5% of partition limits |

You can run the command sacctmgr show qos to see all available QOSs, and add --qos=<qos-name> to select a QOS for your job (see the example after the list below). The name of each QOS extends the cluster and partition name:

  • Falcon L40s GPU partition: fal_l40s_normal_base, fal_l40s_normal_long, fal_l40s_normal_short, fal_l40s_preemptable_base

  • Falcon A30 GPU partition: fal_a30_normal_base, fal_a30_normal_long, fal_a30_normal_short, fal_a30_preemptable_base

  • Falcon V100 GPU partition: fal_v100_normal_base, fal_v100_normal_long, fal_v100_normal_short, fal_v100_preemptable_base

  • Falcon T4 GPU partition: fal_t4_normal_base, fal_t4_normal_long, fal_t4_normal_short, fal_t4_preemptable_base

  • Owl CPU partition: owl_normal_base, owl_normal_long, owl_normal_short, owl_preemptable_base

  • Tinkercliffs CPU partition: tc_normal_base, tc_normal_long, tc_normal_short, tc_preemptable_base

  • Tinkercliffs A100 GPU partition: tc_a100_normal_base, tc_a100_normal_long, tc_a100_normal_short, tc_a100_preemptable_base

  • Tinkercliffs H200 GPU partition: tc_h200_normal_base, tc_h200_normal_long, tc_h200_normal_short, tc_h200_preemptable_base
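As a sketch (the account name, resource amounts, and executable are placeholders), a short, higher-priority job on the consolidated Tinkercliffs CPU partition might be submitted as:

```bash
#!/bin/bash
#SBATCH --account=<your_account>   # placeholder: your Slurm allocation account
#SBATCH --partition=normal_q       # consolidated Tinkercliffs CPU partition
#SBATCH --qos=tc_normal_short      # higher priority, 1-day maximum timelimit
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=12:00:00

./my_program                       # placeholder for your executable
```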

Cluster organization

Infer cluster being retired

The Infer cluster aggregated a variety of GPU resources. The P100 nodes had been in service for 9 years and have been fully eclipsed by resources in other clusters. As of the May maintenance, they are being removed from service.

The remaining T4 and V100 nodes are also aging (5 and 7 years old, respectively) but will be merged into the Falcon cluster, which aligns well with their current utility. Along the way, they will get updates to their operating systems and software stacks.

Operating System Upgrade on All Clusters

After the maintenance, all ARC clusters will be running the same operating system (Rocky 9.5) and a common set of OS packages. This will provide a more unified experience for accessing cluster resources.

Cluster data to be made “private”

With over a thousand active users and a million jobs per year, commands like squeue that display cluster status produce a very large amount of output. To streamline these views and protect personal information, we are enabling Slurm features that limit the visibility of most job information to your own jobs. From now on, you will not see other users’ jobs in squeue.
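Listing your own jobs explicitly works the same as before, for example:

```bash
# Show only your own jobs (now the same result as plain squeue)
squeue -u $USER
```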

New tools for cluster monitoring

We created a number of dashboards to help users understand the load and utilization of the clusters:

  • Job resources availability. Use this dashboard to input the amount of resources you need for a job (number of CPUs, number of GPUs, and memory). The tool filters, in real time, the clusters, partitions, and compute nodes that currently have these resources available, so your job can start immediately without delay. When your job can scale up or down, use this tool to determine the exact amount of resources your job can use without having to wait in the queue.

  • Cluster load. Use this dashboard to understand the current workload of the clusters.

  • Cluster utilization. Use this dashboard to understand the real-time and 30-day utilization of the clusters aggregated by research project, department, and college.

Software and Modules Unification and Overhaul

To make using the clusters easier and more efficient for research, ARC provides pre-installed software modules for a large number of scientific applications and their dependencies. Most of these are built from source code, and we attempt to tune the builds so that they make full use of the architectural features of each node type, such as GPU devices and CPU microarchitecture instruction sets, particularly vectorization instructions and variants like AVX, AVX2, and AVX-512.

We have historically performed software installations in an ad-hoc manner based on requests from researchers, but this has resulted in highly differentiated sets of available software depending on the cluster and node type. We are modifying this approach by standardizing on a common set of applications to be provided on all clusters. This should make it easier to move workloads among various cluster resources and generally reduce the likelihood of having to wait for software installations.

You can view the new modules added to the clusters (++ newmodule/version), modules deprecated and no longer available (-- removedmodule/version), and modules that remain (module/version). You will need to update your code to work with the latest software module versions and names.

Mount point updates for software stacks

ARC has used several different mount points for the software we provide. Most included a reference to the system name and elements of the node microarchitecture. This made paths long and complex, and also made it more complicated to search for and load some modules (e.g., module load tinkercliffs-rome/MATLAB/<version>). We are streamlining these mount points in a way that provides a consistent experience within node types and also across clusters. All modules will be named the same way on all clusters, improving the portability of your jobs to other partitions and clusters.

| Installation system | New standard mount | Example of previous mount |
|---|---|---|
| EasyBuild software | /apps/arch/software/ | /apps/easybuild/software/tinkercliffs-rome |
| EasyBuild modules | /apps/arch/modules/ | /apps/easybuild/modules/tinkercliffs-rome/all |
| Manually installed software | /apps/arch/software/ | /apps/packages/tinkercliffs-rome/ |
| Manually installed modules | /apps/arch/modules/ | /apps/modulefiles/tinkercliffs-rome/ |
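As a sketch of how this changes module usage (assuming the cluster/microarchitecture prefix is dropped from module names; the version is a placeholder):

```bash
# Before the maintenance, module names carried the cluster and microarchitecture:
module load tinkercliffs-rome/MATLAB/<version>

# After the maintenance, search for and load the module by its plain name on any cluster:
module spider MATLAB
module load MATLAB/<version>
```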

Answers for some frequently asked questions (FAQ)

Job Script Syntax and Parameters

sbatch: error: invalid partition specified: xxx

We have consolidated partitions to make more resources available to jobs without having to guess and check multiple partitions. See the section on partition changes above for more details and a list of available partitions.

If you used the dev_q partitions to get increased priority for a few short jobs, consider using the short QOS option as described above.
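As a sketch of the change to a job script (the Tinkercliffs CPU partition is shown as an assumption; adjust the partition and QOS names for your cluster):

```bash
# Before the maintenance:
#SBATCH --partition=dev_q

# After the maintenance, use the consolidated partition plus the short QOS:
#SBATCH --partition=normal_q
#SBATCH --qos=tc_normal_short
```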

Software

“The software module I used before maintenance isn’t there now. Can you reinstall it for me?”

We installed the latest version of all software packages. However, if you need a specific older version of a package, we can install it for you; we will do so when code or data is not forward compatible (e.g., code or data for the 2023 version is not compatible with the 2024 version). Additionally, ARC is making a concerted effort to standardize the software available across all clusters.

Use module spider <string> from the command line to search for packages which are already installed. If a package you need is not found, please submit a help request via https://arc.vt.edu/help. We will add it to our to-do list.

“What does ‘Legacy Apps’ on Open OnDemand mean?”

Before the maintenance, our Open OnDemand (OOD) apps were developed for the Tinkercliffs cluster with containerized implementations. Due to the containerization, much of the apps’ functionality remains intact after the update to the operating system. Therefore, we are keeping them available as we continue to develop OOD. While we have tested the more commonly used apps like RStudio, MATLAB, and Desktop, be advised that there may be some issues with Legacy apps since they were developed for the previous system.

We are actively developing new and improved apps. You can see the recently released apps in the “Interactive Apps” dropdown as before.

Billing