ARC System Changes: 2025-05

May 19-23, ARC systems will be offline for regular maintenance. During this time a number of major changes will be implemented. This page presents an outline of the changes, explains the impact it may have on your use of ARC systems, and provides a FAQ regarding these changes.

If you have questions about this, there are several ways to get more information or request help:

Submit a help request via https://arc.vt.edu/help
Attend ARC office hours: https://arc.vt.edu/office-hours

Partition Changes

Consolidation of partitions

ARC clusters host a wide variety of resource types because Virginia Tech researchers have a wide variety of computational needs. By grouping resources into larger partitions, jobs will have access to more resources thus helping them start faster.

All CPU-only partitions within a cluster will be combined into a single partition.
GPU partitions will be combined when the GPU devices are of the same model.
Infer cluster will be shutdown. V100 and T4 GPUs will move to Falcon.
New h200_normal_q partition with 7 nodes and 56 NVIDIA H200-141G GPUs!
dev_q partitions will be removed (see info on Quality of Service (QoS) for priority scheduling for short jobs).
Users can use features (--constraint) if they need specific hardware features (e.g. AVX512).
QoS options will let users select a QoS with different walltime, priority, and resource limits.

Cluster	Node Types	Partitions	Notes	User-selectable features
Falcon	A30 GPU nodes	`a30_normal_q` `a30_preemptable_q`	`a30_dev_q` removed	n/a - homogeneous partitions
Falcon	L40S GPU nodes	`l40s_normal_q` `l40s_preemptable_q`	`l40s_dev_q` removed	n/a - homogeneous partitions
Falcon	V100 GPU nodes	`v100_normal_q` `v100_preemptable_q`	formerly part of the Infer cluster	n/a - homogeneous partitions
Falcon	T4 GPU nodes	`t4_normal_q` `t4_preemptable_q`	formerly part of the Infer cluster	n/a - homogeneous partitions
Owl	AMD Zen4 “Genoa” nodes	`normal_q` `preemptable_q`	`dev_q` removed	`--constraint=genoa` and `--constraint=avx512`
Owl	AMD Zen3 “Milan” large-memory nodes	`normal_q` `preemptable_q`	formerly in `largemem_q` and `hugemem_q`	`--constraint=milan`
Tinkercliffs	AMD Zen2 “Rome” nodes	`normal_q` `preemptable_q`	`dev_q` removed	`--constraint=amd`
Tinkercliffs	AMD Zen2 “Rome” large-memory nodes	`normal_q` `preemptable_q`	formerly in `largemem_q`	`--constraint=amd`
Tinkercliffs	Intel “CascadeLake-AP” nodes	`normal_q` `preemptable_q`	formerly in `intel_q`	`--constraint=intel` and `--constraint=avx512`
Tinkercliffs	HPE 8x A100-80G GPU nodes	`a100_normal_q` `a100_preemptable_q`	no change	`--constraint=hpe-A100`
Tinkercliffs	Nvidia DGX 8x A100-80G GPU nodes	`a100_normal_q` `a100_preemptable_q`	formerly in `dgx_normal_q`	`--constraint=dgx-A100`
Tinkercliffs	Dell 8x H200 GPU nodes	`h200_normal_q` `h200_preemptable_q`	new partition	n/a - homogeneous partitions

Decoupling CPU and Memory Requests to enable “right-sizing” of jobs

CPU and memory by default will continue to be allocated together as “slices” of a node, but users will now have full capability to select exactly how much of each resource their job will need. Decoupling the previous locking-together of CPU and memory requests provides several advantages:

CPU cores are not unexpectedly added to jobs to compensate for memory requests. These extra CPU cores meant that job billing could be higher than expected.
Since unrequested CPU cores are not added to jobs to provide requested memory, more CPU cores will remain available for other jobs.
More accurate billing incentivizes research groups to monitor resources utilization and to “right-size” their jobs.
More flexible model for users who need a large number of CPUs but small memory, or small number of CPUs but large memory.

New Model post May 2025

Default CPU/memory allocation when the user does not specify additional parameters remains unchanged to provide a memory allocation scheme that avoids accidental Out Of Memory (OOM) - errors. If you find that a job needs more memory, but you don’t need more cores, then simply request more memory using Slurm’s --mem=<size>[units] option, for example --mem=64G.

If processes running in a job exceed the job’s memory allocation, then Slurm will kill those processes and produce an error like error: tc111: task 0: Out Of Memory

Slurm’s sacct command will also indicate when jobs end in an OOM:

[brownm12@tinkercliffs2 ~]$ sacct -j 3093438
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3093438          stress   normal_q     arcadm          1 OUT_OF_ME+    0:125
3093438.ext+     extern                arcadm          1  COMPLETED      0:0
3093438.0        stress                arcadm          1 OUT_OF_ME+    0:125

Use the command seff <jobid> on a completed job to examine Slurm’s report of memory allocated versus used for a job and customize the memory allocation in future jobs with #SBATCH --mem=<size>[units] where <size> is an integer and [units] is one of M, G, or T, e.g. #SBATCH --mem=16G for a job with 16GB. If units are not specified, Slurm defaults to M for MB.

Old Model pre May 2025

Between 2018 and 2025, the standard ARC resource request model was to allocate CPU and memory resources in fixed “slices” of a node. This meant that, conceptually, when 8 cores were requested on a Tinkercliffs normal_q node with 128 CPU cores and 256GB of memory, the job would get the 8 requested cores and a proportional amount of node memory ((8/128)*256=16GB). Likewise, if a job requested 1 CPU core and 16GB of memory, Slurm was configured to allocate the proportional share of CPU cores to go with the memory. The job would again get 8 CPU cores and 16GB of memory, thus unnecesary overallocating resources that could be used to run other jobs.

Job Billing

Free Tier Increases to 2M units per PI monthly

ARC is implementing a unified billing model along with the other changes intended to make the usage of the cluster more uniform and convenient. This could result in more units being consumed by a job than before due to the decoupling of CPU, memory, and GPU resources.

For example, a Tinkercliffs normal_q full-node will cost about 143 units per hour due to billing for memory where previously it was only 128 units per hour - an increase of about 12%. However the free-tier monthly allocation is also increasing from 800,000 units per PI to 2,000,000 units (150% increase), ultimately increasing the total free trier capability.

We decoupled the consumption of different GPU models and generations. Older and slower GPUs will have a lower cost than newer and faster GPUs. Additionally, with the decoupling of memory and CPU, users are better able to reduce effective billing by examing resource utilization of completed jobs and tuning future jobs to request precisely the CPUs and memory they need. You can access the billing calculator for more details and the FAQ page on how to interpret Slurm billing reports.

Billing Consistently Applies to all Clusters

New cluster resources are sometimes released with no billing to help encourage adoption and migration to the new resources and to provide a grace period while adapting jobs. The Owl and Falcon clusters were both released for general use in the Fall 2024 term with zero billing. After the May maintenace, usage of all clusters will be billed consistently.

Billing Reflects All Resources Allocated

Job billing will now take into account all the requested resources: CPU cores, memory, and GPU. You can access the billing calculator for more details.

Adding QoS Options for Scheduling Flexibility

We are introducing some optional user-selectable Quality of Service (QoS) options to provide enhanced flexibility to balance tradeoffs in resource scale and scheduling priority. Jobs are scheduled in order of priority with higher priority jobs generally being scheduled sooner. While multiple factors affect a job’s total priority calculation, the QoS factor is perhaps the most impactful.

Tradeoff examples:

Get higher priority for a job, but the maximum timelimit is reduced
Get an extended duration for a job, but scheduling priority and the maximum amount of resources are reduced
Get a larger job with more resources, but maximum timelimit is reduced

Concept table for the Short and Long QoS’s. The higher priority the sooner the job will start to run.

QoS	Priority	Max Timelimit	Resource Scale	Billing
Base (default)	1000	7 days	100% of partition limits	1x
Short	2000	1 day	150% of partition limits	2x
Long	500	14 days	25% of partition limits	1x
Preemptable	0	30 days	12.5% of partition limits	1x

You can run the command showqos to see all the QoS. You can add --qos=<qos-name> to the Slurm submission script to select a QoS for your job.

Cluster organization

Infer cluster being retired

The Infer cluster aggregated a variety of GPU resources. The P100 nodes had been in service for 9 years and have been fully eclipsed by resources in other clusters. As of the May maintenance, they are being removed from service.

The remaining T4 and V100 nodes are also aging (5 and 7 years old, respectively) but will merged into the Falcon cluster, which aligns well with their current utility. Along the way, they will get updates to their operating systems and software stacks.

Operating System Upgrade on All Clusters

After the maintenance, all ARC clusters will be running the same operating system and a common set of OS packages (Rocky 9.5). This will provide a more unified experience for accessing cluster resources.

`/globalscratch` symlinks for `/scratch` are removed

In January 2025, we standardized on /scratch as the canonical mount point on all clusters for “scratch” filesystems. We provided symlinks to ease the transition for those who had previously used the /globalscratch mount point. This symlink is not provided anymore but all data remains on /scratch as before.

If you attempt to use /globalscratch in file or directory paths now, you can expect errors like the one below and can resolve them by using just /scratch instead:

cannot access '/globalscratch/username/file': No such file or directory

Use Current mount point	Unavailable Historical mount points
`/scratch`	`/globalscratch` `/fastscratch`

Use /scratch when you need a high-speed storage to temporarily store files in jobs that are very intense in I/O operations. Read/write speeds and latency of /scratch is significantly faster than /home and /projects.

Cluster data to be made “private”

With over a thousand active users and a million annual jobs, commands like squeue to view cluster status dump tons of data to the screen. To streamline views and protect personal information, we are enabling Slurm features which limit the visibility of most job information to only show your own jobs. From now on, you will not see other user’s jobs in squeue and only your jobs (running or pending) will be shown.

New tools for cluster monitorization

We created a number of dashboards to help users understand the load and utilization of the clusters:

Job resources availability. Use this dashboard to input the amount of resources you need for a job (number of CPUs, number of GPUs, and memory). The tool will filter in real time the clusters, partitions, and compute nodes that currently have these resources available to run your job immediately without delay. When your job can scale up or down, use this tool to determine the exact amount of resources your job can use without having to wait in the queue.
Cluster load. Use this dashboard to understand the current workload of the clusters.
Cluster utilization. Use this dashboard to understand the real-time and 30-day utilization of the clusters aggregared by research project, department, and college.
Cluster wait times. Use this dashboard to learn about the wait times currently experiences on each of the partitions.

Software and Modules Unification

To make using clusters easier and more efficient for research, ARC provides pre-installed software modules for a large number of scientific applications and their dependencies. Most of these are built from source code and we attempt to tune the codes so that they make full use of the architectural features of each node type such as GPU devices and CPU microarchitecture instruction sets, particularly vectorization instructions and variants like AVX, AVX-2, and AVX-512.

We have historically performed software installations in an ad-hoc manner based on requests from researchers, but this has resulted in highly differentiated sets of available software depending on the cluster and node type. We are modifying this approach by standardizing on a common set of applications to be provided on all clusters. This should make it easier to move workloads among various cluster resources and generally reduce the likelihood of having to wait for software installations.

You can view the new modules added to the clusters (++ newmodule/version), modules deprecated and no longer available (– removedmodule/version), and modules that remain (module/version). You will need to update your code to work with the latest software module versions and names.

Mount point updates for software stacks

ARC has used several different mount points for the software we provide. Most included reference to the system name and elements of the node micro-architecture. This made paths long and complex and also made it more complicated to search for and load some modules. (e.g. module load tinkercliffs-rome/MATLAB/<version>). We are streamlining these mount points in a way that will provide a consistent experience within node-types and also across clusters. All modules will be named the same way in all clusters, helping the portability of your job to other partitions and clusters (e.g. module load MATLAB/<version> will work on every cluster consistently).

Installation system	New standard mount	Example of previous mount
EasyBuild software	`/apps/arch/software/`	`/apps/easybuild/software/tinkercliffs-rome`
EasyBuild modules	`/apps/arch/modules/`	`/apps/easybuild/modules/tinkercliffs-rome/all`
Manually installed software	`/apps/arch/software/`	`/apps/packages/tinkercliffs-rome/`
Manually installed modules	`/apps/arch/modules/`	`/apps/modulefiles/tinkercliffs-rome/`

Frequently Asked Questions (FAQ)

Slurm jobs

`sbatch: error: invalid partition specified: xxx`

We have consolidated partitions to make more resources available to jobs without having to guess and check multiple partitions. See the section above on partition changes for more details and a list of available partitions.

If you used dev_q partitions for increased job priority for a few short jobs, then you may consider using the short QoS option as described above.

Where are the jobs I submitted to the queue before the maintenance?

Since we made significant disruptive changes to the configuration of the partitions and software, all jobs that were queued on the clusters won’t resume after the maintenance to ensure users revise their scripts to adjust to the new settings.

Why I don’t see any jobs in `squeue` anymore?

Due to the increasing number of jobs it became difficult to find your jobs in the output of squeue. Jobs run by other users will not be shown in the squeue output anymore.

Why is my job not starting?

The squeue command shows the running and pending jobs for an user, and provides the reason a pending job isn’t starting. Jobs run by other users will not be shown in the squeue output. To only show information for a particular job use squeue -j <jobid>. There are limits that apply per job, per user, and per account to ensure a fair utilization of reources among all users (see the Quality of Service (QoS) of each partition of the clusters). Consult all Slurm job reason codes which include some of the most common reasons:

Reason	Meaning
`Priority` or `Resources`	These two are the most common reasons for a job being pending (PD). They mean that the job is waiting in the queue for resources (CPUs, GPUs, and/or memory) to become available. The job will start as soon as these become available. Jobs requesting more resources are likely to sit in the queue for longer.
`QoSMaxCpuPerUserLimit`	The CPU request exceeds the maximum each user is allowed to use for the requested QoS.
`QoSMaxMemoryPerUser`	The request exceeds the maximum amount of Memory each user is allowed to use for the requested QoS.
`QoSMaxGRESPerUser`	The request exceeds the maximum number of a GRES (GPUs) each user is allowed to use for the requested QoS.
`MaxCpuPerAccount`	The job’s CPU request exceeds the per-Account limit on the job’s QoS.
`MaxMemoryPerAccount`	The job’s Memory request exceeds the per-Account limit on the job’s QoS.
`MaxGRESPerAccount`	The job’s GRES (GPUs) request exceeds the per-Account limit on the job’s QoS.
`QoSMaxWallDurationPerJobLimit`	The limit on the amount of wall time a job can request has been exceeded for the requested QoS.
`AssocGrpBillingMinutes`	The allocation to which your submitted the job has exceeded its available resources (e.g., in the free tier)

I don’t want my jobs to run on multiple nodes with heterogeneous hardware. What can I do?

We merged multiple former partitions with different types of CPU architectures into the new normal_q partition. Therefore, jobs may run on nodes with different types of CPU vendors (Intel vs AMD) and architecture features (AVX2, AVX512, etc). This works best to maximize the utilization of resources. However, advanced users may want to explicitly control the hardware architecture for consistency or to access specific features. Therefore, users may enfore feature constraints such as --constraint=amd, --constraint=intel, --constraint=avx512, --constraint=dgx-A100 to control the type of hardware the job will run.

How can I receive an email when my jobs begins or ends?

You can add to your Slurm script --mail-type=ALL --mail-user=myemail@vt.edu to let the system send you an email when your job begins to run or when it ends. It’s useful to know when you should connect and collect the results of your experiments.

Software

The software module I used before maintenance isn’t there now. Can you reinstall it for me?

We installed the latest version of all software packages (see the software available). However, if you need a specific older version of a software package we can install it for you. We will do it when code/data is not forward compatible (e.g. code/data for 2023 version isn’t compatible with 2024 version). Additionally, ARC is making a concerted effort to standardize the software available across all clusters.

Use module spider <softwarename> from the command line to search for packages which are already installed. If a package you need is not found, please submit a help request via https://arc.vt.edu/help. We will add it to our to-do list.

I usually load the `site/tinkercliffs/easybuild/setup` module, but it’s not available anymore, what should I do?

Short answer: Remove that module from your script. You probably did not need it in the first place. Also, there’s no need to include apps or shared modules. Just use module reset and then load the modules for the apps you need.

Longer answer: You should use the module reset command at the beginning of batch jobs to load the base set of modules for the current node type. You can see that the list of modules you have by default looks like this:

$ module list

Currently Loaded Modules:
  1) shared   2) slurm/slurm/24.05.4   3) apps   4) useful_scripts   5) DefaultModules

The list also used to include a cluster-specific module like site/tinkercliffs/easybuild/setup. Those modules aren’t available anymore, but were always somewhat esoteric anywhere. They provided some default settings related to easybuild which are now handled in a different way.

The other modules, shared, slurm, apps, and DefaultModules, provide some basic cluster functiona and are ALL loaded automatically for new shells AND when you issue a module reset command, so there is no reason to load them again.

Conda Virtual Environments

Users are encouraged to use centrally-installed miniconda3 or miniforge3 modules rather than user-installed in $HOME (e.g. default anaconda3 installations). See instructions on how to intall conda virtual environments.

Lmod has detected the following error: The following module(s) are unknown: “XXXX/YY”

Software packages have been upgraded to the latest versions. See the list of software available and versions and update your scripts to load the new module version.

/lib64/libc.so.6: version `GLIBC_X.XX’ not found

The operating systems have been upgraded to Rocky Linux 9.5 which comes with GLIBC 2.34. If you have compiled source code with an older compiler prior to May 2025 you might receive this error. The solutions is to recompile your code on the new systems using the new compilers and toolchains.

error while loading shared libraries: libXYZ.so.N: cannot open shared object file: No such file or directory

Libraries of the operating system and software modules have been upgraded to the latest versions. You might receive this error when your code was compiled and linked against a version of a library that no longer exists. The solution is to recompile your source code and link it to the new version of the libraries installed in the system.

Billing

How do jobs consume my free tier allocation now?

All PIs get two million service units (SUs) per month for free. When a user runs a job, it requires an amount of resources (CPU cores, memory, and GPUs) to run for a given amount of time. Therefore, the job is billed by the type and amount of resources, and the length of the job. You can access the billing calculator for more details and the FAQ page on how to interpret Slurm billing reports. The goal is to encourage a responsible utilization of resources and adjust the costs of a job to the size and type of resources allocated (e.g. newer, faster GPUs cost more SUs than older, slower GPUs).

Also, selecting a “short” QoS for your job will result in twice the billing rate it would have otherwise to compensate for the higher priority and access to more resources. Similarly, using a preemptable_q partition comes with zero billing in exchange for low piority and immediate cancel and requeue if other jobs need the resources.

For example these three jobs all use the same hardware resources, but will incur different billing:

partition	resources	duration	QoS	total billing
Tinkerliffs `a100_normal_q`	cpus=8,mem=256GB,gres=gpu:1	1:00:00	default	124
Tinkerliffs `a100_normal_q`	cpus=8,mem=256GB,gres=gpu:1	1:00:00	`tc_a100_normal_short`	248
Tinkerliffs `a100_preemptable_q`	cpus=8,mem=256GB,gres=gpu:1	1:00:00	default	0

Can I buy additional capacity and get higher priority?

Yes, PIs can purchase additional storage or computing capacity via the Cost Center. Service units acquired via the cost center have higher priority, which might be useful when you must meet a deadline.