OWL - Water-cooled AMD CPU

The OWL cluster equipment was acquired in FY23, but full commissioning was delayed by prerequisite datacenter renovations needed to integrate the direct water-cooling system with the building and datacenter where it is housed. As of February 2024, it was in the late stages of deployment and testing. It was released for general use in August 2024.

The compute nodes on OWL are exclusively CPU-based; there are no GPUs on OWL.
Direct water-cooling of the base compute nodes allows them to run at boost speed (3.8GHz) indefinitely, which is about 40% higher than the base clock rate. For comparison, Tinkercliffs base compute nodes run at 2.0GHz.
AMD’s “Genoa” architecture is the first from AMD to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD in each clock cycle). Tinkercliffs base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide.
12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and increased granularity, which should provide substantial speedup for memory-bandwidth-constrained workloads such as finite-element analysis.
DDR5-4800 memory provides a nominal 50% speed increase over the DDR4-3200 memory on Tinkercliffs.
768GB of memory per node provides 8GB of memory per core, compared to 2GB/core on Tinkercliffs.
Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which we have never had sufficient memory resources.

The large memory nodes were not available in the AMD "Genoa" package at the time of acquisition, so they are equipped with different processors (details below) and are not water-cooled.
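The per-node figures quoted above can be cross-checked with a few lines of shell arithmetic. This is only a sketch: the inputs come from the list above, while the 8-byte (64-bit) data path per memory channel used for the bandwidth estimate is an assumption, and the resulting bandwidth is a nominal upper bound rather than a measured value.

```shell
# Back-of-the-envelope checks of the base compute node figures.
cores_per_node=96          # per the Overview table
mem_per_node_gb=768
channels_per_socket=12
ddr5_mts=4800              # DDR5-4800: mega-transfers per second
ddr4_mts=3200              # DDR4-3200 on Tinkercliffs
bytes_per_transfer=8       # ASSUMPTION: 64-bit data path per channel

mem_per_core=$(( mem_per_node_gb / cores_per_node ))
fp64_lanes=$(( 512 / 64 ))                                    # AVX-512 width / FP64 width
ddr_gain_pct=$(( (ddr5_mts - ddr4_mts) * 100 / ddr4_mts ))
socket_bw=$(awk -v c="$channels_per_socket" -v r="$ddr5_mts" -v b="$bytes_per_transfer" \
    'BEGIN { printf "%.1f", c * r * b / 1000 }')              # GB/s per socket

echo "memory per core:        ${mem_per_core} GB"
echo "FP64 lanes per vector:  ${fp64_lanes}"
echo "DDR5 vs DDR4 transfer:  +${ddr_gain_pct}%"
echo "nominal socket bandwidth: ${socket_bw} GB/s"
```

Sustained bandwidth measured with a benchmark such as STREAM will be lower than the nominal figure.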

Overview

	Base Compute Nodes	Large Memory	Huge Memory	Totals
Vendor	Lenovo	Lenovo	Lenovo
Chip	AMD EPYC 9454 - Genoa	AMD EPYC 7763 Milan	AMD EPYC 7763 Milan
Nodes	84	2	1	87
Cores/Node	96	128	128
Memory (GiB)/Node	768 DDR5-4800	4019 DDR4-3200	8038 DDR4-3200
Local Disk	2.9TB NVMe	2.9TB NVMe	2.9TB NVMe
Interconnect	shared 200Gbps HDR Infiniband: 100Gbps effective data rate	shared 200Gbps HDR Infiniband: 100Gbps effective data rate	shared 200Gbps HDR Infiniband: 100Gbps effective data rate
Total Memory (GiB)	64512	8038	8038	80588
Total Cores	8064	256	128	8448
Theoretical Peak	245.1456 TFLOPS
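The theoretical peak in the table is consistent with counting only the base compute nodes at the sustained 3.8GHz boost clock with eight FP64 operations per core per cycle. The sketch below reproduces the figure under that reading; the formula is inferred from the numbers on this page, not taken from a stated method.

```shell
# Reproduce the theoretical peak from the Overview table values.
nodes=84
cores_per_node=96
boost_ghz=3.8          # sustained boost clock of the water-cooled nodes
flops_per_cycle=8      # ASSUMPTION: eight-way FP64 SIMD, one operation per lane per cycle

total_cores=$(( nodes * cores_per_node ))
peak_tflops=$(awk -v c="$total_cores" -v f="$boost_ghz" -v w="$flops_per_cycle" \
    'BEGIN { printf "%.4f", c * f * w / 1000 }')

echo "base compute cores: ${total_cores}"
echo "theoretical peak:   ${peak_tflops} TFLOPS"
```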

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage (note that the terms “cpu” and “core” are used interchangeably here, following Slurm terminology):

Policies for Main Usage Queues/Partitions

The normal_q, largemem_q, and hugemem_q partitions (queues) handle the bulk of utilization on the OWL cluster.

	normal_q	largemem_q	hugemem_q
Node Type	Base Compute	Large Memory	Huge Memory
Number of Nodes	84	2	1
MaxRunningJobs (User)	32	2	2
MaxSubmitJobs (User)	32	8	4
MaxRunningJobs (Allocation)	64	8	4
MaxSubmitJobs (Allocation)	200	16	8
MaxNodes (User)	32	1	1
MaxNodes (Allocation)	48	2	1
MaxCPUs (User)	3072	128	512
MaxCPUs (Allocation)	4608	256	768
MaxWallTime	6 days	3 days	6 days
Priority (QoS)	1000	1000	1000
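As an illustration of how the node and CPU limits interact, the sketch below checks whether a single job shape fits within the per-user normal_q limits from the table. The function name fits_normal_q is hypothetical, and the check ignores any other jobs the user already has running, which also count toward the limits.

```shell
# Per-user limits copied from the normal_q column of the table above.
max_nodes_user=32
max_cpus_user=3072
cores_per_node=96      # base compute node, from the Overview table

# fits_normal_q NODES TASKS_PER_NODE CPUS_PER_TASK   (hypothetical helper)
# Succeeds (exit status 0) if one job of this shape fits on the nodes and
# stays within the per-user node and CPU limits.
fits_normal_q() {
    req_cpus_per_node=$(( $2 * $3 ))
    req_total_cpus=$(( $1 * req_cpus_per_node ))
    [ "$req_cpus_per_node" -le "$cores_per_node" ] &&
    [ "$1" -le "$max_nodes_user" ] &&
    [ "$req_total_cpus" -le "$max_cpus_user" ]
}

# 32 full nodes use exactly 32 x 96 = 3072 CPUs, the per-user maximum.
if fits_normal_q 32 96 1; then echo "32 nodes x 96 tasks: within limits"; fi
if ! fits_normal_q 33 96 1; then echo "33 nodes x 96 tasks: exceeds limits"; fi
```

The limits actually enforced can be inspected on the cluster with `scontrol show partition` and `sacctmgr show qos`.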

Policies for Development and Alternative Usage Queues/Partitions

The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.

	dev_q	preemptable_q	interactive_q
Node Type	Base Compute	Base Compute	Base Compute
Number of Nodes	84	84	4
MaxRunningJobs (User)	2	32	2
MaxSubmitJobs (User)	4	100	4
MaxRunningJobs (Allocation)	8	64	3
MaxSubmitJobs (Allocation)	16	200	6
MaxNodes (User)	32	32	1
MaxNodes (Allocation)	48	48	1
MaxCPUs (User)	3072	128	512
MaxCPUs (Allocation)	4608	256	768
MaxWallTime	4 hours	6 days
Priority (QoS)	2000	0	1000

AMD Resources

Tuning Guide AMD EPYC 9004

AOCC User Guide

Known Issues

Benchmarks

STREAM

HPL

HPCG

High performance conjugate gradient (HPCG) test results.

On Owl using gcc version 13.2.0 and OpenMPI version 4.1.6. (This is the foss toolchain 2023b, i.e., module load foss/2023b.)

Inputs: xdim=208, ydim=208, zdim=312, time=1800.

num MPI Processes	total memory used (GB)	execution time (s)	execution rate (GFLOP/s)
2	19.30	1832.25	5.93
4	38.60	1840.99	6.50
8	77.20	1835.21	8.73
16	154.41	1974.77	16.23
32	308.83	1956.86	32.75
64	617.65	2001.58	64.165
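To make the scaling in the OpenMPI table easier to read, the following sketch divides the memory and execution rate of each run by its number of MPI processes. The rows are copied from the table; the per-rank normalization is an illustrative choice, not part of the HPCG output.

```shell
# Rows copied from the OpenMPI table above:
# ranks, total memory (GB), execution time (s), execution rate (GFLOP/s)
results="2 19.30 1832.25 5.93
4 38.60 1840.99 6.50
8 77.20 1835.21 8.73
16 154.41 1974.77 16.23
32 308.83 1956.86 32.75
64 617.65 2001.58 64.165"

# For each run, divide memory and rate by the number of ranks.
summary=$(printf '%s\n' "$results" | awk '{
    printf "%2d ranks: %.2f GB/rank, %.2f GFLOP/s per rank\n", $1, $2 / $1, $4 / $1
}')
printf '%s\n' "$summary"
```

Memory use stays at about 9.65 GB per rank in every run, while the per-rank rate falls from roughly 3 GFLOP/s at 2 ranks to about 1 GFLOP/s at 8 ranks and then stays near that level up to 64 ranks.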

On Owl using gcc version 11.3.1 and MVAPICH2 MPI version 2.3.7. (Using module mvapich2/gcc/64/2.3.7, i.e., module load mvapich2/gcc/64/2.3.7.)

Inputs: xdim=208, ydim=208, zdim=312, time=1800.

These data are under revision.

num MPI Processes	total memory used (GB)	execution time (s)	execution rate (GFLOP/s)
2	9.65	1874.33	2.51
4	9.65	1935.02	1.54
8	9.65	1929.36	0.77
16	9.65	1907.02	0.39
32	9.65	1891.03	0.39
64	9.65	1909.17	0.39

MPI

A Slurm batch script for running an MPI job with OpenMPI.

OpenMPI

#!/bin/bash

#SBATCH -J hpcg


## Wall time.
#SBATCH --time=2-04:00:00   # 2 days and 4 hours.

### Account.  Your account number
#SBATCH --account=your_account_number

### Queue/partition.
#SBATCH --partition=normal_q

### This requests 1 node.
#SBATCH --nodes=1
### Number of MPI ranks; total over all nodes.
#SBATCH --ntasks=2    
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6 

## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out; 
## can waste resources.
#SBATCH --exclusive

## Slurm output and error files.
#SBATCH -o slurm.openmpi.hpcg.%j.out
#SBATCH -e slurm.openmpi.hpcg.%j.err

## Notify me when done.
#SBATCH --mail-type=ALL             # Send email on all job events (begin, end, fail, requeue, ...)
#SBATCH --mail-user=your_vt_email   # Send email notification to this address


# Load modules.
module load foss/2023b


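## Exports.
## Match the OpenMP thread count to the cores requested per MPI rank (--cpus-per-task).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}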


## Time the job with time.
## For OpenMPI, which we are using here:
## The following are variables for the user to specify:  mycode, xdim, ydim, zdim, timedim.
time mpirun   ${mycode}     ${xdim}   ${ydim}  ${zdim}  ${timedim}
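The script above leaves mycode, xdim, ydim, zdim, and timedim for the user to define before the final command. A minimal sketch of those definitions, using the input values from the HPCG benchmark section, is shown below. The executable path ./xhpcg is hypothetical; substitute the path to your own build. The last lines only print the command that would be run, as a dry run.

```shell
# HYPOTHETICAL path to the HPCG executable; replace with your own build.
mycode=./xhpcg

# Problem dimensions and run time in seconds, taken from the benchmark inputs above.
xdim=208
ydim=208
zdim=312
timedim=1800

# Dry run: show the command line the batch script would execute.
cmd="mpirun ${mycode} ${xdim} ${ydim} ${zdim} ${timedim}"
echo "${cmd}"
```

In the batch script itself, these assignments would go before the line that starts with time mpirun.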

mvapich2 MPI

A Slurm batch script for running an MPI job with MVAPICH2.

#!/bin/bash

#SBATCH -J hpcg_mvap2


## Wall time.
#SBATCH --time=0-02:00:00 # 2 hours

### Account.  Your account number
#SBATCH --account=your_account_number

### Queue.
#SBATCH --partition=normal_q

### This requests 1 node. 
#SBATCH --nodes=1
### Number of MPI ranks (i.e., processes); total over all nodes.
#SBATCH --ntasks=2    
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6 

## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out; 
## can waste resources.
#SBATCH --exclusive


## Slurm output and error files.
#SBATCH -o slurm.hpcg.mvapich2.%j.out
#SBATCH -e slurm.hpcg.mvapich2.%j.err

# Load modules.
module load mvapich2/gcc/64/2.3.7

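## Exports.
## Match the OpenMP thread count to the cores requested per MPI rank (--cpus-per-task).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}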

## Time the job with time.
## For MVAPICH2, which we are using here:
## The following are variables, for user to specify:  mycode, xdim, ydim, zdim, timedim.
time srun  ${mycode}         ${xdim}   ${ydim}  ${zdim}  ${timedim}