OWL - Water-cooled AMD CPU

The OWL cluster equipment was acquired in FY23, but full commissioning was delayed by the datacenter renovation required to integrate the direct water-cooling system with the building and datacenter where the cluster is housed. In February 2024 it was in the late stages of deployment and testing. It was released for general use in August 2024.

  • The compute nodes on OWL are exclusively CPU-based; there are no GPUs on OWL.

  • Direct water-cooling of the base compute nodes allows them to run at boost speeds (3.8GHz) indefinitely, which is about 40% higher than the base clock rate. Tinkercliffs base compute nodes run at 2.0GHz.

  • AMD’s “Genoa” architecture is the first AMD EPYC generation to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD in each clock cycle). Tinkercliffs base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide.

  • 12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and increased granularity, which should provide substantial speedup for memory-bandwidth-constrained workloads such as finite-element analysis.

  • DDR5-4800 memory provides a nominal 50% speed increase over DDR4-3200 on Tinkercliffs.

  • 768GB of memory per node provides 8GB per core, compared to 2GB/core on Tinkercliffs.

  • Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which we have never had sufficient memory resources.

The large memory nodes were not available in the AMD "Genoa" package at the time of acquisition, so they are equipped with different processors (details below) and are not water-cooled.
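To see which vector instruction sets a given node actually advertises (AVX-512 on the Genoa nodes, AVX2 on older hardware), the CPU flags can be inspected from a shell on that node. The following is a minimal sketch for Linux; the helper names `list_avx_flags` and `has_avx512` and the optional file argument are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# List the AVX-family flags advertised by the CPU.
# The optional argument is a hypothetical override for the cpuinfo file
# (defaults to /proc/cpuinfo, which exists on Linux).
list_avx_flags() {
    local cpuinfo_file="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, split it into one flag per line,
    # keep only avx* entries, and de-duplicate them.
    grep -m1 '^flags' "$cpuinfo_file" 2>/dev/null \
        | tr ' \t' '\n\n' | grep -E '^avx' | sort -u
}

# Succeed only if at least one AVX-512 flag is present.
has_avx512() {
    list_avx_flags "$1" | grep -q '^avx512'
}

if has_avx512; then
    echo "AVX-512 supported"
else
    echo "AVX-512 not reported by this CPU"
fi
```

Running this inside a job on a compute node, rather than on a login node, reports the hardware the job will actually use.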

Overview

|  | Base Compute Nodes | Large Memory | Huge Memory | Totals |
|---|---|---|---|---|
| Vendor | Lenovo | Lenovo | Lenovo |  |
| Chip | AMD EPYC 9454 (Genoa) | AMD EPYC 7763 (Milan) | AMD EPYC 7763 (Milan) |  |
| Nodes | 84 | 2 | 1 | 87 |
| Cores/Node | 96 | 128 | 128 |  |
| Memory (GiB)/Node | 768 DDR5-4800 | 4019 DDR4-3200 | 8038 DDR4-3200 |  |
| Local Disk | 2.9TB NVMe | 2.9TB NVMe | 2.9TB NVMe |  |
| Interconnect | shared 200Gbps HDR InfiniBand: 100Gbps effective data rate | shared 200Gbps HDR InfiniBand: 100Gbps effective data rate | shared 200Gbps HDR InfiniBand: 100Gbps effective data rate |  |
| Total Memory (GiB) | 64512 | 8038 | 8038 |  |
| Total Cores | 8064 | 256 | 128 | 8448 |
| Theoretical Peak | 245.1456 TFLOPS |  |  |  |
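The theoretical peak figure can be reproduced from the base compute node values listed above. The sketch below assumes 8 FP64 operations per core per clock cycle (the eight-way SIMD width mentioned earlier) at the 3.8GHz boost clock; that per-cycle factor is an assumption chosen because it matches the listed figure, not a value stated in this document.

```shell
#!/bin/bash
# Values for the base compute nodes, taken from the overview above.
nodes=84
cores_per_node=96
boost_ghz=3.8
# Assumption: 8 FP64 operations per core per clock cycle (eight-way SIMD).
flops_per_cycle=8

# cores x GHz x FLOPs/cycle gives GFLOPS; divide by 1000 for TFLOPS.
peak_tflops=$(LC_ALL=C awk -v n="$nodes" -v c="$cores_per_node" \
                  -v g="$boost_ghz" -v f="$flops_per_cycle" \
                  'BEGIN { printf "%.4f", n * c * g * f / 1000 }')
echo "Theoretical peak: ${peak_tflops} TFLOPS"
```

Measured performance (see the benchmarks further down) will be well below this number, since it counts only peak arithmetic throughput and ignores memory bandwidth.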

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage (note that the terms “cpu” and “core” are used interchangeably here, following Slurm terminology):

Policies for Main Usage Queues/Partitions

The normal_q, largemem_q, and hugemem_q partitions (queues) handle the bulk of utilization on the OWL cluster.

|  | normal_q | largemem_q | hugemem_q |
|---|---|---|---|
| Node Type | Base Compute | Large Memory | Huge Memory |
| Number of Nodes | 84 | 2 | 1 |
| MaxRunningJobs (User) | 32 | 2 | 2 |
| MaxSubmitJobs (User) | 32 | 8 | 4 |
| MaxRunningJobs (Allocation) | 64 | 8 | 4 |
| MaxSubmitJobs (Allocation) | 200 | 16 | 8 |
| MaxNodes (User) | 32 | 1 | 1 |
| MaxNodes (Allocation) | 48 | 2 | 1 |
| MaxCPUs (User) | 3072 | 128 | 512 |
| MaxCPUs (Allocation) | 4608 | 256 | 768 |
| MaxWallTime | 6 days | 3 days | 6 days |
| Priority (QoS) | 1000 | 1000 | 1000 |
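Before submitting a large job, it can help to check the request against the per-user limits listed above. The following is a minimal sketch for normal_q only; the helper `check_request` is hypothetical, and it checks a single request in isolation, whereas Slurm remains the authority on how limits are applied across all of a user's jobs.

```shell
#!/bin/bash
# Per-user limits for normal_q, copied from the limits listed above.
MAX_NODES_USER=32
MAX_CPUS_USER=3072

# check_request NODES NTASKS CPUS_PER_TASK   (hypothetical helper)
# Prints the size of the request and returns non-zero if it exceeds a limit.
check_request() {
    local nodes=$1 ntasks=$2 cpus_per_task=$3
    local total_cpus=$(( ntasks * cpus_per_task ))
    echo "requested: ${nodes} node(s), ${total_cpus} CPU(s)"
    if (( nodes > MAX_NODES_USER )); then
        echo "exceeds MaxNodes (User) = ${MAX_NODES_USER}" >&2
        return 1
    fi
    if (( total_cpus > MAX_CPUS_USER )); then
        echo "exceeds MaxCPUs (User) = ${MAX_CPUS_USER}" >&2
        return 1
    fi
    return 0
}

# Example: 4 full nodes, one MPI rank per core (4 x 96 = 384 ranks).
check_request 4 384 1 && echo "within normal_q per-user limits"
```

The total CPU count is computed as `--ntasks` times `--cpus-per-task`, which is how the request is expressed in the batch scripts later in this page.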

Policies for Development and Alternative Usage Queues/Partitions

The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.

|  | dev_q | preemptable_q | interactive_q |
|---|---|---|---|
| Node Type | Base Compute | Base Compute | Base Compute |
| Number of Nodes | 84 | 84 | 4 |
| MaxRunningJobs (User) | 2 | 32 | 2 |
| MaxSubmitJobs (User) | 4 | 100 | 4 |
| MaxRunningJobs (Allocation) | 8 | 64 | 3 |
| MaxSubmitJobs (Allocation) | 16 | 200 | 6 |
| MaxNodes (User) | 32 | 32 | 1 |
| MaxNodes (Allocation) | 48 | 48 | 1 |
| MaxCPUs (User) | 3072 | 128 | 512 |
| MaxCPUs (Allocation) | 4608 | 256 | 768 |
| MaxWallTime | 4 hours | 6 days |  |
| Priority (QoS) | 2000 | 0 | 1000 |

AMD Resources

Tuning Guide AMD EPYC 9004

AOCC User Guide

Known Issues

Apptainer may experience issues on login node - use compute nodes instead

user.max_user_namespaces=0 is set as mitigation for a CVE on login nodes. Compute nodes are not affected and do not have this constraint.
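To confirm whether the node you are on has this restriction, the setting can be read directly. The sketch below is for Linux and reads the file that backs the `user.max_user_namespaces` sysctl; the helper names and the optional path argument are hypothetical and exist only for illustration.

```shell
#!/bin/bash
# Print the current user namespace limit, or "unknown" if it cannot be read.
# The optional argument is a hypothetical override of the path.
userns_limit() {
    local path="${1:-/proc/sys/user/max_user_namespaces}"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "unknown"
    fi
}

# Translate the limit into a short hint for container users.
userns_status() {
    case "$(userns_limit "$1")" in
        0)       echo "user namespaces disabled: run Apptainer on a compute node" ;;
        unknown) echo "could not read the user namespace limit" ;;
        *)       echo "user namespaces enabled" ;;
    esac
}

userns_status
```

On a login node with the mitigation in place this should report the disabled case; inside a job on a compute node it should not.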

Benchmarks

STREAM

HPL

HPCG

High performance conjugate gradient (HPCG) test results.

Results on Owl using GCC 13.2.0 and OpenMPI 4.1.6 (the foss 2023b toolchain, loaded with module load foss/2023b).

Inputs: xdim=208, ydim=208, zdim=312, time=1800.

| MPI processes | total memory used (GB) | execution time (s) | execution rate (GFLOP/s) |
|---|---|---|---|
| 2 | 19.30 | 1832.25 | 5.93 |
| 4 | 38.60 | 1840.99 | 6.50 |
| 8 | 77.20 | 1835.21 | 8.73 |
| 16 | 154.41 | 1974.77 | 16.23 |
| 32 | 308.83 | 1956.86 | 32.75 |
| 64 | 617.65 | 2001.58 | 64.165 |
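One way to read the OpenMPI results is to normalize the execution rate by the number of MPI processes and compare each run with the 2-process run. The sketch below does that with awk, using the process counts and rates copied from the results above; the output column layout is an arbitrary choice for illustration.

```shell
#!/bin/bash
# OpenMPI HPCG results copied from above:
# column 1 is the number of MPI processes, column 2 the rate in GFLOP/s.
results="2 5.93
4 6.50
8 8.73
16 16.23
32 32.75
64 64.165"

# For each run, compute the rate per process and the speedup
# relative to the first (2-process) run.
per_process=$(echo "$results" | LC_ALL=C awk '
    NR == 1 { base = $2 }
    { printf "%d %.3f %.2f\n", $1, $2 / $1, $2 / base }')

echo "procs GFLOP/s-per-proc speedup-vs-2"
echo "$per_process"
```

Total memory use grows in proportion to the process count in these runs, which suggests the grid dimensions are per process, so the per-process rate is probably the more meaningful column.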

Results on Owl using GCC 11.3.1 and MVAPICH2 2.3.7 (loaded with module load mvapich2/gcc/64/2.3.7).

Inputs: xdim=208, ydim=208, zdim=312, time=1800.

These data are under revision.

| MPI processes | total memory used (GB) | execution time (s) | execution rate (GFLOP/s) |
|---|---|---|---|
| 2 | 9.65 | 1874.33 | 2.51 |
| 4 | 9.65 | 1935.02 | 1.54 |
| 8 | 9.65 | 1929.36 | 0.77 |
| 16 | 9.65 | 1907.02 | 0.39 |
| 32 | 9.65 | 1891.03 | 0.39 |
| 64 | 9.65 | 1909.17 | 0.39 |

MPI

OpenMPI

A Slurm batch script for running an MPI job with OpenMPI.

#!/bin/bash

#SBATCH -J hpcg


## Wall time.
#SBATCH --time=2-04:00:00   # 2 days and 4 hours.

### Account.  Your account number
#SBATCH --account=your_account_number

### Queue/partition.
#SBATCH --partition=normal_q

### This requests 1 node.
#SBATCH --nodes=1
### Number of MPI ranks; total over all nodes.
#SBATCH --ntasks=2    
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6 

## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out; 
## can waste resources.
#SBATCH --exclusive

## Slurm output and error files.
#SBATCH -o slurm.openmpi.hpcg.%j.out
#SBATCH -e slurm.openmpi.hpcg.%j.err

## Notify me when done.
#SBATCH --mail-type=ALL             # Send email notification at the start and end of the job
#SBATCH --mail-user=your_vt_email   # Send email notification to this address


# Load modules.
module load foss/2023b


## Exports.
export OMP_NUM_THREADS=4


## Time the job with time.
## For OpenMPI, which we are using here:
## The user must set the following variables:  mycode, xdim, ydim, zdim, timedim.
time mpirun  ${mycode}  ${xdim}  ${ydim}  ${zdim}  ${timedim}

mvapich2 MPI

A Slurm batch script for running an MPI job with MVAPICH2.

#!/bin/bash

#SBATCH -J hpcg;mvap2


## Wall time.
#SBATCH --time=0-02:00:00 # 2 hours

### Account.  Your account number
#SBATCH --account=your_account_number

### Queue.
#SBATCH --partition=normal_q

### This requests 1 node. 
#SBATCH --nodes=1
### Number of MPI ranks (i.e., processes); total over all nodes.
#SBATCH --ntasks=2    
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6 

## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out; 
## can waste resources.
#SBATCH --exclusive


## Slurm output and error files.
#SBATCH -o slurm.hpcg.mvapich2.%j.out
#SBATCH -e slurm.hpcg.mvapich2.%j.err

# Load modules.
module load mvapich2/gcc/64/2.3.7

## Exports.
export OMP_NUM_THREADS=4

## Time the job with time.
## For MVAPICH2, which we are using here:
## The following are variables, for user to specify:  mycode, xdim, ydim, zdim, timedim.
time srun  ${mycode}         ${xdim}   ${ydim}  ${zdim}  ${timedim}
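Both scripts above request --cpus-per-task=6 but export OMP_NUM_THREADS=4, so two cores per MPI process are left without an OpenMP thread. If the intent is one thread per allocated core, the thread count can be derived from the Slurm environment instead of being hard-coded. This is a minimal sketch; the helper `pick_omp_threads` is hypothetical, and the fallback of 1 thread outside a job is an assumption.

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK is set by Slurm inside a job when --cpus-per-task is given.
# Outside a job it is unset, so fall back to 1 thread (assumption of this sketch).
pick_omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(pick_omp_threads)"
echo "OpenMP threads per MPI process: ${OMP_NUM_THREADS}"
```

With this in place, changing --cpus-per-task in the #SBATCH header changes the OpenMP thread count to match.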