Infer, GPU Cluster

Overview

Infer came online in 2021 and has 99 nodes, 2,688 CPU cores, 38 TB RAM, and 179 GPUs (80 NVIDIA V100 GPUs, 19 NVIDIA T4 GPUs, and 80 NVIDIA P100 GPUs).

|                   | T4 Nodes                    | P100 Nodes                  | V100 Nodes                  |
|-------------------|-----------------------------|-----------------------------|-----------------------------|
| Vendor            | HPE                         | Dell                        | Dell                        |
| Chip              | Intel Xeon Gold 6130        | Intel Xeon E5-2680v4 2.4GHz | Intel Xeon Gold 6136 3.0GHz |
| Nodes             | 18                          | 40                          | 40                          |
| Cores/Node        | 32                          | 28                          | 24                          |
| GPU Model         | NVIDIA Tesla T4             | NVIDIA Tesla P100           | NVIDIA Tesla V100           |
| GPUs/Node         | 1                           | 2                           | 2                           |
| Memory (GB)/Node  | 192                         | 512                         | 384                         |
| Total Cores       | 576                         | 1,120                       | 960                         |
| Total Memory (GB) | 3,456                       | 20,480                      | 15,360                      |
| Local Disk        | 480 GB SSD                  | 187 GB SSD                  | 120 GB SSD                  |
| Interconnect      | EDR-100 IB                  | Ethernet                    | Ethernet                    |

Login

ARC users can log into Infer at:

infer1.arc.vt.edu
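
For example, a minimal login from a terminal, assuming your VT username (PID) is yourpid (replace with your own), would be:

$ ssh yourpid@infer1.arc.vt.edu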

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure that resources remain available to a broad set of researchers and applications. The per-queue limits are listed below; a sample job script follows the table:

|                             | t4_normal_q    | t4_dev_q       | p100_normal_q  | p100_dev_q     | v100_normal_q  | v100_dev_q     |
|-----------------------------|----------------|----------------|----------------|----------------|----------------|----------------|
| Node Type                   | T4 GPU         | T4 GPU         | P100 GPU       | P100 GPU       | V100 GPU       | V100 GPU       |
| Billing Weight              | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) |
| Number of Nodes             | 16             | 2              | 38             | 2              | 38             | 2              |
| MaxRunningJobs (User)       | 10             | 2              | 16             | 2              | 16             | 2              |
| MaxSubmitJobs (User)        | 100            | 3              | 32             | 4              | 32             | 4              |
| MaxRunningJobs (Allocation) | 20             | 3              | 24             | 4              | 24             | 3              |
| MaxSubmitJobs (Allocation)  | 200            | 6              | 48             | 8              | 48             | 6              |
| MaxNodes (User)             | 8              | 2              | 16             | 16             | 16             | 16             |
| MaxNodes (Allocation)       | 12             | 2              | 24             | 24             | 24             | 24             |
| MaxCPUs (User)              | 256            | 64             | 224            | 224            | 256            | 256            |
| MaxCPUs (Allocation)        | 384            | 64             | 336            | 336            | 384            | 384            |
| MaxGPUs (User)              | 8              | 2              | 16             | 16             | 16             | 16             |
| MaxGPUs (Allocation)        | 12             | 2              | 24             | 24             | 32             | 32             |
| Max Job Duration (hours)    | 72             | 4              | 144            | 4              | 144            | 4              |
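
For illustration, a minimal batch script requesting a single V100 GPU on the v100_normal_q might look like the sketch below. The account name, CPU count, and walltime are placeholders to adjust for your own allocation and workload; depending on configuration, the GPU type can also be named in the --gres request (e.g., gpu:v100:1).

#!/bin/bash
#SBATCH --account=yourallocation    # replace with your Slurm allocation (account) name
#SBATCH --partition=v100_normal_q   # any of the queues in the table above
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                # request one GPU on the node
#SBATCH --cpus-per-task=12
#SBATCH --time=24:00:00             # must stay within the queue's Max Job Duration

module reset
module load fosscuda/2020b          # CUDA-enabled toolchain (see Modules below)

nvidia-smi                          # confirm the allocated GPU is visible to the job

Submit the script with sbatch and monitor it with squeue, e.g., sbatch v100_job.sh followed by squeue -u $USER (the script name here is hypothetical).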

Modules

Infer’s module structure is similar to that of TinkerCliffs, but different from previous ARC clusters in that it uses a new application stack/module system based on EasyBuild. A video tutorial on module usage under this paradigm is provided here; a longer class on EasyBuild, including how you can use it to build your own modules, is here.

Key differences between EasyBuild and our legacy paradigm from a user perspective include:

  • Hierarchies are replaced by toolchains. Right now, there are four:

    • foss (“Free Open Source Software”): GCC compilers, OpenBLAS for linear algebra, OpenMPI for MPI, etc.

    • fosscuda: foss with CUDA support

    • intel: Intel compilers, Intel MKL for linear algebra, Intel MPI

    • intelcuda: intel with CUDA support

  • Instead of loading modules individually (e.g., module load intel mkl impi), a user can just load the toolchain (e.g., module load fosscuda/2020b).

  • Modules load their dependencies, e.g.,

$ module reset; module load GROMACS/2020.4-fosscuda-2020b; module list
Currently Loaded Modules:
  1) shared                       8) GCCcore/10.2.0                15) numactl/2.0.13-GCCcore-10.2.0     22) GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1  29) FFTW/3.3.8-gompic-2020b
  2) gcc/9.2.0                    9) zlib/1.2.11-GCCcore-10.2.0    16) XZ/5.2.5-GCCcore-10.2.0           23) UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1    30) ScaLAPACK/2.1.0-gompic-2020b
  3) slurm/slurm/19.05.5         10) binutils/2.35-GCCcore-10.2.0  17) libxml2/2.9.10-GCCcore-10.2.0     24) libfabric/1.11.0-GCCcore-10.2.0         31) fosscuda/2020b
  4) apps                        11) GCC/10.2.0                    18) libpciaccess/0.16-GCCcore-10.2.0  25) PMIx/3.1.5-GCCcore-10.2.0               32) GROMACS/2020.4-fosscuda-2020b
  5) site/infer/easybuild/setup  12) CUDAcore/11.1.1               19) hwloc/2.2.0-GCCcore-10.2.0        26) OpenMPI/4.0.5-gcccuda-2020b
  6) useful_scripts              13) CUDA/11.1.1-GCC-10.2.0        20) libevent/2.1.12-GCCcore-10.2.0    27) OpenBLAS/0.3.12-GCC-10.2.0
  7) DefaultModules              14) gcccuda/2020b                 21) Check/0.15.2-GCCcore-10.2.0       28) gompic/2020b
  • All modules are visible with module avail, so in many cases it is more practical to search with module spider than to print the whole list (see the example after this list).

  • Some key system software, like the Slurm scheduler, is included in the default modules. This means that module purge can break important functionality; use module reset instead.

  • Lower-level software is included in the module structure (see, e.g., binutils in the GROMACS example above), which should mean less risk of conflicts in adding new versions later.

  • Environment variables (e.g., $SOFTWARE_LIB) available in our previous module system may not be provided. Instead, EasyBuild typically provides $EBROOTSOFTWARE to point to the software installation location. For example, to link against CUDA libraries, one might use -L$EBROOTCUDA/lib64 instead of the previous -L$CUDA_LIB (see the sketches after this list).
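
For instance, a typical search-and-load session under this system might look like the following sketch (GROMACS is used only because it appears in the example above):

$ module spider GROMACS                          # list available GROMACS versions
$ module spider GROMACS/2020.4-fosscuda-2020b    # show details and prerequisites for one version
$ module reset                                   # return to the default modules (not module purge)
$ module load GROMACS/2020.4-fosscuda-2020b      # toolchain, CUDA, MPI, etc. load automatically
$ module list                                    # verify what is loaded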
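
And as a sketch of the $EBROOT* convention, a build line for a hypothetical C source file that calls the CUDA runtime might use the module-provided root variable rather than the legacy $CUDA_LIB:

$ module reset; module load fosscuda/2020b
$ echo $EBROOTCUDA                               # installation prefix exported by the CUDA module
$ gcc my_kernel.c -I$EBROOTCUDA/include -L$EBROOTCUDA/lib64 -lcudart -o my_kernel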