Performance Comparison of Scratch vs. Various ARC Filesystems

Test Results Summary

The table below shows some informal timings of file operations performed on a relatively small sample dataset that contains a very large number of files. This type of dataset can be a major challenge for many filesystems because a large portion of the time needed to process it is spent in overhead for the operating system, networking, storage protocol, and storage subsystems rather than in moving actual data.

When the filesystem is attached via a network (as /home and /projects are on ARC systems), there is an extra layer of overhead from the network communication and storage protocols. While ARC systems are interconnected with some of the fastest, lowest-latency networks available, the aggregate impact of that latency when performing on the order of 10^5 or more operations can be very noticeable.
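As a rough illustration of the scaling (the 2 ms per-file figure below is an assumed value chosen only for illustration, not a measurement from these tests), even a small fixed overhead per file dominates once the file count is large:

# Back-of-envelope: total overhead = number of files x assumed per-file latency
# 2 ms per metadata operation is a hypothetical value used only to show the scaling
files=1290284
latency_ms=2
echo "scale=1; $files * $latency_ms / 1000 / 60" | bc
# => 43.0  (minutes of pure per-file overhead, before any payload data moves)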

Sample fileset properties:

format | size                       | number of files | mean file size | stdev | min | median | max
tar    | 9244907520 bytes (8.7 GiB) | 1290284         | 7165 bytes     | 26623 | 21  | 1785   | -
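Statistics like these can be derived directly from the tarball's listing; the snippet below is a sketch of one way to do so (mil.tar matches the transcripts later on this page, and the awk summary is illustrative rather than the exact procedure used to produce the table above):

# Summarize regular files in the tarball: count, total size, mean size, max size
# (`tar -tvf` prints the size in bytes in column 3; lines starting with '-' are regular files)
tar -tvf mil.tar | awk '$1 ~ /^-/ { n++; s += $3; if ($3 > max) max = $3 }
    END { printf "files=%d  total=%d bytes  mean=%.0f bytes  max=%d bytes\n", n, s, s/n, max }'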

Table of results

target filesystem | copy from HOME (s) | untar (s) | find (s) | delete (s)
HOME              | - n/a -            | 6365.208  | 276.925  | 3014.559
k80_q node NVMe   | 11.487             | 42.125    | 2.688    | -
A100 node NVMe    | 17.486             | 25.424    | 1.653    | 32.130
PROJECTS          | 9.352              | 2520      | 664.77   | -
/fastscratch      | 25.385             | 5906.447  | 89.391   | 2821.392

Lessons to draw from these results

Data needs to be close to compute

It is a widely repeated mantra that “data locality” is critical for compute performance, and these tests provide a near-real-world example.

Keep many-small-files datasets tarred on networked file systems like /home and /projects, and unpack them on node-local storage (see the sketch after this list).

Transferring data makes it more likely to be in a nearby cache.

NVMe's built-in parallelism can be a huge advantage.
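The first two lessons combine into a simple staging pattern: move the one tarball over the network, unpack it on node-local NVMe, do the work there, and move a single tarball of results back. The sketch below assumes a job environment where $TMPNVME points at the node-local NVMe scratch directory (as in the transcripts that follow); process.sh and results.tar are placeholders, not real files from these tests:

#!/bin/bash
# Sketch of staging a many-small-files dataset onto node-local NVMe inside a job.
# $TMPNVME and ~/fstest/mil.tar match the transcripts below; process.sh is a
# placeholder for whatever actually consumes the files.
cd "$TMPNVME"

# 1. One large network transfer instead of ~1.3 million small ones
cp ~/fstest/mil.tar .

# 2. Unpack locally, where per-file overhead is cheap
tar -xf mil.tar

# 3. Run the real workload against the local copies
./process.sh ./10*

# 4. Re-tar the outputs and push a single file back to networked storage
tar -cf results.tar ./10*
cp results.tar ~/fstest/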

Cascades K80 node with NVMe test results:

# get a job on a k80_q node
[brownm12@calogin2 ~]$ salloc --nodes=1 --exclusive --partition=k80_q --account=arctest
salloc: Granted job allocation 926141
salloc: Waiting for resource configuration
salloc: Nodes ca005 are ready for job
[brownm12@ca005 ~]$ cd $TMPNVME
[brownm12@ca005 926141]$ ll
total 0

# copy from /home to TMPNVME
[brownm12@ca005 926141]$ time cp ~/fstest/mil.tar .

real	0m11.487s
user	0m0.010s
sys	    0m8.308s

# untar from TMPNVME -> TMPNVME
[brownm12@ca005 926141]$ time tar -xf mil.tar

real	0m42.125s
user	0m4.399s
sys 	0m37.456s

# Count the files extracted from the tar
[brownm12@ca005 926141]$ time find ./10* | wc -l
1290284

real	0m2.688s
user	0m1.009s
sys	    0m1.808s

Cascades login node working in $HOME

# Untar in /home -> /home
[brownm12@calogin2 fstest]$ time tar -xf mil.tar

real    106m5.208s
user    0m21.187s
sys     4m9.755s

# Count the files extracted from the tar
[brownm12@calogin2 fstest]$ time find ./10* | wc -l
1290284

real    4m36.925s
user    0m3.257s
sys     0m20.711s

# rm on /home
[brownm12@calogin2 fstest]$ time rm -rf 10*

real    50m14.559s
user    0m6.426s
sys     1m38.699s

Tinkercliffs A100 node with NVMe drive tests

# Copy from $HOME to the node-local NVMe drive
[brownm12@tc-gpu001 tmp]$ time cp ~/fstest/mil.tar .

real	0m17.486s
user	0m0.002s
sys 	0m5.363s

# Untar from NVMe -> NVMe
[brownm12@tc-gpu001 tmp]$ time tar -xf mil.tar

real	0m25.424s
user	0m2.717s
sys	    0m22.601s

# Count the files extracted from the tar
[brownm12@tc-gpu001 tmp]$ time find ./10* | wc -l
1290284

real	0m1.653s
user	0m0.647s
sys 	0m1.074s

# Delete the unpacked files from NVMe
[brownm12@tc-gpu001 tmp]$ time rm -rf ./10*

real	0m32.130s
user	0m0.786s
sys	    0m26.716s

# Re-tar the files, discarding the output (read-only pass)
[brownm12@tc-gpu001 tmp]$ time tar -c 10* > /dev/null

real	0m6.420s
user	0m3.210s
sys 	0m3.188s

# Re-tar the files into a new tarball on NVMe
[brownm12@tc-gpu001 tmp]$ time tar -cf mil2.tar 10*

real	0m13.066s
user	0m3.787s
sys	    0m9.230s

Tinkercliffs login node testing against /fastscratch

# Copy from $HOME to /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time cp $HOME/fstest/mil.tar .
real	0m25.385s
user	0m0.002s
sys	    0m6.788s

# Untar /fastscratch -> /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time tar -xf mil.tar
real	98m26.447s
user	0m4.996s
sys	    1m23.815s

# Use find to count the files in the unpacked dataset
[brownm12@tinkercliffs2 brownm12]$ time find ./10* | wc -l
1290284
real	1m29.391s
user	0m0.827s
sys	    0m6.329s

# Delete files from /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time rm -rf ./10*
real	47m1.392s
user	0m1.077s
sys	1m4.614s