Performance Comparison of Scratch vs. Various ARC Filesystems
Test Results Summary
The tables below show informal timings of file actions performed on a relatively small sample dataset that contains a very large number of files. Datasets of this type can be a major challenge for many filesystems because a large portion of the time needed to process them is spent in overhead operations in the operating system, networking, storage protocol, and storage subsystems.
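As a point of reference, a comparable many-small-files workload can be generated with a simple shell loop; the sketch below uses illustrative file counts, sizes, and paths rather than the actual sample dataset used in these tests:

```bash
# Build a synthetic "many small files" test set (~100,000 files of ~2 KiB
# each), then pack it into a single tar file for transfer and storage.
mkdir -p fstest/src
for i in $(seq -w 1 100000); do
    head -c 2048 /dev/urandom > "fstest/src/file_${i}.dat"
done
tar -cf fstest/manyfiles.tar -C fstest src
```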
When the filesystem is attached via a network (as are /home and /projects on ARC systems), there is an extra layer of overhead for the network communications and storage protocols. While ARC systems are interconnected with some of the fastest, lowest-latency networks available, the aggregate impact of that latency when performing on the order of 10^5 operations and beyond can be very noticeable.
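As a rough illustration (the per-operation latencies below are assumed for the sake of the arithmetic, not measured values), even a fraction of a millisecond of overhead per file becomes substantial at this scale:

```bash
# Aggregate cost of per-file overhead for the ~1.29 million files in the
# sample dataset, at a few assumed per-operation latencies.
files=1290284
for ms in 0.05 0.5 2; do
    awk -v n="$files" -v ms="$ms" \
        'BEGIN { printf "%.2f ms/file -> %6.1f minutes total\n", ms, n * ms / 1000 / 60 }'
done
```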
Sample fileset properties:
| format | size | number of files | mean file size | stdev (bytes) | min (bytes) | median (bytes) | max (bytes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tar | 9244907520 bytes (8.7GiB) | 1290284 | 7165 bytes | 26623 | 21 | 1785 | |
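For reference, per-file size statistics like those above can be derived from the tar listing itself; the pipeline below is one way to do it (a sketch, not necessarily how these particular numbers were produced):

```bash
# Compute count, mean, stdev, min, median, and max of the regular-file sizes
# listed by `tar -tvf` (the size is the third column of the verbose listing).
tar -tvf mil.tar | awk '/^-/ { print $3 }' | sort -n | awk '
    { s[NR] = $1; sum += $1; sumsq += $1 * $1 }
    END {
        mean = sum / NR
        printf "files=%d mean=%.0f stdev=%.0f min=%d median=%d max=%d\n",
               NR, mean, sqrt(sumsq / NR - mean * mean),
               s[1], s[int((NR + 1) / 2)], s[NR]
    }'
```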
Table of results
| target filesystem | copy from HOME (s) | untar (s) | find (s) | delete (s) |
| --- | --- | --- | --- | --- |
| HOME | n/a | 6365.208 | 276.925 | 3014.559 |
| k80_q node NVMe | 11.487 | 42.125 | 2.688 | - |
| A100 node NVMe | 17.486 | 25.424 | 1.653 | 32.130 |
| PROJECTS | 9.352 | 2520 | 664.77 | |
| /fastscratch | 25.385 | 5906.447 | 89.391 | 2821.392 |
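The timings above were collected with the commands shown in the transcripts that follow; a small harness along these lines (the target-directory argument is an assumption, and ~/fstest/mil.tar is the tarball used in the transcripts) repeats the same four operations on any filesystem:

```bash
#!/bin/bash
# Repeat the four timed operations (copy, untar, find, delete) in a target
# directory such as $TMPNVME, a /projects directory, or /fastscratch.
target=${1:?usage: $0 <target-dir>}
cd "$target" || exit 1

time cp ~/fstest/mil.tar .     # copy the tarball from $HOME
time tar -xf mil.tar           # unpack the ~1.3 million small files
time find ./10* | wc -l        # walk and count the extracted files
time rm -rf ./10*              # delete the extracted tree
```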
Lessons to infer from these results
- Data needs to be close to compute. It is a widely used mantra that "data locality" is critical for compute performance, and these tests provide a nearly real-world example.
- Keep many-small-files datasets tarred on networked filesystems like /home and /projects (a job-script sketch of this pattern follows this list).
- Transferring data makes it more likely to be in a nearby cache.
- NVMe's built-in parallelism can be a huge advantage.
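A minimal job-script sketch of that pattern is shown below; the partition, account, and output paths are placeholders, and $TMPNVME is the node-local NVMe scratch variable used in the Cascades transcript that follows:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=k80_q        # placeholder partition
#SBATCH --account=youraccount    # placeholder account

# Stage the tarred dataset onto node-local NVMe: one large transfer instead
# of ~1.3 million per-file operations against the networked filesystem.
cd "$TMPNVME"
cp "$HOME/fstest/mil.tar" .
tar -xf mil.tar

# ... run the workload against the unpacked files here ...

# Repack any outputs and copy a single tar file back to $HOME.
tar -cf results.tar ./output     # hypothetical output directory
cp results.tar "$HOME/fstest/"
```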
Cascades K80 node with NVMe test results:
# get a job on a k80_q node
[brownm12@calogin2 ~]$ salloc --nodes=1 --exclusive --partition=k80_q --account=arctest
salloc: Granted job allocation 926141
salloc: Waiting for resource configuration
salloc: Nodes ca005 are ready for job
[brownm12@ca005 ~]$ cd $TMPNVME
[brownm12@ca005 926141]$ ll
total 0
# copy from /home to TMPNVME
[brownm12@ca005 926141]$ time cp ~/fstest/mil.tar .
real 0m11.487s
user 0m0.010s
sys 0m8.308s
# untar from TMPNVME -> TMPNVME
[brownm12@ca005 926141]$ time tar -xf mil.tar
real 0m42.125s
user 0m4.399s
sys 0m37.456s
# Count the files extracted from the tar
[brownm12@ca005 926141]$ time find ./10* | wc -l
1290284
real 0m2.688s
user 0m1.009s
sys 0m1.808s
Cascades login node working in $HOME
# Untar in /home -> /home
[brownm12@calogin2 fstest]$ time tar -xf mil.tar
real 106m5.208s
user 0m21.187s
sys 4m9.755s
# Count the files extracted from the tar
[brownm12@calogin2 fstest]$ time find ./10* | wc -l
1290284
real 4m36.925s
user 0m3.257s
sys 0m20.711s
# rm on /home
[brownm12@calogin2 fstest]$ time rm -rf 10*
real 50m14.559s
user 0m6.426s
sys 1m38.699s
Tinkercliffs A100 node with NVMe drive tests
# Copy from /home to the node-local NVMe drive
[brownm12@tc-gpu001 tmp]$ time cp ~/fstest/mil.tar .
real 0m17.486s
user 0m0.002s
sys 0m5.363s
# Untar on the NVMe drive
[brownm12@tc-gpu001 tmp]$ time tar -xf mil.tar
real 0m25.424s
user 0m2.717s
sys 0m22.601s
# Count the files extracted from the tar
[brownm12@tc-gpu001 tmp]$ time find ./10* | wc -l
1290284
real 0m1.653s
user 0m0.647s
sys 0m1.074s
# Delete the extracted files from the NVMe drive
[brownm12@tc-gpu001 tmp]$ time rm -rf ./10*
real 0m32.130s
user 0m0.786s
sys 0m26.716s
# Create a tar of the extracted files, discarding the output (read-only pass)
[brownm12@tc-gpu001 tmp]$ time tar -c 10* > /dev/null
real 0m6.420s
user 0m3.210s
sys 0m3.188s
# Create a tar of the extracted files on the NVMe drive (read and write)
[brownm12@tc-gpu001 tmp]$ time tar -cf mil2.tar 10*
real 0m13.066s
user 0m3.787s
sys 0m9.230s
Tinkercliffs login node testing against /fastscratch
# Copy from $HOME to /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time cp $HOME/fstest/mil.tar .
real 0m25.385s
user 0m0.002s
sys 0m6.788s
# Untar /fastscratch -> /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time tar -xf mil.tar
real 98m26.447s
user 0m4.996s
sys 1m23.815s
# Use find to count the files in the unpacked dataset
[brownm12@tinkercliffs2 brownm12]$ time find ./10* | wc -l
1290284
real 1m29.391s
user 0m0.827s
sys 0m6.329s
# Delete files from /fastscratch
[brownm12@tinkercliffs2 brownm12]$ time rm -rf ./10*
real 47m1.392s
user 0m1.077s
sys 1m4.614s