Basic data transfer tools
SCP
Use scp to download a file or directory from ARC
Call the scp command from the shell command line. It is usually available in default installations of Linux, Windows (PowerShell), and macOS. In recent tests (fall 2022), SCP significantly outperformed GUI-based tools on Windows systems such as MobaXterm and WinSCP.
The basic syntax is scp <source> <destination>. Both the source and destination can be local or remote. When specifying a remote source or destination, you need to provide the hostname and the full path to the file or directory, separated by a colon, like this:
host.domain.tld:/path/to/file
Example: Pull from Tinkercliffs
In this example we “pull” data onto the local computer (eg. a laptop, workstation or even a shell on another ARC node) from ARC systems. So the <source> uses hostname:filename format and the <destination> is the current working directory, which is referenced with a period “.”.
scp tinkercliffs1.arc.vt.edu:/home/username/filename.zip .
Example: Push to a projects directory on Tinkercliffs
In this example we push a directory and its contents from the local system to a /projects directory which is mounted on a Tinkercliffs login node:
scp -r dirname tinkercliffs2.arc.vt.edu:/projects/mygroup/
The “-r” option is for a “recursive” transfer, which means the referenced directory and all of its contents will be transferred and the directory structure will be retained on the destination. If the “-r” is not specified, but the source is a directory, then scp will fail with an error like:
cp: omitting directory ‘dirname’
RSYNC
rsync, “a fast, versatile, remote (and local) file-copying tool”, is a standard tool on Linux and Unix systems with a long list of options you can turn on or off to customize the nature of the transfer. It is particularly well-suited for performing a synchronization where different versions of a data collection reside in two locations, because it can minimize the amount of data transferred and can resume a partially completed transfer. scp or cp, on the other hand, will always perform an entire copy from source to destination, even if files and directories already exist at the destination.
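As a sketch, a typical invocation to synchronize a local directory to the /projects path used in the scp example above might look like the following (the flags shown are common choices, not the only reasonable ones):
# -a preserves permissions and timestamps, -v is verbose,
# -P shows progress and keeps partial files so an interrupted transfer can be resumed
rsync -avP dirname/ tinkercliffs2.arc.vt.edu:/projects/mygroup/dirname/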
Best practices for transfers
Package datasets with a large number of files before transferring
If you need to transfer a dataset which has a large number of small files, use tar or zip to package the dataset into a smaller number of larger files (see the example below). Most tools will process files in a dataset sequentially and there is significant overhead from the OS, network, and storage devices when many small files are transferred this way. A single, large-file transfer, on the other hand, will only incur this overhead latency once and the rest of the time will be spent in high-bandwidth transfers.
For context in these scenarios:
“small files” means files smaller than 10MB
“large number of files” means thousands or more: 1000+
This is applicable for any transfer of a large number of small files, even intra-cluster. In many cases, it can be very effective to copy a data set (for example AI/ML training data) to local scratch space on compute nodes. See this example for more detail.
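As a rough sketch, packaging a directory of many small files before an scp transfer might look like the following (the dataset name and paths are placeholders; substitute your own):
# package the directory into a single compressed archive
tar -czf dataset.tar.gz dataset/
# transfer one large file instead of thousands of small ones
scp dataset.tar.gz tinkercliffs1.arc.vt.edu:/projects/mygroup/
# then unpack it on the destination system
tar -xzf /projects/mygroup/dataset.tar.gz -C /projects/mygroup/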
Parallelize data transfers when possible
Most, if not all, of ARC’s networked storage systems (eg. /home, /fastscratch, /projects) are capable of managing many simultaneous data flows, but they are mounted via a protocol whose performance for a single data transfer is much lower than the aggregate performance of several streams running in parallel. Standard tools like cp, mv, scp, and rsync will process the source arguments in serial, which means only one file is copied at a time. To engage the full bandwidth of the networked storage system, we need to parallelize, that is, force multiple simultaneous transfers.
In this example benchmark, GNU parallel is used to launch a varying number of simultaneous copies from /fastscratch to the $TMPNVME directory on a DGX compute node. Performance improves dramatically by parallelizing, but does plateau at around eight simultaneous copies.
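A minimal sketch of this approach (the dataset path and job count are placeholders; $TMPNVME is the node-local NVMe scratch directory referenced above):
# launch up to 8 simultaneous cp processes, one per file
find /fastscratch/username/dataset -maxdepth 1 -type f | parallel -j 8 cp {} $TMPNVME/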
rclone
Log in to OnDemand: https://ood.arc.vt.edu
Start Remote Desktop
Start a shell via the link in the job card
List the tmux sessions:
tmux ls
Attach to the tmux session for your job. For example, for the job with id 447439:
tmux a -t OOD.tmux.447439
Load the rclone module:
module load rclone/1.42-foss-2020a-amd64
Now follow: https://rclone.org/drive/
Example: Config rclone for metfaces
As an example, to download the metfaces dataset (big, so beware):
rclone config
> n
> metfaces
storage> 11
client_id> {blank}
client_secret> {blank}
scope> 1
Next is the folder id, which can be read from the end of the Google Drive folder URL; for metfaces this is https://drive.google.com/drive/folders/1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
root_folder_id> 1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
service_account_file> {blank}
Y/n> y
Now copy the address shown, for instance “http://127.0.0.1:53682/auth”.
Go to the Remote Desktop, start Firefox, and head to that web address.
Go back to the rclone config
Y/n> n
Y/e/d> y
/n/d/r/c/s/q> q
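To confirm the new remote was created, you can list the configured remotes; the output should include the metfaces: remote set up above:
rclone listremotes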
To start using the rclone remote you just set up, you can, for instance:
Get a listing of files
rclone ls metfaces:
Download the data in the metfaces google drive to current dir
rclone copy metfaces: ./
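For a large dataset like this, it may also help to display progress and raise the number of parallel file transfers; both --progress and --transfers are standard rclone flags, and 8 is just an illustrative value in line with the parallelization advice above:
rclone copy --progress --transfers 8 metfaces: ./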