Storage Overview

Below is a table of the general storage places available on ARC and descriptions of when to use each. Please also review our Data Best Practices page for more details about data clean up, data recovery, file permissions, and tips to compress your data.

| Name | Intent | Per User Maximum | Data Lifespan | Notes |
|------|--------|------------------|---------------|-------|
| Home | Long-term storage of user data or compiled executables | 640 GB | As long as the user account is active | Review data clean up tips. |
| Project | Long-term storage of shared group data/files | 50 TB per faculty researcher | As long as the project account is active | Additional storage available for purchase. Review data permissions. |
| Scratch | Short-term storage. Preferred place to store data during calculations (i.e., not in Home). | No size limits enforced | 90 days | More details in the Scratch and Local Scratch sections. Automatic deletion. Each cluster has its own scratch. |
| Archive | Long-term storage for infrequently-accessed files | Available for purchase | Length of the purchase agreement | Managed by ARC staff |

Home

Home provides long-term storage for system-specific data or files, such as installed programs or compiled executables. Home can be reached via the environment variable $HOME, so if a user wishes to navigate to their Home directory, they can simply type cd $HOME. Each user is provided a maximum of 640 GB in their Home directory (across all systems). Home directories are not allowed to exceed this limit.
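
To keep an eye on how close you are to that limit, standard disk-usage commands work from any login node; this is a minimal sketch (ARC may also provide its own quota-reporting tools, which are not shown here):

    # Summarize the total size of your Home directory (can take a while if you have many files)
    du -sh "$HOME"

    # List the largest top-level subdirectories to find clean-up candidates
    du -h --max-depth=1 "$HOME" | sort -h | tail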

Monitor your usage; a full $HOME will cause many complications:

  • Running jobs will fail if they try to write to a Home directory after the hard limit is reached.

  • You will not be able to log in to Open OnDemand (https://ood.arc.vt.edu)

  • Many applications store configuration information in your Home directory and will begin to fail in various ways if they cannot write to it.

  • Please refer to our Data Clean Up page for more details.

Avoid reading/writing data to/from Home in a job or using it as a working directory. Instead, stage files into a scratch location (Scratch or Local Scratch) to keep unnecessary I/O off of the Home filesystem and improve performance.
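
As an illustrative sketch of that staging workflow (the paths and program name are placeholders), a job can copy its inputs from Home into Scratch and run from there:

    # Create a per-user working area in Scratch and stage the input data into it
    mkdir -p /scratch/<username>/myjob
    cp $HOME/project/input.dat /scratch/<username>/myjob/

    # Run the calculation from Scratch so heavy I/O stays off the Home filesystem
    cd /scratch/<username>/myjob
    ./my_program input.dat > output.dat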

Project

Project provides long-term storage for files shared among a research project or group, facilitating collaboration and data exchange within the group. Each Virginia Tech faculty member can request group storage up to the prescribed limit at no cost by requesting a storage allocation via ColdFront. Additional storage may be purchased through the investment computing or cost center programs.

Due to the large size of the file system (10 PB), Project storage is not backed up. If you would like your data to be backed up, you may purchase backup storage via the cost center.

For more details about data permissions in the Project directory, please refer to our Data Permissions page.
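
As a hedged illustration of typical group-sharing permissions (the group name and paths are placeholders, and your group's conventions may differ), standard Unix commands can grant the project group access to shared files:

    # Give the project group read/write access to a shared file
    chgrp <group> /projects/<group>/shared_data.csv
    chmod g+rw /projects/<group>/shared_data.csv

    # Set the setgid bit so new files created in a shared directory inherit the group
    chmod g+s /projects/<group>/shared/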

Scratch (temporary) storage

Scratch is the preferred location to use when running calculations on ARC, especially if your code has to read/write a lot of files during the calculation. It is important to note that each cluster (e.g., Tinkercliffs, Owl, and Falcon) has its own scratch file system, so data stored in Scratch on Tinkercliffs will not be accessible from the Owl cluster, and so forth. See the table below for a breakdown of the types of Scratch and Local Scratch that ARC has:

| Name | Intent | Per User Maximum | Data Lifespan | File System | Environment Variable | Available On |
|------|--------|------------------|---------------|-------------|----------------------|--------------|
| Scratch | Short-term storage. Preferred place to store data during calculations (i.e., not in Home). | No size limits enforced | 90 days | Vast | n/a | Login and compute nodes |
| Local Scratch (TMPDIR or TMPNVME) | Fast, temporary storage. Auto-deleted when job ends. | Size of node hard drive | Length of job | Local hard drives, usually spinning disk or SSD | $TMPDIR | Compute nodes |
| Memory (TMPFS) | Very fast I/O | Size of node memory allocated to job | Length of job | Memory (RAM) | $TMPFS | Compute nodes |

Scratch is a shared resource and has limited capacity, but individual use at any point in time is unlimited provided the space is available. A strict automatic deletion policy is in place: any file on /scratch will be deleted automatically once it reaches an age of 90 days.

Tips for using Scratch:

  • Create a directory for yourself: mkdir /scratch/<username>.

  • Stage files for a job or set of jobs.

  • Check timestamps using ls -l.

  • Keep the number of files and directories relatively small (i.e., less than 10,000). It is a network-attached filesystem and incurs the same performance overhead for file operations that you would get with /home or /projects.

  • Immediately copy any files you want to keep to a permanent location to avoid losing them to the 90-day automatic deletion policy.

  • rsync gives new timestamps by default. Do not use the -t/--times or -a/--archive options, which preserve source timestamps (see the example commands after this list).

  • cp gives new timestamps by default. Avoid the -p/--preserve option, which preserves source timestamps.

  • mv preserves source timestamps by default and there are no options to override this. Use cp instead. This is a general best practice for inter-filesystem transfers anyway.

  • wget preserves source timestamps by default. Override this with wget --no-use-server-timestamps ...
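
As an illustrative sketch of transferring data into Scratch with fresh timestamps, so the 90-day clock starts at the time of the copy (the paths, filenames, and URL are placeholders):

    # rsync without -a/-t assigns new timestamps on the destination
    rsync -r --progress $HOME/datasets/inputs/ /scratch/<username>/inputs/

    # cp without -p also assigns new timestamps
    cp -r $HOME/datasets/reference /scratch/<username>/reference

    # wget: override its default of keeping the server's timestamp
    wget --no-use-server-timestamps -P /scratch/<username>/ https://example.com/data.tar.gz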

Automatic Deletion Details

As mentioned above, files and directories in /scratch will be automatically deleted based on aging policies. Here is how that works:

  1. The storage system runs an hourly job to identify files which have exceeded the aging policy (90 days) and adds these to the deletion queue (see the example after this list for checking file ages).

  2. The storage system runs an automated job at 12:00 AM UTC (7:00 PM EST) every day to process the deletion queue.

  3. Additionally, the storage system will detect and delete all empty directories regardless of age.
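
To see which of your files are approaching the 90-day limit, a standard find command can list files by age; this is a minimal sketch and the path is a placeholder:

    # List files in your scratch directory older than 80 days (i.e., nearing automatic deletion)
    find /scratch/<username> -type f -mtime +80 -ls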

Local Scratch

Running jobs are given a workspace on the local drive(s) of each compute node allocated to the job. The path to this space is specified in the $TMPDIR environment variable. This provides a higher-performing option for I/O, which is a bottleneck for some tasks that involve either handling a large volume of data or a large number of file operations.

Note

Any files in local scratch are removed at the end of a job, so any results or files to be kept after the job ends must be copied to another location as part of the job. Scratch is a good choice for most people.
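
As an illustrative sketch of that pattern (the program name and paths are placeholders), the body of a job script can stage input into $TMPDIR, run there, and copy results back before the job ends:

    # $TMPDIR points at node-local scratch that exists only for this job
    cp /scratch/<username>/inputs/data.in "$TMPDIR/"
    cd "$TMPDIR"

    # Run the calculation against fast local disk
    ./my_program data.in > data.out

    # Copy results somewhere permanent before the job finishes; $TMPDIR is deleted afterwards
    cp data.out /projects/<group>/results/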

Solid State Drives (SSDs)

Solid state drives do not use rotational media (spinning disks/platters) but memory-like flash storage, which gives them better performance characteristics. The environment variable $TMPSSD is set to a directory on an SSD accessible to the owner of a job when an SSD is available on the compute nodes allocated to the job.

NVMe Drives

Same idea as Local Scratch, but on NVMe media, which “has been designed to capitalize on the low latency and internal parallelism of solid-state storage devices.” Running jobs are given a workspace on the local NVMe drive of each compute node so equipped. The path to this space is specified in the $TMPNVME environment variable. This provides another option for users who would prefer to do I/O to local disk (such as for some kinds of big data tasks). Please note that any files in local scratch are automatically removed at the end of a job, so any results or files to be kept after the job ends must be copied to Home or Project.

Memory as storage

Running jobs have access to an in-memory mount on compute nodes via the $TMPFS environment variable. This should provide very fast read/write speeds for jobs doing I/O to files that fit in memory. Please note that these files are removed at the end of a job, so any results or files to be kept after the job ends must be copied to Home or Project.
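
A quick, generic way to confirm where the in-memory space lives for a job and how much room it offers (a minimal sketch run from inside the job):

    # Show the in-memory path for this job and the capacity of the filesystem backing it
    echo "$TMPFS"
    df -h "$TMPFS"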

Archive

If you need an additional solution for data backup or long-term storage, we offer an archive option. Archive provides users with long-term storage for data that does not need to be accessed frequently, e.g., important/historical results, or data preserved to comply with the data retention mandates of federal grants. Archive is not mounted on the clusters and is accessible only through ARC staff: researchers compress their datasets on the clusters and ARC staff will transfer them to the archive.
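
As an illustrative sketch of preparing a dataset for archiving (the paths and names are placeholders), a directory can be bundled and compressed with tar before contacting ARC staff:

    # Bundle and compress a results directory into a single archive file
    tar -czvf results_2024.tar.gz /projects/<group>/results_2024/

    # Verify the archive's contents before requesting the transfer
    tar -tzf results_2024.tar.gz | head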

Archive storage may be purchased through the investment computing or cost center programs. Please reach out to us to acquire archive storage.