Common Datasets

ARC clusters provide central storage for commonly used, large, open datasets. This helps reduce infrastructure costs by eliminating some unnecessary duplication and allows researchers to reserve their storage allocations for their own data.

How to Use Common Datasets

Common datasets are stored in the /common/data/ directory, which is accessible from all clusters. For instance, /common/data/models/ contains many Large Language Models downloaded from Hugging Face. Point your software at these copies, which reside on a fast, flash-based filesystem, rather than downloading your own copies and using up space in your home directory.
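
As a minimal sketch of the advice above, the following Python helpers check whether a model is already present under /common/data/models/ before anything is downloaded. Only the /common/data/models/ path comes from this page. The function names and the assumption that each model sits in its own subdirectory are illustrative, so list the directory on the cluster to confirm the actual layout.

```python
from pathlib import Path
from typing import List, Optional

# Path taken from this page; the layout beneath it is an assumption.
COMMON_MODELS = Path("/common/data/models")


def find_common_model(name: str, root: Path = COMMON_MODELS) -> Optional[Path]:
    """Return the directory for `name` under `root`, or None if it is absent.

    Hypothetical helper: assumes each model lives in its own subdirectory.
    """
    candidate = root / name
    return candidate if candidate.is_dir() else None


def list_common_models(root: Path = COMMON_MODELS) -> List[str]:
    """List the immediate subdirectories of `root`, sorted by name."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

If find_common_model returns a path, that path can be passed to whatever loader your software uses in place of a remote model identifier. If it returns None, the model is probably not hosted centrally and a dataset request may be appropriate.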

Requests

If you know of a dataset that should be added to these locations, please submit an ARC Helpdesk request. Before doing so, consider the following:

Does the dataset’s licensing permit sharing in this manner?
Are several VT research groups likely to benefit from the centralized hosting?

Submit a request via 4help:

Include “Request dataset to be added to /common on ARC systems” as the subject.
Provide a link or reference to the dataset.
Supply a brief description of the data and its utility for your applications.