Abusive Use of Login Nodes
What is a login node?
The login nodes on a compute cluster are a shared resource. They need to be readily available for numerous tasks and see steady use by a constant stream of ARC researchers and students. It is very common to see 40-60 simultaneous login sessions at any given time on any ARC cluster login node.
A login node is sometimes referred to as a “front-end” or “head” node to evoke the sense that it is the system users access as an entry point to the computational clusters. In contrast, the compute nodes are the computational workhorses of the clusters, but are not directly accessible to users outside the context of a running job.
Examples:

- The Tinkercliffs cluster has two login nodes: `tinkercliffs1.arc.vt.edu` and `tinkercliffs2.arc.vt.edu`
- The Infer cluster has one login node: `infer1.arc.vt.edu`
- The Owl cluster has three login nodes: `owl1.arc.vt.edu`, `owl2.arc.vt.edu`, and `owl3.arc.vt.edu`
- The Falcon cluster has two login nodes: `falcon1.arc.vt.edu` and `falcon2.arc.vt.edu`
Acceptable use of a login node
Normal usage of a login node includes activities like:

- composing or editing a job script with a text editor like `nano`, `vi`, or `emacs`
- submitting jobs to the scheduler and monitoring the status of jobs using commands like `sbatch`, `squeue`, and `sacct`
- organizing files for a job or viewing the output from a job
- initiating an interactive job to get a shell on a compute node using `interact`
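For example, composing and submitting a batch script is normal login-node activity. The script below is a minimal sketch: the account and partition names are placeholders, not real ARC values, so check the cluster documentation for the actual ones.

```shell
# Compose a minimal job script (account/partition are placeholders)
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --account=myalloc     # placeholder: your allocation name
#SBATCH --partition=normal_q  # placeholder: a real partition on the cluster
#SBATCH --ntasks=1
#SBATCH --time=00:30:00       # 30 minutes of wall time
echo "Running on $(hostname)"
EOF

# Submit from the login node; the work itself runs on a compute node:
# sbatch myjob.sh
```

Composing the script and running `sbatch` take a fraction of a second on the login node; everything inside the script executes on a compute node.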
Example activities which are sometimes okay and sometimes abusive
There is a significant “gray area” of workloads which are okay to run on login nodes in some cases, but are unacceptable in other cases. The deciding factor is always the impact they have on the login node. As a rule of thumb, if an intensive task will run for more than 2-3 minutes, it should probably be running on a compute node as part of a job.
- compiling software or building Python virtual environments
- compressing or decompressing datasets
- transferring data to/from clusters (ARC hosts a Globus data transfer node which provides better performance and will not impact any login nodes)
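One way to stay on the right side of that rule of thumb is to time the task on a small sample first and extrapolate. The file names below are hypothetical, and `gzip` merely stands in for whatever your real task is:

```shell
# Make a small sample file and time the operation on it
dd if=/dev/zero of=sample.dat bs=1M count=8 2>/dev/null
time gzip -f sample.dat   # seconds on a sample -> estimate for the full dataset
# If the estimate exceeds a couple of minutes, submit the real task as a job.
```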
Unacceptable use of a login node
Any activity on a login node which noticeably impacts the performance, reliability, or availability of the node is considered unacceptable and may be subject to administrative termination. Examples include:

- genomic assembly or sequencing
- simulations or models in StarCCM+, Ansys, COMSOL, Abaqus, MATLAB, R, etc. which take more than 2-3 minutes to run
“Well, what should I be doing then?”
Get a job!
While each login node is a shared resource used by everyone who connects to a cluster, the resources allocated to you when you have a job running on a cluster are strictly for your use and yours alone. This means things will often run much faster because the processors and memory do not have to manage so much “context switching”.
The two main options for jobs are “batch” and “interactive”.
job type | batch | interactive
---|---|---
resource availability | same: follows cluster/node type policies | same: follows cluster/node type policies
process control | job script: a sequence of precomposed commands in a file | commands entered at a shell prompt in real time
initiation | `sbatch` | `interact`
duration | limited only by cluster policies for jobs | job ends when the connection to the login node ends unless started within a persistent terminal session (e.g. `tmux` or `screen`)
best used for | tasks which run for a long time and need substantial resources | tasks which require frequent user interaction
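The duration caveat for interactive jobs deserves a concrete sketch: an interactive job launched directly in an SSH session ends when that session ends. A common workaround, assuming `tmux` is installed on the login node (`screen` works similarly), is to start a multiplexer session first; the commands below are illustrative, not an ARC-documented recipe.

```shell
# On the login node, before requesting the interactive job:
# tmux new -s work      # start a named terminal-multiplexer session
# interact              # then request the interactive job inside it
# If the SSH connection drops, log back in and reattach:
# tmux attach -t work   # the session, and the job inside it, survive

# (tmux availability varies; check for it first)
command -v tmux >/dev/null && echo "tmux available" || echo "tmux not installed"
```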
Inspect your processes and impact on a login node:
htop
The `htop` process and utilization viewer is a command-line tool which is great for showing real-time information about your running processes. A login node will have many thousands of processes, so it’s helpful to limit the view to the processes you own, but be aware that the utilization meters you see take into account ALL processes, not just yours:
htop -u $USER
The default display format of `htop` shows a meter for every core on the computer, which can take up a lot of space on your display when there are 96-128 cores. You can change the layout interactively, or consider using the less intensive program `top`.
From within `htop`, you can view, sort, and even terminate (F9 - Kill) processes that you own.
ps
You can also use `ps` to list all of your current processes in a tree format:
ps jfU $USER
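For a quick head count rather than a full tree, the same information can be reduced to a single number (a generic procps sketch, not ARC-specific):

```shell
# Count how many processes you currently own on this node
ps -u "${USER:-$(id -un)}" --no-headers | wc -l
```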
systemd-cgtop
The `systemd-cgtop` tool allows you to see the aggregate impact you’re having on a login node by showing you the number of tasks, total CPU utilization, and memory footprint (including cache):
systemd-cgtop user.slice/user-`id -u`.slice
“100%” = 1 full CPU core, so 225% means your processes have had an impact equivalent to using 2.25 CPU cores over the most recent monitoring period.
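In other words, dividing the percentage by 100 gives the number of cores’ worth of load you are placing on the node; a quick sketch of the arithmetic:

```shell
# Convert systemd-cgtop %CPU to an equivalent core count (100% = 1 core)
pct=225
awk -v p="$pct" 'BEGIN { printf "%.2f cores\n", p / 100 }'   # -> 2.25 cores
```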