Abusive Use of Login Nodes

What is a login node?

The login nodes on a compute cluster are a shared resource. They need to be readily available for numerous tasks and see steady use by a constant stream of ARC researchers and students. It is very common to see 40-60 simultaneous login sessions at any given time on any ARC cluster login node.

A login node is sometimes referred to as a “front-end” or “head” node to evoke the sense that it is the system users access as their entry point to the computational clusters. In contrast, the compute nodes are the computational workhorses of the clusters, but they are not directly accessible to users outside the context of a running job.

Examples:

  • The Tinkercliffs cluster has two login nodes: tinkercliffs1.arc.vt.edu and tinkercliffs2.arc.vt.edu

  • The Infer cluster has one login node: infer1.arc.vt.edu

  • The Owl cluster has three login nodes: owl1.arc.vt.edu, owl2.arc.vt.edu, and owl3.arc.vt.edu

  • The Falcon cluster has two login nodes: falcon1.arc.vt.edu and falcon2.arc.vt.edu
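
You typically reach a login node with an SSH client from your own machine. As a minimal sketch (the username is a placeholder for your VT username; substitute any of the hostnames listed above):

ssh yourVTusername@tinkercliffs1.arc.vt.edu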

Acceptable use of a login node

Normal usage of a login node includes activities like

  • composing or editing a job script with a text editor like nano, vi, or emacs

  • submitting jobs to the scheduler and monitoring the status of jobs using commands like sbatch, squeue, and sacct (see the sketch after this list)

  • organizing files for a job or viewing the output from a job

  • initiating an interactive job to get a shell on a compute node using interact
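
As a minimal sketch of the submit-and-monitor workflow (the script name myjob.sh is a placeholder for your own job script):

sbatch myjob.sh      # submit a job script to the scheduler
squeue -u $USER      # check the status of your pending and running jobs
sacct -u $USER       # review accounting information for your recent jobs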

Example activities which are sometimes okay and sometimes abusive

There is a significant “gray area” of workloads which are okay to run on login nodes in some cases, but are unacceptable in other cases. The deciding factor is always the impact they have on the login node. As a rule-of-thumb, if an intensive task will run for more than 2-3 minutes, it should probably be running on a compute node as part of a job (a sketch of moving such a task into a batch job follows the list below).

  • compiling software or building python virtual environments

  • compressing or decompressing datasets

  • transferring data to/from clusters (ARC hosts a Globus data transfer node which provides better performance and will not impact any login nodes)
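
For instance, rather than compressing a large dataset on a login node, the same work can be wrapped in a short batch job. This is only a sketch: the Slurm account, partition, and file names are placeholders you would replace with your own.

#!/bin/bash
#SBATCH --account=myaccount     # placeholder Slurm account
#SBATCH --partition=normal_q    # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=1:00:00          # request one hour of walltime

# compress the dataset on a compute node instead of the login node
tar czf mydataset.tar.gz mydataset/

Saved to a file, this is submitted with sbatch <filename> as described above, and the compression then runs on a compute node without impacting the login node.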

Unacceptable use of a login node

Any activity on a login node which noticeably impacts the performance, reliability, or availability of that login node is considered unacceptable and may be subject to administrative termination.

  • genomic assembly or sequencing

  • simulations or models in StarCCM+, Ansys, COMSOL, Abaqus, Matlab, R, etc. which take more than 2-3 minutes to run

“Well, what should I be doing then?”

Get a job!

While each login node is a shared resource used by everyone who connects to a cluster, the resources allocated to you when you have a job running on a cluster are strictly for your use and yours alone. This means things will often run much faster because the processors and memory do not have to manage so much “context switching”.

The two main options for jobs are “batch” and “interactive”; the table below compares them.

job type              | batch                                                           | interactive
----------------------|-----------------------------------------------------------------|------------------------------------------------
resource availability | same: follows cluster/node type policies                        | same: follows cluster/node type policies
process control       | job script: sequence of precomposed commands in a file          | commands entered at a shell prompt in real time
initiation            | sbatch <filename>                                               | interact --account=<slurm account> ...
duration              | limited only by cluster policies for jobs                       | job ends when the connection to the login node ends unless it was started within a screen or tmux session; cluster policies also apply
best used for         | tasks which run for a long time and need substantial resources  | tasks which require frequent user interaction
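
A batch job uses a script like the sketch shown earlier and is submitted with sbatch <filename>. For interactive work, the table notes that the job ends when your connection to the login node ends, so it helps to start the interactive job from inside a screen or tmux session. A minimal sketch (the session name and Slurm account are placeholders):

tmux new -s myjob                # start a named tmux session on the login node
interact --account=myaccount     # request an interactive job from inside tmux
# ... work at the compute node shell prompt ...
# if your connection drops, log back in and re-attach with:
tmux attach -t myjob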

Inspect your processes and their impact on a login node:

htop

The htop process and utilization viewer is a command-line tool which is great for showing real-time information about your running processes. A login node will have many thousands of processes, so it’s helpful to limit the view to the processes you own, but be aware that the utilization meters you see take into account ALL processes and not just yours:

htop -u $USER

The default display format of htop shows a meter for every core on the computer, which can take up a lot of space on your display when there are 96-128 cores. You can change the layout interactively, or consider using the less intensive program top.
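
As a sketch of the same idea with the standard Linux top (from procps, which accepts a -u filter), you can also limit the view to your own processes:

top -u $USER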

From within htop, you can view, sort, and even terminate (F9 - Kill) processes that you own.

ps

You can also use ps to list all of your current processes in a tree format:

ps jfU $USER

systemd-cgtop

The systemd-cgtop tool allows you to see the aggregate impact you’re having on a login node by showing you the number of tasks, total cpu utilization, and memory footprint (including cache):

systemd-cgtop user.slice/user-`id -u`.slice

Here, 100% equals one full cpu core; for example, 225% means your processes have an impact equivalent to using 2.25 cpu cores during the most recent monitoring period.
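
If your version of systemd-cgtop supports batch output and an iteration count (the -b and -n options on recent systemd releases; check systemd-cgtop --help), you can take a one-shot snapshot instead of watching the interactive display:

systemd-cgtop -b -n 1 user.slice/user-`id -u`.slice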