Abusive Use of Login Nodes
What is a login node?
The login nodes on a compute cluster are a shared resource. They need to be readily available for numerous tasks and see steady use by a constant stream of ARC researchers and students. It is very common to see 40-60 simultaneous login sessions at any given time on any ARC cluster login node.
A login node is sometimes referred to as a “front-end” or “head” node to evoke the sense that it is the system users access as an entry point to the computational clusters. In contrast, the compute nodes are the computational workhorses of the clusters, but are not directly accessible to users outside the context of a running job.
Examples:

- The Tinkercliffs cluster has two login nodes: `tinkercliffs1.arc.vt.edu` and `tinkercliffs2.arc.vt.edu`
- The Infer cluster has one login node: `infer1.arc.vt.edu`
- The Owl cluster has three login nodes: `owl1.arc.vt.edu`, `owl2.arc.vt.edu`, and `owl3.arc.vt.edu`
- The Falcon cluster has two login nodes: `falcon1.arc.vt.edu` and `falcon2.arc.vt.edu`
Acceptable use of a login node
Normal usage of a login node includes activities like:

- composing or editing a job script with a text editor like `nano`, `vi`, or `emacs`
- submitting jobs to the scheduler and monitoring the status of jobs using commands like `sbatch`, `squeue`, and `sacct`
- organizing files for a job or viewing the output from a job
- initiating an interactive job to get a shell on a compute node using `interact`
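For example, composing and submitting a batch script is normal login-node activity. The script below is a minimal sketch: the account and partition names are placeholders, not real ARC values, so check the cluster documentation for the actual ones.

```shell
# Compose a minimal job script (account/partition are placeholders)
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --account=myalloc     # placeholder: your allocation name
#SBATCH --partition=normal_q  # placeholder: a real partition on the cluster
#SBATCH --ntasks=1
#SBATCH --time=00:30:00       # 30 minutes of wall time
echo "Running on $(hostname)"
EOF

# Submit from the login node; the work itself runs on a compute node:
# sbatch myjob.sh
```

Composing the script and running `sbatch` take a fraction of a second on the login node; everything inside the script executes on a compute node.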
Example activities which are sometimes okay and sometimes abusive
There is a significant “gray area” of workloads which are okay to run on login nodes in some cases, but are unacceptable in other cases. The deciding factor is always the impact they have on the login node. As a rule of thumb, if an intensive task will run for more than 2-3 minutes, it should probably be running on a compute node as part of a job.
- compiling software or building Python virtual environments
- compressing or decompressing datasets
- transferring data to/from clusters (ARC hosts a Globus data transfer node which provides better performance and will not impact any login nodes)
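One way to stay on the right side of that rule of thumb is to time the task on a small sample first and extrapolate. The file names below are hypothetical, and `gzip` merely stands in for whatever your real task is:

```shell
# Make a small sample file and time the operation on it
dd if=/dev/zero of=sample.dat bs=1M count=8 2>/dev/null
time gzip -f sample.dat   # seconds on a sample -> estimate for the full dataset
# If the estimate exceeds a couple of minutes, submit the real task as a job.
```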
Unacceptable use of a login node
Any activity on a login node which noticeably impacts the performance, reliability, or availability of the node is considered unacceptable and may be subject to administrative termination. Examples include:

- genomic assembly or sequencing
- simulations or models in StarCCM+, Ansys, COMSOL, Abaqus, MATLAB, R, etc. which take more than 2-3 minutes to run
“Well, what should I be doing then?”
Get a job!
While each login node is a shared resource used by everyone who connects to a cluster, the resources allocated to you when you have a job running on a cluster are strictly for your use and yours alone. This means things will often run much faster because the processors and memory do not have to manage so much “context switching”.
The two main options for jobs are “batch” and “interactive”.
job type | batch | interactive
---|---|---
resource availability | same: follows cluster/node type policies | same: follows cluster/node type policies
process control | job script: a sequence of precomposed commands in a file | commands entered at a shell prompt in real time
initiation | `sbatch` | `interact`
duration | limited only by cluster policies for jobs | job ends when the connection to the login node ends unless started within a persistent terminal session (e.g. `tmux` or `screen`)
best used for | tasks which run for a long time and need substantial resources | tasks which require frequent user interaction
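The duration caveat for interactive jobs deserves a concrete sketch: an interactive job launched directly in an SSH session ends when that session ends. A common workaround, assuming `tmux` is installed on the login node (`screen` works similarly), is to start a multiplexer session first; the commands below are illustrative, not an ARC-documented recipe.

```shell
# On the login node, before requesting the interactive job:
# tmux new -s work      # start a named terminal-multiplexer session
# interact              # then request the interactive job inside it
# If the SSH connection drops, log back in and reattach:
# tmux attach -t work   # the session, and the job inside it, survive

# (tmux availability varies; check for it first)
command -v tmux >/dev/null && echo "tmux available" || echo "tmux not installed"
```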
Inspect your processes and impact on a login node:
htop
The `htop` process and utilization viewer is a command-line tool which is great for showing real-time information about your running processes. A login node will have many thousands of processes, so it’s helpful to limit the view to the processes you own, but be aware that the utilization meters you see take into account ALL processes, not just yours:
htop -u $USER
The default display format of `htop` shows a meter for every core on the computer, which can take up a lot of space on your display when there are 96-128 cores. You can change the layout interactively, or consider using the less intensive program `top`.
From within `htop`, you can view, sort, and even terminate (F9 - Kill) processes that you own.
ps
You can also use `ps` to list all of your current processes in a tree format:
ps jfU $USER
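For a quick head count rather than a full tree, the same information can be reduced to a single number (a generic procps sketch, not ARC-specific):

```shell
# Count how many processes you currently own on this node
ps -u "${USER:-$(id -un)}" --no-headers | wc -l
```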
systemd-cgtop
The `systemd-cgtop` tool allows you to see the aggregate impact you’re having on a login node by showing you the number of tasks, total CPU utilization, and memory footprint (including cache):
systemd-cgtop user.slice/user-`id -u`.slice
“100%” = 1 full CPU core, so 225% means your processes have had an impact equivalent to using 2.25 CPU cores over the most recent monitoring period.
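In other words, dividing the percentage by 100 gives the number of cores’ worth of load you are placing on the node; a quick sketch of the arithmetic:

```shell
# Convert systemd-cgtop %CPU to an equivalent core count (100% = 1 core)
pct=225
awk -v p="$pct" 'BEGIN { printf "%.2f cores\n", p / 100 }'   # -> 2.25 cores
```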