.. _Monitoring:

=========================
Monitoring Jobs and Nodes
=========================

.. contents::
   :local:

After submitting a job, it is important to monitor the cluster utilization, that is, the processor, memory, network and storage usage of the job. For this purpose, users are allowed to log in on the compute nodes they allocate. To see which nodes are used by a job, one can use the Slurm command ``squeue``. After logging in on a particular node, the processes can be monitored using the usual commands ``top``, ``htop`` and ``atop``.

A better way to monitor and explore the cluster status is to use the web application at https://it.fysik.su.se/hpc-moni. This site contains tools that enable users to interactively view the Slurm queues and jobs, the compute node status, and the overall cluster health.

------------------

HPC-moni Application
====================

The HPC monitoring application is located at https://it.fysik.su.se/hpc-moni.

.. figure:: images/grafana_main_menu.png
   :target: /hpc/_images/grafana_main_menu.png
   :align: center

The initial screen (home page) shows general information about the cluster, such as the queue and node status, the ambient temperature and the current power consumption. The home page also serves as the main menu, with links to other pages displaying more detailed metrics.

The cluster metrics are collected using `Prometheus <https://prometheus.io/>`_ exporters and shown in different panels using `Grafana <https://grafana.com/>`_. The panels are organized into four groups:

- SLURM

  - Job overview (Slurm ``squeue``)
  - Compute node status (Slurm ``sinfo``)
  - Slurm scheduler statistics

- NODES

  - CPU and memory statistics of node groups
  - Disk status (per node)
  - Node details (grouped by different hardware characteristics)

- ENVIRONMENT

  - Rack overview
  - Power distribution unit (PDU) current readings
  - Power consumption details
  - Temperature (ambient, CPU and disk)

- MAINTENANCE

  - Lustre file system statistics
  - S.M.A.R.T.
    disk status (available to admins only)
  - Alerts (available to admins only)

------------------

Slurm jobs
==========

The Slurm queue status can be obtained by selecting one of the following links:

.. figure:: images/grafana_main_menu1.png
   :target: /hpc/_images/grafana_main_menu1.png
   :align: center

Jobs overview
-------------

The Slurm job overview page shows information about the running/completed/pending jobs and the allocated/completing/idle nodes. It also shows the running and pending jobs over time, together with a table of the currently running and pending jobs.

.. figure:: images/grafana_slurm_jobs.png
   :target: /hpc/_images/grafana_slurm_jobs.png
   :align: center

To take a closer look at some jobs, one can zoom in on a specific time interval.

.. figure:: images/grafana_slurm_jobs_zoom.png
   :target: /hpc/_images/grafana_slurm_jobs_zoom.png
   :align: center

Compute nodes
-------------

This page shows the current compute node status. Nodes belonging to different jobs are highlighted in different colors. Clicking on a node leads to the `Node Details` page, which shows the CPU and memory usage of that node.

.. figure:: images/grafana_slurm_nodes.png
   :target: /hpc/_images/grafana_slurm_nodes.png
   :align: center

.. note:: Clicking on a job listed in the `Current Jobs` table found on the `Jobs Overview` page shows the `Compute Nodes` page with the allocations at that specific time.

Scheduler statistics
--------------------

The scheduler statistics page shows additional Slurm information, such as the number of failed jobs and nodes.

.. figure:: images/grafana_slurm_sched1.png
   :target: /hpc/_images/grafana_slurm_sched1.png
   :align: center

At the end, this page shows the scheduler metrics: the number of scheduler threads and the backfill depth, together with the scheduler and backfill scheduler cycles.
.. figure:: images/grafana_slurm_sched2.png
   :target: /hpc/_images/grafana_slurm_sched2.png
   :align: center

------------------

Compute nodes
=============

The node performance details can be obtained by selecting one of the nodes in the `Nodes` menu group (the compute nodes are grouped by hardware type).

.. figure:: images/grafana_main_menu2.png
   :target: /hpc/_images/grafana_main_menu2.png
   :align: center

Node details
------------

The node details page shows to which job the node belongs, together with the detailed CPU and memory usage of the node. Good jobs keep the nodes as busy as possible (with high CPU and memory usage) without choking them: the CPU should run at close to 100% in user processes, and the memory usage should not push the node into swap.

.. figure:: images/grafana_node_details.png
   :target: /hpc/_images/grafana_node_details.png
   :align: center

CPU and Memory statistics
-------------------------

The `CPU and Memory stats` page shows the overall statistics for the selected node group. Here one can also explore the head and file system nodes.

.. figure:: images/grafana_node_cpustat.png
   :target: /hpc/_images/grafana_node_cpustat.png
   :align: center

Disk status
-----------

The `Disks status` page shows the storage information for the selected node.

.. figure:: images/grafana_node_disks.png
   :target: /hpc/_images/grafana_node_disks.png
   :align: center

------------------

Environment
===========

The `Environment` panels show the physical rack layout, the current temperature (ambient, CPU and disk), and the electrical power consumption.

.. figure:: images/grafana_main_menu3.png
   :target: /hpc/_images/grafana_main_menu3.png
   :align: center

Rack overview
-------------

The rack overview shows the power distribution unit (PDU) readings, the average ambient temperature (at the top and bottom nodes in the rack), and the current node load (shown in parentheses).
.. figure:: images/grafana_rack_overview.png
   :target: /hpc/_images/grafana_rack_overview.png
   :align: center

PDU readings
------------

The PDUs page shows the electrical current readings aggregated per PDU and phase.

.. figure:: images/grafana_pdus.png
   :target: /hpc/_images/grafana_pdus.png
   :align: center

Power consumption
-----------------

The power consumption page shows the detailed electrical power usage of the cluster over time.

.. figure:: images/grafana_power.png
   :target: /hpc/_images/grafana_power.png
   :align: center

Temperature
-----------

The ambient, CPU and disk temperature readings are shown for different nodes, grouped by hardware type.

.. figure:: images/grafana_temp.png
   :target: /hpc/_images/grafana_temp.png
   :align: center

------------------

Maintenance
===========

The maintenance panels are meant to be used by the system administrators. They show the Lustre file system statistics, the S.M.A.R.T. disk status, and the overall cluster health (number of failed nodes and similar).

.. figure:: images/grafana_main_menu4.png
   :target: /hpc/_images/grafana_main_menu4.png
   :align: center

Lustre status
-------------

.. figure:: images/grafana_lustre1.png
   :target: /hpc/_images/grafana_lustre1.png
   :align: center

.. figure:: images/grafana_lustre2.png
   :target: /hpc/_images/grafana_lustre2.png
   :align: center

------------------

Accessing historical data
=========================

All the panels show the latest statistics by default. Also, the web pages are rendered in the so-called *kiosk* mode, without the Grafana menus. Pressing ``Esc`` exposes these additional menus. One particular menu enables a user to select the time interval for the shown statistics (encircled in red below):

.. figure:: images/grafana_main_menu_tpicker.png
   :target: /hpc/_images/grafana_main_menu_tpicker.png
   :align: center
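The time range can also be encoded directly in the page URL, which is convenient for bookmarking or sharing a particular view. A minimal sketch, assuming the standard Grafana ``from``/``to`` query parameters (the dashboard path shown here is illustrative, not the exact path of every panel):

.. code-block:: text

   # relative range: the last 24 hours
   https://it.fysik.su.se/hpc-moni/?from=now-24h&to=now

   # absolute range, given as millisecond epoch timestamps
   https://it.fysik.su.se/hpc-moni/?from=1700000000000&to=1700086400000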
.. figure:: images/grafana_main_menu_hist.png
   :target: /hpc/_images/grafana_main_menu_hist.png
   :align: center

After selecting a time range, navigating to other links will show the metrics for the selected period (note the selected range in the following two figures). This is helpful for examining how completed jobs were performing historically.

.. figure:: images/grafana_slurm_nodes_hist.png
   :target: /hpc/_images/grafana_slurm_nodes_hist.png
   :align: center

.. figure:: images/grafana_node_details_hist.png
   :target: /hpc/_images/grafana_node_details_hist.png
   :align: center

An alternative way to select the time range of interest is to drag a region in any graph in any Slurm or Node panel, as in the running jobs panel below. After that, clicking on a job in the "current" jobs table will show the Slurm node and job status in the selected time range.

.. figure:: images/grafana_select_interval.png
   :target: /hpc/_images/grafana_select_interval.png
   :align: center

.. figure:: images/grafana_select_interval2.png
   :target: /hpc/_images/grafana_select_interval2.png
   :align: center

.. figure:: images/grafana_select_interval3.png
   :target: /hpc/_images/grafana_select_interval3.png
   :align: center
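As a complement to the web panels, the command-line workflow mentioned in the introduction can be sketched as follows. The node name ``n123`` is a hypothetical placeholder; substitute a node actually listed by ``squeue`` for your job:

.. code-block:: console

   $ squeue -u $USER -t RUNNING -o "%.10i %.9P %.20j %.6D %R"   # your running jobs and their node lists
   $ ssh n123    # log in on one of your allocated nodes (placeholder name)
   $ htop        # interactively inspect per-core CPU load and memory usage; press q to quit

Logging in this way is typically permitted only on nodes where one currently has a running job, as noted at the beginning of this page.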