.. _Monitoring:

=========================
Monitoring Jobs and Nodes
=========================

.. contents::
   :local:

After submitting a job, it is important to monitor the cluster utilization, that is, the processor, memory, network and storage usage of the job. For this purpose, users are allowed to log in on the compute nodes they allocate. To see which nodes are used by a job, one can use the Slurm command ``squeue``. After logging in on a particular node, the processes can be monitored using the usual commands ``top``, ``htop`` and ``atop``.

A better way to monitor and explore the cluster status is to use the web application at https://it.fysik.su.se/hpc-moni. This site contains tools that enable users to interactively view the Slurm queues and jobs, the compute node status, and the overall cluster health.

------------------

HPC-moni Application
====================

The HPC monitoring application is located at https://it.fysik.su.se/hpc-moni.

.. figure:: images/grafana_main_menu.png
   :target: /hpc/_images/grafana_main_menu.png
   :align: center

The initial screen (home page) shows general information about the cluster, such as the queue and node status, the ambient temperature and the current power consumption. The home page also serves as the main menu, with links to other pages displaying more detailed metrics.

The cluster metrics are collected using `Prometheus <https://prometheus.io/>`_ exporters and shown in different panels using `Grafana <https://grafana.com/>`_. The panels are organized into four groups:

- SLURM

  - Job overview (Slurm ``squeue``)
  - Compute node status (Slurm ``sinfo``)
  - Slurm scheduler statistics

- NODES

  - CPU and memory statistics of node groups
  - Disk status (per node)
  - Node details (grouped by different hardware characteristics)

- ENVIRONMENT

  - Rack overview
  - Power distribution unit (PDU) current readings
  - Power consumption details
  - Temperature (ambient, CPU and disk)

- MAINTENANCE

  - Lustre file system statistics
  - S.M.A.R.T.
    disk status (available to admins only)
  - Alerts (available to admins only)

------------------

Slurm jobs
==========

The Slurm queue status can be obtained by selecting one of the following links:

.. figure:: images/grafana_main_menu1.png
   :target: /hpc/_images/grafana_main_menu1.png
   :align: center

Jobs overview
-------------

The Slurm job overview page shows information about the running/completed/pending jobs and the allocated/completing/idle nodes. It also shows the running and pending jobs over time, together with a table of the currently running and pending jobs.

.. figure:: images/grafana_slurm_jobs.png
   :target: /hpc/_images/grafana_slurm_jobs.png
   :align: center

To take a closer look at some jobs, one can zoom in on a specific time interval.

.. figure:: images/grafana_slurm_jobs_zoom.png
   :target: /hpc/_images/grafana_slurm_jobs_zoom.png
   :align: center

Compute nodes
-------------

This page shows the current compute node status. Nodes belonging to different jobs are highlighted in different colors. Clicking on a node leads to the `Node Details` page, which shows the CPU and memory usage of that node.

.. figure:: images/grafana_slurm_nodes.png
   :target: /hpc/_images/grafana_slurm_nodes.png
   :align: center

.. note:: Clicking on a job listed in the `Current Jobs` table found on the `Jobs Overview` page shows the `Compute Nodes` page with the allocations at that specific time.

Scheduler statistics
--------------------

The scheduler statistics page shows additional Slurm information, such as the number of failed jobs and nodes.

.. figure:: images/grafana_slurm_sched1.png
   :target: /hpc/_images/grafana_slurm_sched1.png
   :align: center

At the end, this page shows the scheduler metrics: the number of scheduler threads and the backfill depth, together with the scheduler and backfill scheduler cycles.
.. figure:: images/grafana_slurm_sched2.png
   :target: /hpc/_images/grafana_slurm_sched2.png
   :align: center

------------------

Compute nodes
=============

The node performance details can be obtained by selecting one of the nodes in the `Nodes` menu group (the compute nodes are grouped by hardware type).

.. figure:: images/grafana_main_menu2.png
   :target: /hpc/_images/grafana_main_menu2.png
   :align: center

Node details
------------

The node details page shows to which job the node belongs, together with the detailed CPU and memory usage of the node. Good jobs keep the nodes as busy as possible (with high CPU and memory usage) without choking them: the CPU should run at close to 100% in user processes, and the memory usage should not push the node into swap.

.. figure:: images/grafana_node_details.png
   :target: /hpc/_images/grafana_node_details.png
   :align: center

CPU and Memory statistics
-------------------------

The `CPU and Memory stats` page shows the overall statistics for the selected node group. Here one can also explore the head and file system nodes.

.. figure:: images/grafana_node_cpustat.png
   :target: /hpc/_images/grafana_node_cpustat.png
   :align: center

Disk status
-----------

The `Disks status` page shows the storage information for the selected node.

.. figure:: images/grafana_node_disks.png
   :target: /hpc/_images/grafana_node_disks.png
   :align: center

------------------

Environment
===========

The `Environment` panels show the physical rack layout, the current temperature (ambient, CPU and disk), and the electrical power consumption.

.. figure:: images/grafana_main_menu3.png
   :target: /hpc/_images/grafana_main_menu3.png
   :align: center

Rack overview
-------------

The rack overview shows the power distribution unit (PDU) readings, the average ambient temperature (at the top and bottom nodes in the rack), and the current node load (shown in parentheses).
.. figure:: images/grafana_rack_overview.png
   :target: /hpc/_images/grafana_rack_overview.png
   :align: center

PDU readings
------------

The PDUs page shows the electrical current readings aggregated per PDU and phase.

.. figure:: images/grafana_pdus.png
   :target: /hpc/_images/grafana_pdus.png
   :align: center

Power consumption
-----------------

The power consumption page shows the detailed electrical power usage of the cluster over time.

.. figure:: images/grafana_power.png
   :target: /hpc/_images/grafana_power.png
   :align: center

Temperature
-----------

The ambient, CPU and disk temperature readings are shown for different nodes, grouped by hardware type.

.. figure:: images/grafana_temp.png
   :target: /hpc/_images/grafana_temp.png
   :align: center

------------------

Maintenance
===========

The maintenance panels are meant to be used by the system administrators. They show the Lustre file system statistics, the S.M.A.R.T. disk status, and the overall cluster health (number of failed nodes and similar).

.. figure:: images/grafana_main_menu4.png
   :target: /hpc/_images/grafana_main_menu4.png
   :align: center

Lustre status
-------------

.. figure:: images/grafana_lustre1.png
   :target: /hpc/_images/grafana_lustre1.png
   :align: center

.. figure:: images/grafana_lustre2.png
   :target: /hpc/_images/grafana_lustre2.png
   :align: center

------------------

Accessing historical data
=========================

All the panels show the latest statistics by default. Also, the web pages are rendered in the so-called *kiosk* mode, without the Grafana menus. Pressing ``Esc`` exposes these additional menus. One particular menu enables a user to select the time interval for the shown statistics (encircled in red below):

.. figure:: images/grafana_main_menu_tpicker.png
   :target: /hpc/_images/grafana_main_menu_tpicker.png
   :align: center
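The time range can also be encoded directly in the page URL, which is convenient for bookmarking or sharing a particular view. A minimal sketch, assuming the standard Grafana ``from``/``to`` query parameters (the dashboard path shown here is illustrative, not the exact path of every panel):

.. code-block:: text

   # relative range: the last 24 hours
   https://it.fysik.su.se/hpc-moni/?from=now-24h&to=now

   # absolute range, given as millisecond epoch timestamps
   https://it.fysik.su.se/hpc-moni/?from=1700000000000&to=1700086400000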
.. figure:: images/grafana_main_menu_hist.png
   :target: /hpc/_images/grafana_main_menu_hist.png
   :align: center

After selecting a time range, navigating to other links will show the metrics for the selected period (note the selected range in the following two figures). This is helpful for examining how completed jobs were performing historically.

.. figure:: images/grafana_slurm_nodes_hist.png
   :target: /hpc/_images/grafana_slurm_nodes_hist.png
   :align: center

.. figure:: images/grafana_node_details_hist.png
   :target: /hpc/_images/grafana_node_details_hist.png
   :align: center

An alternative way to select the time range of interest is to drag a region in any graph in any Slurm or Node panel, as in the running jobs panel below. After that, clicking on a job in the "current" jobs table will show the Slurm node and job status in the selected time range.

.. figure:: images/grafana_select_interval.png
   :target: /hpc/_images/grafana_select_interval.png
   :align: center

.. figure:: images/grafana_select_interval2.png
   :target: /hpc/_images/grafana_select_interval2.png
   :align: center

.. figure:: images/grafana_select_interval3.png
   :target: /hpc/_images/grafana_select_interval3.png
   :align: center
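As a complement to the web panels, the command-line workflow mentioned in the introduction can be sketched as follows. The node name ``n123`` is a hypothetical placeholder; substitute a node actually listed by ``squeue`` for your job:

.. code-block:: console

   $ squeue -u $USER -t RUNNING -o "%.10i %.9P %.20j %.6D %R"   # your running jobs and their node lists
   $ ssh n123    # log in on one of your allocated nodes (placeholder name)
   $ htop        # interactively inspect per-core CPU load and memory usage; press q to quit

Logging in this way is typically permitted only on nodes where one currently has a running job, as noted at the beginning of this page.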