Monitoring Jobs and Nodes
After submitting a job, it is important to monitor the cluster utilization,
that is, the processor, memory, network and storage usage of the job.
For this purpose, users are allowed to log in to the compute nodes they have allocated.
To see which nodes are used by a job, one can use the Slurm command squeue.
After logging in to a particular node, the processes can be monitored using the usual commands top, htop and atop.
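As a minimal sketch of the squeue step, the snippet below maps each job ID to its node list. It assumes the Slurm client tools are on the PATH and uses the squeue format fields %i (job ID) and %N (node list); the function names parse_squeue_nodes and nodes_for_user are illustrative and not part of any cluster tooling.

```python
import subprocess


def parse_squeue_nodes(output: str) -> dict[str, str]:
    """Map job ID -> node list from `squeue -h -o "%i %N"` output."""
    jobs: dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split(maxsplit=1)
        if not fields:
            continue  # skip blank lines
        # Pending jobs have no nodes yet, so the second field may be missing.
        jobs[fields[0]] = fields[1].strip() if len(fields) > 1 else ""
    return jobs


def nodes_for_user(user: str) -> dict[str, str]:
    """Run squeue for one user (hypothetical helper; needs Slurm on this host)."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i %N"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue_nodes(result.stdout)
```

Calling nodes_for_user with your own user name returns a dictionary whose values are compressed node lists. These can be expanded into individual host names with scontrol show hostnames before logging in to a node.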
A better way to monitor and explore the cluster status is to use the web application at it.fysik.su.se/hpc-moni. This site contains tools which enable users to interactively view the Slurm queues and jobs, the compute node status, and the overall cluster health.
HPC-moni Application
The HPC monitoring application is located at https://it.fysik.su.se/hpc-moni.
data:image/s3,"s3://crabby-images/ccaaa/ccaaa7760d2f84439f8698217ae770ecde5f42da" alt="_images/grafana_main_menu.png"
The initial screen (home page) shows general information about the cluster, such as the queue and node status, the ambient temperature, and the current power consumption.
The home page also serves as the main menu, with links to other pages displaying more detailed metrics. The cluster metrics are collected by Prometheus exporters and shown in different panels using Grafana. The panels are organized into four groups:
SLURM
Job overview (Slurm squeue)
Compute nodes status (Slurm sinfo)
Slurm scheduler statistics
NODES
CPU and memory statistics of node groups
Disk status (per node)
Node details (grouped by different hardware characteristics)
ENVIRONMENT
Rack overview
Power distribution units (PDUs) current readings
Power consumption details
Temperature (ambient, CPU and disk)
MAINTENANCE
Lustre file system statistics
S.M.A.R.T. disk status (available to admins only)
Alerts (available to admins only)
Slurm jobs
The Slurm queue status can be obtained by selecting one of the following links:
data:image/s3,"s3://crabby-images/49ee0/49ee043c4eca979a13040d4358af6c7a1153f011" alt="_images/grafana_main_menu1.png"
Jobs overview
The Slurm job overview page shows information about the running/completed/pending jobs and the allocated/completing/idle nodes. It also shows the running and pending jobs over time and the table with the current running and pending jobs.
data:image/s3,"s3://crabby-images/ef524/ef52461ccf9506bd58229db435137c306eea3773" alt="_images/grafana_slurm_jobs.png"
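The running and pending job counts on this page can be approximated from the command line. The sketch below tallies job states from the output of squeue -h -o "%T", where %T prints the job state in extended form. The helper name count_job_states is hypothetical, and the page itself may compute its numbers differently.

```python
from collections import Counter


def count_job_states(output: str) -> Counter:
    """Tally job states from `squeue -h -o "%T"` output (one state per line)."""
    return Counter(line.strip() for line in output.splitlines() if line.strip())
```

For example, passing the text "RUNNING", "PENDING", "RUNNING" on three lines yields two running jobs and one pending job.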
To take a closer look at some jobs, one can zoom in on a specific time interval.
data:image/s3,"s3://crabby-images/d128b/d128bccf7dc0a18d2a865215e67d4ef4035a77aa" alt="_images/grafana_slurm_jobs_zoom.png"
Compute nodes
This page shows the current compute node status. Nodes belonging to different jobs are highlighted in different colors. Clicking on a node leads to the Node Details page, which shows the CPU and memory usage of the node.
data:image/s3,"s3://crabby-images/3c7e5/3c7e57865426ea2d2f01e17552f6af4c70149ce2" alt="_images/grafana_slurm_nodes.png"
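A rough command-line counterpart of this page is sinfo. The sketch below sums node counts per state from the output of sinfo -h -o "%T %D", where %T is the node state and %D the number of nodes. The helper name nodes_by_state is illustrative. Note that sinfo reports per partition by default, so a node that belongs to several partitions would be counted more than once.

```python
def nodes_by_state(output: str) -> dict[str, int]:
    """Sum node counts per state from `sinfo -h -o "%T %D"` output."""
    totals: dict[str, int] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # ignore anything that is not "<state> <count>"
        state, count = fields
        totals[state] = totals.get(state, 0) + int(count)
    return totals
```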
Note
Clicking on a job listed in the Current Jobs table found at the Jobs Overview page shows the Compute Nodes page with the allocations at the specific time.
Scheduler statistics
The scheduler statistics page shows additional Slurm information, like the number of failed jobs and nodes.
data:image/s3,"s3://crabby-images/4e43a/4e43a0a42f59fe72693cdc4c020b1f1f56b716bc" alt="_images/grafana_slurm_sched1.png"
Finally, this page shows the scheduler metrics: the number of scheduler threads and the backfill depth, together with the main scheduler and backfill scheduler cycles.
data:image/s3,"s3://crabby-images/70600/706001ab83cda5e2572d7f42b5780743294d3518" alt="_images/grafana_slurm_sched2.png"
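Similar scheduler figures are reported by the Slurm command sdiag. The sketch below collects its integer "key: value" lines, grouped by the section heading they appear under. It is an assumption that sdiag is available to regular users on this cluster and that the page is based on the same data. The exact output layout also varies between Slurm versions, so the parser is deliberately tolerant and skips anything it cannot read.

```python
def parse_sdiag(text: str) -> dict[tuple[str, str], int]:
    """Collect integer 'key: value' lines from sdiag output, keyed by (section, key)."""
    stats: dict[tuple[str, str], int] = {}
    section = "general"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            # A colon-free text line (not a row of asterisks) opens a new section.
            if line and not line.startswith("*"):
                section = line
            continue
        key, _, value = line.rpartition(":")
        value = value.strip()
        if not value:
            section = key.strip()  # heading line that ends with a colon
            continue
        try:
            stats[(section, key.strip())] = int(value)
        except ValueError:
            continue  # timestamps and other non-integer values are ignored
    return stats
```

Keying by section keeps entries that share a name, such as a cycle time reported for both the main and the backfill scheduler, from overwriting each other.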
Compute nodes
The node performance details can be obtained by selecting one of the nodes in the Nodes menu group (the compute nodes are grouped by hardware type).
data:image/s3,"s3://crabby-images/24cca/24ccae0c2aca7244388e4e86aa1c0dae12c4fab4" alt="_images/grafana_main_menu2.png"
Node details
The node details page shows which job the node belongs to, along with the detailed CPU and memory usage of the node. Good jobs keep the nodes as busy as possible (with high CPU and memory usage) without choking them: the CPU usage should be close to 100% in user mode, and the memory usage should not spill into swap.
data:image/s3,"s3://crabby-images/9cc08/9cc0898a63b4b86b9fed4d7e793a5de0f7428998" alt="_images/grafana_node_details.png"
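The swap criterion can also be checked directly on a node you are logged in to. The sketch below computes the swap in use from the contents of /proc/meminfo, using the standard Linux fields SwapTotal and SwapFree. The function name swap_used_kib is illustrative, and the Node Details page may use a different data source.

```python
def swap_used_kib(meminfo: str) -> int:
    """Return swap in use (KiB) given the text of /proc/meminfo."""
    values: dict[str, int] = {}
    for line in meminfo.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    return values["SwapTotal"] - values["SwapFree"]
```

On a compute node, the input can be read with pathlib.Path("/proc/meminfo").read_text(). A result that keeps growing while a job runs suggests the job needs more memory than the node provides.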
CPU and Memory statistics
The CPU and Memory stats page shows the overall statistics for the selected node group. Here one can also explore the head and file system nodes.
data:image/s3,"s3://crabby-images/af36d/af36d35a1c63848b09adff3a9783bd4298db086c" alt="_images/grafana_node_cpustat.png"
Disk status
The Disk status page shows the storage information for the selected node.
data:image/s3,"s3://crabby-images/9f6d0/9f6d033fdcfd8f43dbdae0b5f9845d8d06457a15" alt="_images/grafana_node_disks.png"
Environment
The Environment panels show the physical rack layout, the current temperature (ambient, CPU and disk), and the electrical power consumption.
data:image/s3,"s3://crabby-images/30302/30302791141655c1fa7d4e86c3da4a27305aa7d8" alt="_images/grafana_main_menu3.png"
Rack overview
The rack overview shows the power distribution unit (PDU) readings, the average ambient temperature (at the top and bottom nodes in the rack), and the current node load (shown in parentheses).
data:image/s3,"s3://crabby-images/ab7cf/ab7cf3395f2ed457955bdae83589d974e7f56f6a" alt="_images/grafana_rack_overview.png"
PDUs readings
The PDUs page shows the electrical current readings aggregated per PDU and phase.
data:image/s3,"s3://crabby-images/a5a0b/a5a0bbdeadbeac5e31e5719b5c59949cc49e354b" alt="_images/grafana_pdus.png"
Power consumption
The power consumption page shows the detailed electrical power usage of the cluster over time.
data:image/s3,"s3://crabby-images/3f3b2/3f3b256d065bf551c321793a6508681b4e9331d2" alt="_images/grafana_power.png"
Temperature
The ambient, CPU and disk temperature readings are shown for different nodes grouped by hardware type.
data:image/s3,"s3://crabby-images/fc131/fc131c2b3b882f999c1e7075ae0eadda7a5c5f05" alt="_images/grafana_temp.png"
Maintenance
The maintenance panels are meant to be used by the system administrators. They show the Lustre file system statistics, S.M.A.R.T. disk status, and the overall cluster health (number of failed nodes and similar).
data:image/s3,"s3://crabby-images/0f5bb/0f5bbf60bf285a0870af5f3805679286094b030b" alt="_images/grafana_main_menu4.png"
Lustre status
data:image/s3,"s3://crabby-images/87fab/87fab4d8183995924160f0fe8a57d55245052b57" alt="_images/grafana_lustre1.png"
data:image/s3,"s3://crabby-images/8f4ff/8f4ff8285e41be50424ddcfb55a72fdea47347ac" alt="_images/grafana_lustre2.png"
Accessing historical data
All the panels show the latest statistics by default.
Also, the web pages are rendered in the so-called kiosk mode, without the Grafana menus.
Pressing Esc exposes these additional menus. One of them, the time picker, lets the user select the time interval for the displayed statistics (encircled in red below):
data:image/s3,"s3://crabby-images/4312c/4312c04ccd750741fef7dc423d9c83a62ede08d3" alt="_images/grafana_main_menu_tpicker.png"
data:image/s3,"s3://crabby-images/0c184/0c1841e57f59823ada5d5c5d02894f602228c7bb" alt="_images/grafana_main_menu_hist.png"
After selecting a time range, navigating to other links will show the metrics for that time range (note the selected range in the following two figures). This is helpful for examining how completed jobs performed in the past.
data:image/s3,"s3://crabby-images/7e6fe/7e6fe3dd591b6d937bf162fd10004161548ebe04" alt="_images/grafana_slurm_nodes_hist.png"
data:image/s3,"s3://crabby-images/03e9d/03e9d9d353c927d719f858ee62e5d6bd7880454a" alt="_images/grafana_node_details_hist.png"
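Grafana also accepts the time range as URL query parameters, with from and to given as epoch milliseconds, and kiosk enabling kiosk mode. The sketch below builds such a link. The dashboard path in the usage note is a placeholder, and it is an assumption that the HPC-moni pages honor these parameters when opened this way.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def grafana_range_url(dashboard_url: str, start: datetime, end: datetime,
                      kiosk: bool = True) -> str:
    """Append a fixed time range (epoch milliseconds) to a Grafana dashboard URL."""
    params = {
        "from": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
    }
    query = urlencode(params)
    if kiosk:
        query += "&kiosk"  # bare flag, no value
    return f"{dashboard_url}?{query}"
```

For example, grafana_range_url("https://example.org/d/abc/jobs", start, end) with timezone-aware start and end values produces a link that opens the (hypothetical) dashboard directly on the chosen interval. Using timezone-aware datetimes avoids ambiguity between the local time of the browser and that of the cluster.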
An alternative way to select the time range of interest is to drag a region in any graph in any Slurm or Node panel, as in the running jobs panel below. After that, clicking on a job in the Current Jobs table shows the Slurm node and job status in the selected time range.
data:image/s3,"s3://crabby-images/b13cb/b13cb2ea013b9fbb24595d243a584919cacf1bfd" alt="_images/grafana_select_interval.png"
data:image/s3,"s3://crabby-images/c7a38/c7a387d4cb38d77b6496c4199ea6ea4ccf00ec70" alt="_images/grafana_select_interval2.png"
data:image/s3,"s3://crabby-images/42b6b/42b6bbb1a8eaf9a0be6405dc3b1b3c1c6fad4021" alt="_images/grafana_select_interval3.png"