Monitoring Nodes and Virtual Environments via Prometheus
You can monitor nodes running Virtuozzo Server 7.5 and newer, as well as the virtual environments (VEs) hosted on them, via Prometheus. A typical list of required components includes:
- Prometheus, a service that collects statistics from exporters and stores them in a time series database.
- Alertmanager, a service that receives alerts from Prometheus and handles their delivery via various communication channels.
- Grafana, a service that provides a web panel with flexible dashboards and supports Prometheus as a data source.
- Exporters, services that are installed on Virtuozzo Server nodes and export metrics via a simple HTTP server.
This guide describes how to install the exporters, configure an existing Prometheus service, and import the dashboards to an existing Grafana panel. For details on installing Prometheus, Alertmanager, and Grafana, see the respective documentation:
- https://prometheus.io/docs/prometheus/latest/installation/
- https://prometheus.io/docs/alerting/latest/configuration/
- https://grafana.com/docs/grafana/latest/
Installing the Exporters
Perform these steps on each Virtuozzo Server 7.5 node that you want to monitor.
Install exporter packages:

    # yum install node_exporter libvirt_exporter

Configure the firewall:

    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9177" accept'
    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9100" accept'
    # firewall-cmd --reload

Where <prom_IP> is the Prometheus IP address, port 9177 is used by the libvirt exporter, and port 9100 is used by the node exporter.

It is recommended to expose the metrics only to the Prometheus server. Unrestricted access to the metrics can be a security and stability risk.

To be able to monitor Virtuozzo Storage clients, open another port. For example:

    # firewall-cmd --permanent --zone=public --add-rich-rule='\
    rule family="ipv4" \
    source address="<prom_IP>/32" \
    port protocol="tcp" port="9999" accept'
    # firewall-cmd --reload

Launch the exporters:

    # systemctl start node_exporter
    # systemctl start libvirt-exporter
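Once the exporters are running, a quick request to each metrics endpoint can confirm that they respond before Prometheus is pointed at them. The following is a minimal sketch, assuming that curl is installed and that both exporters serve metrics at the conventional /metrics path. The helper names metrics_url and check_exporter are hypothetical and are not part of the exporter packages.

```bash
# Hypothetical helpers; not shipped with the exporter packages.

# Build the metrics URL for a host and port.
# Assumption: the exporter serves metrics at /metrics.
metrics_url() {
    printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Report whether an exporter answers within 5 seconds.
check_exporter() {
    local url
    url=$(metrics_url "$1" "$2")
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
        echo "OK   $url"
    else
        echo "FAIL $url"
        return 1
    fi
}

# Usage on the node itself:
#   check_exporter localhost 9100   # node exporter
#   check_exporter localhost 9177   # libvirt exporter
```

Running the same check from the Prometheus server, with the node's host name instead of localhost, also exercises the firewall rules.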
After setting up the exporters, on any Virtuozzo Server 7.5 node, obtain the sample configuration, rules, and alerts for Prometheus and the dashboards for Grafana, all of which are shipped with vz-prometheus-cfg. The files will be placed in /usr/share/vz-prometheus-cfg/.
Configuring Prometheus
You will need to configure Prometheus so it can start collecting metrics from Virtuozzo Server nodes. To do this, modify prometheus.yml based on the sample prometheus-example.yml shipped with vz-prometheus-cfg.
Copy the rule and alert files to the Prometheus server and set their paths in rule_files in prometheus.yml (see the example below).

Create target files that contain information about the exporters you want to scrape. By using multiple target files, you can group nodes by attributes like datacenter, cluster, and so on. The following examples create a server group cluster1 populated with five nodes and scrape their libvirt and node exporters, respectively:

    # cat cluster1-libvirt.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9177
      - hn02.cluster1.tld:9177
      - hn03.cluster1.tld:9177
      - hn04.cluster1.tld:9177
      - hn05.cluster1.tld:9177

    # cat cluster1-nodes.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9100
      - hn02.cluster1.tld:9100
      - hn03.cluster1.tld:9100
      - hn04.cluster1.tld:9100
      - hn05.cluster1.tld:9100

If these nodes are in the Virtuozzo Storage cluster, create a dedicated target file to be able to monitor the Virtuozzo Storage clients as well. For example:

    # cat cluster1.yml
    - labels:
        group: cluster1
      targets:
      - hn01.cluster1.tld:9999
      - hn02.cluster1.tld:9999
      - hn03.cluster1.tld:9999
      - hn04.cluster1.tld:9999
      - hn05.cluster1.tld:9999

Set paths to target files in scrape_configs in prometheus.yml (see the example below).

The Virtuozzo Storage job must be named fused.

A complete Prometheus configuration file may look like this:

    # cat prometheus.yml
    global:
      scrape_interval: 1m
      evaluation_interval: 1m
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - localhost:9093
    rule_files:
      - /prometheus-<version>linux-amd64/rules/vz-rules.yml
      - /prometheus-<version>linux-amd64/rules/vstorage-rules.yml
      - /prometheus-<version>linux-amd64/alerts/vz-alerts.yml
      - /prometheus-<version>linux-amd64/alerts/vstorage-alerts.yml
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']
      - job_name: libvirt
        relabel_configs:
        - source_labels: [__address__]
          target_label: instance
          regex: (.*)[:].+
        file_sd_configs:
        - files:
          - /prometheus-<version>linux-amd64/targets/cluster1-libvirt.yml
      - job_name: node
        relabel_configs:
        - source_labels: [__address__]
          target_label: instance
          regex: (.*)[:].+
        file_sd_configs:
        - files:
          - /prometheus-<version>linux-amd64/targets/cluster1-nodes.yml
      - job_name: fused
        relabel_configs:
        - source_labels: [__address__]
          target_label: instance
          regex: (.*)[:].+
        file_sd_configs:
        - files:
          - /prometheus-<version>linux-amd64/targets/cluster1.yml

Replace <version> with the actual version of Prometheus you are using.
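The relabel_configs entries in the sample configuration copy each target address into the instance label without the port, so that metrics from different exporters on the same node share one instance value. Prometheus anchors relabeling regular expressions at both ends and uses $1 as the default replacement. The sketch below imitates that behavior with sed for illustration only; the function name strip_port is hypothetical, and sed's extended regular expressions only approximate the RE2 engine used by Prometheus.

```bash
# Imitate the relabeling rule: regex (.*)[:].+ with the default
# replacement $1, anchored at both ends as Prometheus does.
# The [:] in the original regex matches one literal colon, so a
# plain ":" is used here.
strip_port() {
    printf '%s\n' "$1" | sed -E 's/^(.*):.+$/\1/'
}

strip_port hn01.cluster1.tld:9100   # prints hn01.cluster1.tld
strip_port hn01.cluster1.tld:9177   # prints hn01.cluster1.tld
```

An address without a port does not match the expression and is passed through unchanged, which mirrors how Prometheus skips the replacement when the regex does not match.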
After editing the Prometheus configuration file, restart the prometheus and alertmanager services.
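Target files follow a regular pattern, so for larger groups they can be generated rather than typed by hand. The following is a minimal sketch; the function name make_targets is hypothetical, and the output layout simply mirrors the target file examples in this guide.

```bash
# Hypothetical helper: print a file_sd target file for one group.
# Usage: make_targets <group> <port> <host>...
make_targets() {
    local group=$1 port=$2 host
    shift 2
    printf -- '- labels:\n    group: %s\n  targets:\n' "$group"
    for host in "$@"; do
        printf '  - %s:%s\n' "$host" "$port"
    done
}

# Example: print node exporter targets for two nodes of cluster1.
make_targets cluster1 9100 hn01.cluster1.tld hn02.cluster1.tld
```

Redirect the output to a target file such as cluster1-nodes.yml. Prometheus re-reads files listed under file_sd_configs when they change, so later updates to target files should not require another restart.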
To enable monitoring of Virtuozzo Storage clients listed in the target file:
Adjust fstab on each node, using the previously chosen port. For example:

    # cat /etc/fstab | grep ^vstorage
    vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev,prometheus=0.0.0.0:9999 0 0

Stop all virtual environments and re-mount the Virtuozzo Storage file system on each node, one after another, so only one node is down at any given moment.

    # umount /vstorage/cluster1
    # mount /vstorage/cluster1
Then start all virtual environments once again.
Alternatively, reboot each node one after another, so only one node is down at any given moment. Virtual environments are set to start on node reboots by default.
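To confirm which address the prometheus= mount option specifies on a node, the fstab entry can be parsed rather than read by eye. The following is a minimal sketch that assumes the fstab layout shown in the example above; the function name vstorage_prom_addr is hypothetical.

```bash
# Hypothetical helper: for every vstorage entry in fstab-formatted
# input (stdin), print the mount point and the prometheus= address.
vstorage_prom_addr() {
    awk '$1 ~ /^vstorage:\/\// {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^prometheus=/)
                print $2, substr(opts[i], 12)   # skip "prometheus="
    }'
}

# Usage on a node:
#   vstorage_prom_addr < /etc/fstab
```

The port printed here should match the one opened in the firewall and listed in the Virtuozzo Storage target file.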
Configuring Grafana
To see the data collected from the nodes in Grafana, do the following in the Grafana web panel:
- If you have not already done so, add the configured Prometheus server as a data source:
  - Navigate to Configuration -> Data Sources.
  - Click Add data source and select Prometheus.
  - Enter a name for the data source.
  - Specify the Prometheus IP address and port.
  - Click Save & Test.
- Import Virtuozzo Server dashboards. Perform these steps for each JSON file shipped with vz-prometheus-cfg.
  - Navigate to Dashboards -> Manage.
  - Click Import and Upload JSON file. Select a JSON file with a Grafana dashboard.
  - Select the previously configured Prometheus data source.
  - Click Import.
The Virtuozzo Server dashboards are now available in Dashboards Home.
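The data source can also be added without the web panel, through the Grafana HTTP API. The following is a minimal sketch, not the procedure described in this guide: it assumes that curl is installed, that the account used is allowed to manage data sources, and that Prometheus listens on port 9090. The function name datasource_payload and every value in angle brackets are hypothetical placeholders.

```bash
# Hypothetical helper: build the JSON body for a Prometheus data source.
# Note: the name and URL are inserted as-is, without JSON escaping.
datasource_payload() {
    printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$1" "$2"
}

# Usage (all values in angle brackets are placeholders):
#   datasource_payload Prometheus 'http://<prom_IP>:9090' |
#       curl --silent --fail --user '<user>:<password>' \
#            --header 'Content-Type: application/json' \
#            --data @- '<grafana_URL>/api/datasources'
```

Dashboards still need to be imported as described above, because each JSON file has to be bound to the data source.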
Supported Alerts
The following alerts are supported for Virtuozzo Server.
| Alert | Severity | Description | What to do |
|---|---|---|---|
| nodeTcpListenDrops | Error | A large number of TCP packets have been dropped on the node. | Inspect the network traffic for issues and fix them. |
| nodeTcpRetransSegs | Error | A large number of TCP packets have been retransmitted on the node. | Inspect the network traffic for issues and fix them. |
| nodeOutOfMemory | Error | The node has run out of memory. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
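| nodeOutOfSwap | Critical | The node has run out of swap memory. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
| nodeHighMemoryAllocationLatency | Warning | Node's memory allocation latency is too high. The node may be overloaded, resulting in unpredictable delays in operations. | Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |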
| nodeRxChecksummingDisabled | Error | The rx-checksumming feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
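| nodeTxChecksummingDisabled | Error | The tx-checksumming feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
| nodeScatterGatherDisabled | Error | The scatter-gather feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
| nodeTCPSegmentationOffloadDisabled | Error | The tcp-segmentation-offload feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |
| nodeGenericSegmentationOffloadDisabled | Error | The generic-segmentation-offload feature is disabled on the network interface. | These features are enabled by default. If they have been manually disabled, re-enable them for the network interface. |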
| nodeVzLicenseInactive | Critical | Node's Virtuozzo license is inactive. | Check and update the license. |
| guestPausedEIO | Critical | A virtual machine has been paused due to a disk I/O error. | Make sure that the node's partition where the VM's disks are stored has not run out of space. If it has, free up more space for the VM. If there is enough free space, evacuate the virtual machine (and any other critical data) from the physical disk it is stored on. Replace the physical disk as it is about to fail. |
| guestOsCrashed | Critical | A virtual machine has crashed after a BSOD or kernel panic in the guest OS. Collected for VMs only. | Find out the reasons for the crash, fix the VM, and restart any services that will not do so automatically. |
| nodeSMARTDiskError | Critical | A S.M.A.R.T. counter for a node's disk is below the threshold. | The node's disk is about to fail. Replace it as soon as possible. |
| nodeSMARTDiskWarning | Warning | A S.M.A.R.T. counter for a node's disk is greater than zero. | Inspect the health of the node's disk. You may need to replace it soon. |
| highCPUusage | Warning | A virtual environment's CPU usage has been over 90% for the last 10 minutes. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them in |
| highMemUsage | Warning | A virtual environment's memory usage has been over 95% for the last 10 minutes. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. |
| cpuUsageIncrease | Warning | A virtual environment's CPU usage has greatly increased compared to the previous week. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. |
| memUsageIncrease | Warning | A virtual environment's memory usage has greatly increased compared to the previous week. The alert is disabled by default. | Check the virtual environment for potential problems, including software issues or malware. |
| nodeHighDiskWriteLatency | Warning | Write operation latency for a node's disk has been too high for the last 10 seconds. | Find and fix the reason why the I/O requests have been taking so long. Reasons can be high I/O load or deterioration of the disk's health. |
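| nodeHighDiskReadLatency | Warning | Read operation latency for a node's disk has been too high for the last 10 seconds. | Find and fix the reason why the I/O requests have been taking so long. Reasons can be high I/O load or deterioration of the disk's health. |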
| pendingKernelReboot | Warning | The node has been updated but not rebooted to the latest kernel. | Reboot the node and switch to the latest kernel. |
| lowPageCache | Warning | The node has high load average and very small page cache. The node is overloaded, possibly due to memory overcommitment. | Find out why the node is overloaded. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes. |
| highPfcacheUsage | Warning | Node's pfcache disk is 90% full and may run out of space. | Add more space to the pfcache disk or clean it up. |
| highVstorageMountLatency | Warning | The latency of vstorage-mount requests on a node has been too high for the last 10 seconds. | Check the health of the Virtuozzo Storage and its components. Fix the found issues. |
| slowIoRequest | Info | Virtual machine's I/O requests have been taking longer than 10 seconds. | Inspect the storage, be it local disks or Virtuozzo Storage, for potential issues and fix them. |
| unresponsiveBalloonDriver | Info | Virtual machine's balloon driver is not responding. The VM has stopped reporting its memory usage statistics and is not releasing node's memory automatically. | Find out what has happened to the virtio_balloon kernel module inside the VM. Reload the module. |
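To see which alerts an alert file defines and which of them are currently commented out, the file can be scanned for alert entries. The following is a minimal sketch that assumes the standard Prometheus rule file layout, where each alert starts with a `- alert: <name>` line, and that disabled alerts are the same lines prefixed with #. The function name list_alerts is hypothetical.

```bash
# Hypothetical helper: list alert names in a Prometheus rule file
# and mark entries that are commented out.
list_alerts() {
    sed -n -E \
        -e 's/^[[:space:]]*-[[:space:]]*alert:[[:space:]]*([^[:space:]]+).*$/enabled  \1/p' \
        -e 's/^[[:space:]]*#[[:space:]]*-[[:space:]]*alert:[[:space:]]*([^[:space:]]+).*$/disabled \1/p' \
        "$1"
}

# Usage:
#   list_alerts vz-alerts.yml
```

After uncommenting an alert, restart or reload Prometheus so that the rule file is read again.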