Monitoring Nodes and Virtual Environments via Prometheus

The collected statistics are only intended for monitoring and are not suitable for billing purposes.

You can use Prometheus to monitor nodes running Virtuozzo Server 7.5 and newer, as well as the virtual environments (VEs) hosted on them. A typical list of required components includes:

Prometheus, a service that collects statistics from exporters and stores them in a time series database.
Alertmanager, a service that receives alerts from Prometheus and handles their delivery via various communication channels.
Grafana, a service that provides a web panel with flexible dashboards and supports Prometheus as a data source.
Exporters, services that are installed on Virtuozzo Server nodes and export metrics via a simple HTTP server.

This guide describes how to install the exporters, configure an existing Prometheus service, and import the dashboards to an existing Grafana panel. For details on installing Prometheus, Alertmanager, and Grafana, see the respective documentation.

Installing the Exporters

Perform these steps on each Virtuozzo Server 7.5 node that you want to monitor.

Install exporter packages:

# yum install node_exporter libvirt_exporter

Configure the firewall:

# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9177" accept'
# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9100" accept'
# firewall-cmd --reload

Where <prom_IP> is the Prometheus IP address, port 9177 is used by the libvirt exporter, and port 9100 is used by the node exporter.

It is recommended to expose the metrics only to the Prometheus server. Unrestricted access to the metrics can be a security and stability risk.
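Because the rules differ only in the port number, you can generate them in a loop. The following is a minimal sketch, not part of the product: the function names `exporter_rule` and `print_firewall_cmds` are hypothetical, and 192.0.2.10 is a placeholder for <prom_IP>. It only prints the commands so that you can review them before running them as root.

```shell
#!/bin/sh
# Hypothetical helper: build the rich rule that admits one source
# address to one TCP port.
exporter_rule() {
    prom_ip=$1
    port=$2
    printf 'rule family="ipv4" source address="%s/32" port protocol="tcp" port="%s" accept' \
        "$prom_ip" "$port"
}

# Print (do not run) one firewall-cmd call per port, then the reload.
print_firewall_cmds() {
    prom_ip=$1
    shift
    for port in "$@"; do
        printf "firewall-cmd --permanent --zone=public --add-rich-rule='%s'\n" \
            "$(exporter_rule "$prom_ip" "$port")"
    done
    printf 'firewall-cmd --reload\n'
}

# 192.0.2.10 is a placeholder; 9177 = libvirt exporter, 9100 = node exporter.
print_firewall_cmds 192.0.2.10 9177 9100
```

Printing instead of executing keeps the sketch safe to try on any machine; pipe the output to `sh` only after checking it.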

To be able to monitor Virtuozzo Storage clients, open another port. For example:

# firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="<prom_IP>/32" port protocol="tcp" port="9999" accept'
# firewall-cmd --reload

Launch the exporters:

# systemctl start node_exporter
# systemctl start libvirt-exporter
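Once the exporters are running, you may want to confirm that they answer over HTTP before configuring Prometheus. The following is a minimal sketch under two assumptions: that the exporters serve metrics at the `/metrics` path (the usual exporter default, not stated in this guide), and that `curl` is installed. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: URL of an exporter's metrics page.
# The /metrics path is an assumption (the usual exporter default).
metrics_url() {
    printf 'http://%s:%s/metrics' "$1" "$2"
}

# Report whether the exporter at HOST PORT answers within five seconds.
check_exporter() {
    if curl --silent --fail --max-time 5 --output /dev/null \
        "$(metrics_url "$1" "$2")"; then
        echo "$1:$2 up"
    else
        echo "$1:$2 DOWN"
    fi
}
```

For example, running `check_exporter hn01.cluster1.tld 9100` from the Prometheus server also verifies that the firewall rule lets the server through.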

After setting up the exporters, on any Virtuozzo Server 7.5 node, obtain the sample configuration, rules, and alerts for Prometheus and dashboards for Grafana:

# yum install vz-prometheus-cfg

The files will be placed in /usr/share/vz-prometheus-cfg/. For example:

# tree /usr/share/vz-prometheus-cfg/
/usr/share/vz-prometheus-cfg/
├── alerts
│   ├── vstorage-alerts.yml
│   └── vz-alerts.yml
├── dashboards
│   ├── grafana_hn_dashboard.json
│   ├── grafana_ve_dashboard.json
│   ├── grafana_win_ct_hn_dashboard.json
│   └── grafana_win_ct_ve_dashboard.json
├── prometheus-example.yml
├── rules
│   ├── vstorage-rules.yml
│   ├── vz-rules.yml
│   └── win_ct-rules.yml
└── targets
    ├── targets-example.yml
    └── vstorage-targets-example.yml

Configuring Prometheus

You will need to configure Prometheus so it can start collecting metrics from Virtuozzo Server nodes. To do this, modify prometheus.yml based on the sample prometheus-example.yml shipped with vz-prometheus-cfg.

Copy the rule and alert files to the Prometheus server and set their paths in rule_files in prometheus.yml (see the example below).

Create target files that contain information about the exporters you want to scrape. By using multiple target files, you can group nodes by attributes such as datacenter or cluster. The following examples create a server group cluster1 populated with five nodes and scrape their libvirt and node exporters, respectively:

# cat cluster1-libvirt.yml
- labels:
    group: cluster1
  targets:
  - hn01.cluster1.tld:9177
  - hn02.cluster1.tld:9177
  - hn03.cluster1.tld:9177
  - hn04.cluster1.tld:9177
  - hn05.cluster1.tld:9177

# cat cluster1-nodes.yml
- labels:
    group: cluster1
  targets:
  - hn01.cluster1.tld:9100
  - hn02.cluster1.tld:9100
  - hn03.cluster1.tld:9100
  - hn04.cluster1.tld:9100
  - hn05.cluster1.tld:9100
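The target files for one group differ only in the port, so they can be generated rather than typed by hand. The following is a minimal sketch; the function name `make_targets` is hypothetical, and the output simply follows the layout of the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print a target list for one group.
# Usage: make_targets GROUP PORT HOST...
make_targets() {
    group=$1
    port=$2
    shift 2
    # Header: one label set shared by every target in the file.
    printf '%s\n' '- labels:' "    group: $group" '  targets:'
    # One "host:port" entry per node.
    for host in "$@"; do
        printf '  - %s:%s\n' "$host" "$port"
    done
}

make_targets cluster1 9177 hn01.cluster1.tld hn02.cluster1.tld
```

Redirect the output to a file to create it, for example `make_targets cluster1 9100 hn01.cluster1.tld hn02.cluster1.tld > cluster1-nodes.yml`.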

If these nodes are in a Virtuozzo Storage cluster, create a dedicated target file to be able to monitor the Virtuozzo Storage clients as well. For example:

# cat cluster1.yml
- labels:
    group: cluster1
  targets:
  - hn01.cluster1.tld:9999
  - hn02.cluster1.tld:9999
  - hn03.cluster1.tld:9999
  - hn04.cluster1.tld:9999
  - hn05.cluster1.tld:9999

Set the paths to the target files in scrape_configs in prometheus.yml (see the example below).

The Virtuozzo Storage job must be named fused.

A complete Prometheus configuration file may look like this:

# cat prometheus.yml
global:
  scrape_interval:     1m
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
rule_files:
  - /prometheus-<version>linux-amd64/rules/vz-rules.yml
  - /prometheus-<version>linux-amd64/rules/vstorage-rules.yml
  - /prometheus-<version>linux-amd64/alerts/vz-alerts.yml
  - /prometheus-<version>linux-amd64/alerts/vstorage-alerts.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: libvirt
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1-libvirt.yml
  - job_name: node
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1-nodes.yml
  - job_name: fused
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: (.*)[:].+
    file_sd_configs:
      - files:
        - /prometheus-<version>linux-amd64/targets/cluster1.yml

Replace <version> with the actual version of Prometheus you are using.
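The relabel_configs entries in the example copy the host part of each target address into the instance label and drop the port. Prometheus anchors relabeling regular expressions at both ends and, unless told otherwise, replaces the label with the first capture group. The following sketch imitates that behavior with `sed` so you can see what the expression does; the function name `strip_port` is hypothetical.

```shell
#!/bin/sh
# Imitate the relabeling rule: the regex is anchored at both ends and
# the result is the first capture group (everything before the last colon).
strip_port() {
    printf '%s\n' "$1" | sed -E 's/^(.*)[:].+$/\1/'
}

strip_port hn01.cluster1.tld:9177   # prints hn01.cluster1.tld
```

An address without a port does not match the expression and is printed unchanged.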

After editing the Prometheus configuration file, restart the prometheus and alertmanager services.
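A syntax error in the configuration file can prevent Prometheus from starting, so it may be worth validating the file first. The following is a minimal sketch that assumes the `promtool` utility shipped with Prometheus is in your PATH and that the services are managed by systemd under the names used in this guide. The function name `safe_restart` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: restart the services only if the configuration
# file passes promtool's check.
safe_restart() {
    config=$1
    if promtool check config "$config"; then
        systemctl restart prometheus alertmanager
    else
        echo "not restarting: $config failed validation" >&2
        return 1
    fi
}
```

Call it with the path to your configuration file, for example `safe_restart prometheus.yml`.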

To enable monitoring of Virtuozzo Storage clients listed in the target file:

1. Adjust fstab on each node, using the previously chosen port. For example:

# cat /etc/fstab | grep ^vstorage
vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev,prometheus=0.0.0.0:9999 0 0

2. Stop all virtual environments and re-mount the Virtuozzo Storage file system on each node, one after another, so only one node is down at any given moment.

# umount /vstorage/cluster1
# mount /vstorage/cluster1

Then start all virtual environments once again.

Alternatively, reboot each node one after another, so only one node is down at any given moment. Virtual environments are set to start on node reboots by default.
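Before re-mounting, you can check that the fstab entry on a node carries the expected port. The following is a minimal sketch that reads fstab-formatted lines and prints the mount point and the value of the prometheus option for each Virtuozzo Storage entry; the function name `vstorage_metrics_endpoints` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read fstab-formatted lines on stdin and print
# "<mount point> <address:port>" for each vstorage entry that has a
# prometheus= mount option.
vstorage_metrics_endpoints() {
    awk '$1 ~ /^vstorage:/ {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^prometheus=/) {
                sub(/^prometheus=/, "", opts[i])
                print $2, opts[i]
            }
    }'
}

# Demo with the sample entry from this guide.
printf '%s\n' \
    'vstorage://cluster1 /vstorage/cluster1 fuse.vstorage defaults,_netdev,prometheus=0.0.0.0:9999 0 0' |
    vstorage_metrics_endpoints   # prints /vstorage/cluster1 0.0.0.0:9999
```

On a node, run `vstorage_metrics_endpoints < /etc/fstab`. No output means that no Virtuozzo Storage entry has the option set.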

Configuring Grafana

To see the data collected from the nodes in Grafana, do the following in the Grafana web panel:

If you have not already done so, add the configured Prometheus server as a data source:
1. Navigate to Configuration -> Data Sources.
2. Click Add data source, select Prometheus.
3. Enter a name for the data source.
4. Specify the Prometheus IP address and port.
5. Click Save & Test.
Import Virtuozzo Server dashboards. Perform these steps for each JSON file shipped with vz-prometheus-cfg.
1. Navigate to Dashboards -> Manage.
2. Click Import and Upload JSON file. Select a JSON file with a Grafana dashboard.
3. Select the previously configured Prometheus data source.
4. Click Import.

The Virtuozzo Server dashboards are now available in Dashboards Home.

Supported Alerts

The following alerts are supported for Virtuozzo Server.

Alert	Severity	Description	What to do
nodeTcpListenDrops	Error	A large number of TCP packets have been dropped on the node.	Inspect the network traffic for issues and fix them.
nodeTcpRetransSegs	Error	A large number of TCP packets have been retransmitted on the node.	Inspect the network traffic for issues and fix them.
nodeOutOfMemory	Error	The node has run out of memory.	Find out what has been consuming memory on the node and fix it. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.
nodeOutOfSwap	Critical	The node has run out of swap memory.	Same as for nodeOutOfMemory.
nodeHighMemoryAllocationLatency	Warning	The node's memory allocation latency is too high. The node may be overloaded, resulting in unpredictable delays in operations.	Same as for nodeOutOfMemory.
nodeRxChecksummingDisabled	Error	The rx-checksumming feature is disabled on the network interface.	These features are enabled by default. If they have been manually disabled, re-enable them for the network interface.
nodeTxChecksummingDisabled	Error	The tx-checksumming feature is disabled on the network interface.	Same as for nodeRxChecksummingDisabled.
nodeScatterGatherDisabled	Error	The scatter-gather feature is disabled on the network interface.	Same as for nodeRxChecksummingDisabled.
nodeTCPSegmentationOffloadDisabled	Error	The tcp-segmentation-offload feature is disabled on the network interface.	Same as for nodeRxChecksummingDisabled.
nodeGenericSegmentationOffloadDisabled	Error	The generic-segmentation-offload feature is disabled on the network interface.	Same as for nodeRxChecksummingDisabled.
nodeVzLicenseInactive	Critical	Node's Virtuozzo license is inactive.	Check and update the license.
guestPausedEIO	Critical	A virtual machine has been paused due to a disk I/O error.	Make sure that the node's partition where the VM's disks are stored has not run out of space. If it has, free up more space for the VM. If there is enough free space, evacuate the virtual machine (and any other critical data) from the physical disk it is stored on. Replace the physical disk as it is about to fail.
guestOsCrashed	Critical	A virtual machine has crashed after a BSOD or kernel panic in the guest OS. Collected for VMs only.	Find out the reasons for the crash, fix the VM, and restart any services that will not do so automatically.
nodeSMARTDiskError	Critical	A S.M.A.R.T. counter for a node's disk is below threshold.	The node's disk is about to fail. Replace it as soon as possible.
nodeSMARTDiskWarning	Warning	A S.M.A.R.T. counter for a node's disk is greater than zero.	Inspect the health of the node's disk. You may need to replace it soon.
highCPUusage	Warning	A virtual environment's CPU usage has been over 90% for the last 10 minutes. The alert is disabled by default.	Check the virtual environment for potential problems, including software issues or malware. These alerts are disabled by default. To enable any of them, uncomment them in `vz-alerts.yml` and restart Prometheus.
highMemUsage	Warning	A virtual environment's memory usage has been over 95% for the last 10 minutes. The alert is disabled by default.	Same as for highCPUusage.
cpuUsageIncrease	Warning	A virtual environment's CPU usage has greatly increased compared to the previous week. The alert is disabled by default.	Same as for highCPUusage.
memUsageIncrease	Warning	A virtual environment's memory usage has greatly increased compared to the previous week. The alert is disabled by default.	Same as for highCPUusage.
nodeHighDiskWriteLatency	Warning	Write operation latency for a node's disk has been too high for the last 10 seconds.	Find and fix the reason why the I/O requests have been taking so long. Reasons can be high I/O load or deterioration of the disk's health.
nodeHighDiskReadLatency	Warning	Read operation latency for a node's disk has been too high for the last 10 seconds.	Same as for nodeHighDiskWriteLatency.
pendingKernelReboot	Warning	The node has been updated but not rebooted to the latest kernel.	Reboot the node and switch to the latest kernel.
lowPageCache	Warning	The node has high load average and very small page cache. The node is overloaded, possibly due to memory overcommitment.	Find out why the node is overloaded. If the node is overloaded due to overselling, migrate some of its virtual environments to other nodes.
highPfcacheUsage	Warning	Node's pfcache disk is 90% full and may run out of space.	Add more space to the pfcache disk or clean it up.
highVstorageMountLatency	Warning	The latency of vstorage-mount requests on a node has been too high for the last 10 seconds.	Check the health of the Virtuozzo Storage and its components. Fix the found issues.
slowIoRequest	Info	A virtual machine's I/O requests have been taking longer than 10 seconds.	Inspect the storage, be it local disks or Virtuozzo Storage, for potential issues and fix them.
unresponsiveBalloonDriver	Info	Virtual machine's balloon driver is not responding. The VM has stopped reporting its memory usage statistics and is not releasing node's memory automatically.	Find out what has happened to the virtio_balloon kernel module inside the VM. Reload the module.