ZFS Metrics in Prometheus
By hernil
Now that we have set up Prometheus alerting via ntfy, we want to monitor some actually useful metrics and be alerted when things look out of place. We start off with perhaps the most important part of a system: the filesystem. Thankfully, both ZFS and Prometheus are common enough that we can stand on the shoulders of others to get this set up. See the sources at the bottom of the article.
Built-in ZFS monitoring (ZED)
ZFS ships with some built-in alerting known as ZED, the ZFS Event Daemon. However, it is mostly email-based, with support for a handful of other providers like Pushbullet. Ntfy support has recently been merged upstream but is not available in the repositories yet. Prometheus is on our “to learn” list anyway, so we pivoted away from ZED.
Monitoring ZFS metrics
Prometheus is pull-based, meaning it needs a source it can ask for metrics. We will run zfs_exporter as a systemd service to act as that source: it polls the local ZFS file systems and exposes their metrics for Prometheus to consume.
Download the latest release of zfs_exporter and put the binary in /usr/local/bin. Don’t forget to make it executable with chmod +x zfs_exporter.
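A minimal install sketch, assuming the pdf/zfs_exporter project on GitHub; the version number and the exact archive name below are placeholders, so adjust them to whatever the releases page currently shows:

# Placeholder version - check the releases page for the current one
VERSION=2.2.8
curl -LO https://github.com/pdf/zfs_exporter/releases/download/v${VERSION}/zfs_exporter-${VERSION}.linux-amd64.tar.gz
tar -xzf zfs_exporter-${VERSION}.linux-amd64.tar.gz
# install copies the binary and sets the executable bit in one go
sudo install -m 0755 zfs_exporter-${VERSION}.linux-amd64/zfs_exporter /usr/local/bin/zfs_exporter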
Create a new systemd service at /etc/systemd/system/zfs_exporter.service
[Unit]
Description=zfs_exporter
After=network-online.target
[Service]
Restart=always
RestartSec=5
TimeoutSec=5
User=root
Group=root
ExecStart=/usr/local/bin/zfs_exporter --collector.dataset-snapshot
[Install]
WantedBy=multi-user.target
Make sure it’s enabled and started
sudo systemctl daemon-reload && \
sudo systemctl enable zfs_exporter.service && \
sudo systemctl start zfs_exporter
And voilà, you are now exposing ZFS health metrics on port 9134. If exposing that port is a problem, consider firewalling it, especially if you are worried about leaking dataset and snapshot names.
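To sanity-check that the exporter is actually serving data, assuming you are on the host itself and using the default port, something like this should print pool health metrics:

sudo systemctl status zfs_exporter --no-pager
curl -s http://localhost:9134/metrics | grep zfs_pool_health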
If you want to exclude something at the very source, the following parameter can be useful (an example follows below).
--exclude=EXCLUDE
Exclude datasets/snapshots/volumes that match the provided regex (e.g. ‘^rpool/docker/’), may be specified multiple times.
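For instance, to keep Docker datasets out of the metrics entirely, the ExecStart line in the unit above could be extended like this; the rpool/var/lib/docker pattern is only an example, so swap in whatever matches your own layout:

ExecStart=/usr/local/bin/zfs_exporter --collector.dataset-snapshot --exclude='^rpool/var/lib/docker/'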
Pulling into Prometheus
In your prometheus.yml file, add:
- job_name: zfs_exporter
  metrics_path: /metrics
  scrape_timeout: 60s
  static_configs:
    - targets: ['my_host:9134']
  metric_relabel_configs:
    - source_labels: ['name']
      regex: '^([^@]*).*$'
      target_label: filesystem
      replacement: '${1}'
    - source_labels: ['name']
      regex: '^.*:.._(.*)$'
      target_label: snapshot_type
      replacement: '${1}'
I’ll just quote my source here:
Remember to set the scrape_timeout to at least 60s, as the exporter is sometimes slow to answer, especially on low-resource hardware.
The relabelings are done to be able to extract the filesystem and the backup type of the snapshots’ metrics. This assumes that you are using sanoid to do the backups, which gives metrics such as:
zfs_dataset_written_bytes{name="main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily",pool="main",type="snapshot"} 0
For that metric you’ll get that the filesystem is main/apps/nginx_public and the backup type is daily.
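To see what those two relabel regexes actually extract, you can run them against the example metric name with sed. This is just an illustration; Prometheus uses RE2 rather than POSIX regexes, but these particular patterns behave the same:

name='main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily'
echo "$name" | sed -E 's|^([^@]*).*$|\1|'    # filesystem    -> main/apps/nginx_public
echo "$name" | sed -E 's|^.*:.._(.*)$|\1|'   # snapshot_type -> daily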
Configuring alerts
Create the file alert.zfs.rules
groups:
  - name: ZFS
    rules:
      # ZFS related alerts
      - alert: ZfsPoolOutOfSpace
        expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ZFS pool out of space (instance {{ $labels.instance }})
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsPoolUnhealthy
        expr: zfs_pool_health > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: ZFS pool unhealthy (instance {{ $labels.instance }})
          description: "ZFS pool state is {{ $value }}. Where:\n - 0: ONLINE\n - 1: DEGRADED\n - 2: FAULTED\n - 3: OFFLINE\n - 4: UNAVAIL\n - 5: REMOVED\n - 6: SUSPENDED\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsCollectorFailed
        expr: zfs_scrape_collector_success != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ZFS collector failed (instance {{ $labels.instance }})
          description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      # ZFS snapshot related rules
      #
      # We ignore the system docker datasets
      - alert: ZfsDatasetWithNoSnapshotsError
        expr: zfs_dataset_used_by_dataset_bytes{filesystem!~"^.*rpool/var/lib/docker.*$"} > 200e3 unless on (instance,filesystem) count by (instance, filesystem, job) (zfs_dataset_used_bytes{type="snapshot"}) > 1
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The dataset {{ $labels.filesystem }} at {{ $labels.instance }} doesn't have any snapshot.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeFrequentlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='frequently'})[60m:15m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60m:15m]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the frequently snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeHourlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='hourly'})[2h:30m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2h:30m]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeDailySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'})[2d:12h]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2d:12h]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeMonthlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'})[60d:15d]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60d:15d]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeFrequentlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="frequently",type="snapshot"}) < 4)[16m:8m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[16m:8m]) == 2
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the frequent snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeHourlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="hourly",type="snapshot"}) < 24)[1h10m:10m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1h10m:10m]) == 7
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeDailyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'}) < 30)[1d2h:2h]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1d2h:2h]) == 13
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeMonthlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'}) < 6)[31d:1d]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[31d:1d]) == 31
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - record: zfs_dataset_snapshot_bytes
        # This expression is not real for datasets that have children, so we're going to create this metric only for those datasets that don't have children
        # I'm also going to assume that the datasets that have children don't hold data
        expr: zfs_dataset_used_bytes - zfs_dataset_used_by_dataset_bytes and zfs_dataset_used_by_dataset_bytes > 200e3
      - alert: ZfsSnapshotTooMuchSize
        expr: zfs_dataset_snapshot_bytes / zfs_dataset_used_by_dataset_bytes > 5 and zfs_dataset_snapshot_bytes > 2e10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than five times the data space
          description: "The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than five times the data space\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Sources
Heavily inspired by: