ZFS Metrics in Prometheus
By hernil
Now that we have set up Prometheus alerting via ntfy, we want to monitor some actually useful metrics and be alerted when things look out of place. We start off with perhaps the most important part of a system: the filesystem. Thankfully, both ZFS and Prometheus are common enough that we can stand on the shoulders of others to get this set up. See the sources at the bottom of the article.
Built-in ZFS monitoring (ZED)
ZFS ships with some built-in alerting known as ZED, the ZFS Event Daemon. However, it is mostly email-based, with support for a handful of other providers like Pushbullet. Ntfy support has recently been merged upstream but is not available in the repositories yet. Prometheus is on our “to learn” list anyway, so we pivoted away from ZED.
Monitoring ZFS metrics
Prometheus is pull-based, meaning it needs a source it can ask for metrics. We will run zfs_exporter as a systemd service to act as that source: it polls the local ZFS file systems and exposes their metrics for Prometheus to consume.
Download the latest release of zfs_exporter and put the binary in /usr/local/bin. Don’t forget to make it executable with chmod +x zfs_exporter.
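A minimal install sketch, assuming the pdf/zfs_exporter project on GitHub; the version number and the exact archive name below are placeholders, so adjust them to whatever the releases page currently shows:

# Placeholder version - check the releases page for the current one
VERSION=2.2.8
curl -LO https://github.com/pdf/zfs_exporter/releases/download/v${VERSION}/zfs_exporter-${VERSION}.linux-amd64.tar.gz
tar -xzf zfs_exporter-${VERSION}.linux-amd64.tar.gz
# install copies the binary and sets the executable bit in one go
sudo install -m 0755 zfs_exporter-${VERSION}.linux-amd64/zfs_exporter /usr/local/bin/zfs_exporter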
Create a new systemd service at /etc/systemd/system/zfs_exporter.service
[Unit]
Description=zfs_exporter
After=network-online.target
[Service]
Restart=always
RestartSec=5
TimeoutSec=5
User=root
Group=root
ExecStart=/usr/local/bin/zfs_exporter --collector.dataset-snapshot
[Install]
WantedBy=multi-user.target
Make sure it’s enabled and started
sudo systemctl daemon-reload && \
sudo systemctl enable zfs_exporter.service && \
sudo systemctl start zfs_exporter
And voilà, you are now exposing ZFS health metrics on port 9134. If exposing that port is a problem, consider firewalling it, especially if you are worried about leaking dataset and snapshot names.
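To sanity-check that the exporter is actually serving data, assuming you are on the host itself and using the default port, something like this should print pool health metrics:

sudo systemctl status zfs_exporter --no-pager
curl -s http://localhost:9134/metrics | grep zfs_pool_health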
If you want to exclude something at the very source, the following parameter can be useful (an example follows below).
--exclude=EXCLUDE
Exclude datasets/snapshots/volumes that match the provided regex (e.g. ‘^rpool/docker/’), may be specified multiple times.
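For instance, to keep Docker datasets out of the metrics entirely, the ExecStart line in the unit above could be extended like this; the rpool/var/lib/docker pattern is only an example, so swap in whatever matches your own layout:

ExecStart=/usr/local/bin/zfs_exporter --collector.dataset-snapshot --exclude='^rpool/var/lib/docker/'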
Pulling into Prometheus
In your prometheus.yml file, add:
- job_name: zfs_exporter
  metrics_path: /metrics
  scrape_timeout: 60s
  static_configs:
    - targets: ['my_host:9134']
  metric_relabel_configs:
    - source_labels: ['name']
      regex: '^([^@]*).*$'
      target_label: filesystem
      replacement: '${1}'
    - source_labels: ['name']
      regex: '^.*:.._(.*)$'
      target_label: snapshot_type
      replacement: '${1}'
I’ll just quote my source here:
Remember to set the scrape_timeout to at least 60s, as the exporter is sometimes slow to answer, especially on low-resource hardware.
The relabelings are done to be able to extract the filesystem and the backup type of the snapshots’ metrics. This assumes that you are using sanoid to do the backups, which gives metrics such as:
zfs_dataset_written_bytes{name="main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily",pool="main",type="snapshot"} 0
For that metric you’ll get that the filesystem is main/apps/nginx_public and the backup type is daily.
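To see what those two relabel regexes actually extract, you can run them against the example metric name with sed. This is just an illustration; Prometheus uses RE2 rather than POSIX regexes, but these particular patterns behave the same:

name='main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily'
echo "$name" | sed -E 's|^([^@]*).*$|\1|'    # filesystem    -> main/apps/nginx_public
echo "$name" | sed -E 's|^.*:.._(.*)$|\1|'   # snapshot_type -> daily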
Configuring alerts
Create the file alert.zfs.rules
groups:
  - name: ZFS
    rules:
      # ZFS related alerts
      - alert: ZfsPoolOutOfSpace
        expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ZFS pool out of space (instance {{ $labels.instance }})
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsPoolUnhealthy
        expr: zfs_pool_health > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: ZFS pool unhealthy (instance {{ $labels.instance }})
          description: "ZFS pool state is {{ $value }}. Where:\n - 0: ONLINE\n - 1: DEGRADED\n - 2: FAULTED\n - 3: OFFLINE\n - 4: UNAVAIL\n - 5: REMOVED\n - 6: SUSPENDED\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsCollectorFailed
        expr: zfs_scrape_collector_success != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ZFS collector failed (instance {{ $labels.instance }})
          description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      # ZFS snapshot related rules
      #
      # We ignore the system docker datasets
      - alert: ZfsDatasetWithNoSnapshotsError
        expr: zfs_dataset_used_by_dataset_bytes{filesystem!~"^.*rpool/var/lib/docker.*$"} > 200e3 unless on (instance,filesystem) count by (instance, filesystem, job) (zfs_dataset_used_bytes{type="snapshot"}) > 1
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The dataset {{ $labels.filesystem }} at {{ $labels.instance }} doesn't have any snapshot.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeFrequentlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='frequently'})[60m:15m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60m:15m]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the frequently snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeHourlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='hourly'})[2h:30m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2h:30m]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeDailySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'})[2d:12h]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2d:12h]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeMonthlySizeError
        expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'})[60d:15d]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60d:15d]) == 4
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The size of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeFrequentlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="frequently",type="snapshot"}) < 4)[16m:8m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[16m:8m]) == 2
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the frequent snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeHourlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="hourly",type="snapshot"}) < 24)[1h10m:10m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1h10m:10m]) == 7
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeDailyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'}) < 30)[1d2h:2h]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1d2h:2h]) == 13
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: ZfsSnapshotTypeMonthlyUnexpectedNumberError
        expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'}) < 6)[31d:1d]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[31d:1d]) == 31
        for: 5m
        labels:
          severity: error
        annotations:
          summary: The number of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
          description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - record: zfs_dataset_snapshot_bytes
        # This expression is not real for datasets that have children, so we're going to create this metric only for those datasets that don't have children
        # I'm also going to assume that the datasets that have children don't hold data
        expr: zfs_dataset_used_bytes - zfs_dataset_used_by_dataset_bytes and zfs_dataset_used_by_dataset_bytes > 200e3
      - alert: ZfsSnapshotTooMuchSize
        expr: zfs_dataset_snapshot_bytes / zfs_dataset_used_by_dataset_bytes > 5 and zfs_dataset_snapshot_bytes > 2e10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than five times the data space
          description: "The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than five times the data space\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Sources
Heavily inspired by: