braid monitor
Checks btrfs device error stats, missing devices, and SMART alerts. Designed to be run automatically by a systemd timer (every 5 minutes by default). Exits with a status code that drives the alert pipeline.
When to use it
You normally don’t run this by hand – the braid-monitor.timer systemd unit runs it automatically. Use it directly when debugging the alert system or testing your monitoring setup.
Basic example
sudo braid monitor
No output on success. Check the exit code:
sudo braid monitor; echo $?
Exit codes
| Code | Meaning |
|---|---|
| 0 | Healthy, pool is offline, or another braid command holds the pool lock (cycle skipped, re-evaluated on the next timer tick) |
| 1 | Alert active – one or more problems detected |
| 2 | Pre-monitor setup error (e.g. pool-lock I/O, config load failure) |
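When scripting around monitor, it can help to turn the exit code into a readable label. The following is a minimal sketch: the function name describe_monitor_exit and the label strings are illustrative assumptions, not part of braid. Only the codes 0, 1, and 2 come from the table above.

```shell
# Hypothetical helper: translate a braid monitor exit code into a label.
# The labels are made up for illustration; the code meanings follow the table.
describe_monitor_exit() {
  case "$1" in
    0) echo "healthy-or-skipped" ;;  # healthy, pool offline, or pool lock held
    1) echo "alert-active" ;;        # one or more problems detected
    2) echo "setup-error" ;;         # pre-monitor setup failed
    *) echo "unexpected" ;;          # anything not listed in the table
  esac
}

# Usage (run by hand while debugging):
#   sudo braid monitor; describe_monitor_exit $?
```

Because exit 0 covers both a healthy pool and a skipped cycle, a label derived from the exit code alone cannot tell those cases apart.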
What triggers an alert (exit 1)
- btrfs device errors – any device in the pool has read, write, flush, corruption, or generation errors above the acknowledged baseline, including errors discovered during scrub.
- Missing device – btrfs reports a device as missing or a pool device has a null underlying path.
- SMART alert – smartd has written a SMART alert flag (via the braid smartd notifier).
- Computation error – a probe, parse, btrfs device stats call, mountinfo read, acked-stats baseline load, acked-stats save during self-heal, or alert-latch load/quarantine failed. Monitor fails closed: it latches a ComputationError cause so the beeper fires and braid status shows the detail.
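The "above the acknowledged baseline" rule for device errors can be illustrated with a small comparison. This is a sketch under stated assumptions, not braid's implementation: it assumes both the current counters and the baseline are available as text in the form printed by btrfs device stats (for example `[/dev/sdb].read_io_errs 3`). The real acked-stats.json schema is not described here, and the function name errors_above_baseline is hypothetical.

```shell
# errors_above_baseline BASELINE_FILE STATS_FILE
# Hypothetical helper. Both files hold lines such as:
#   [/dev/sdb].read_io_errs    3
# Prints every counter whose current value exceeds its baseline, together
# with the increase. A counter absent from the baseline is treated as 0.
errors_above_baseline() {
  awk '
    FILENAME == ARGV[1] { base[$1] = $2; next }        # first file: baseline
    $2 + 0 > base[$1] + 0 { print $1, $2 - base[$1] }  # second file: growth
  ' "$1" "$2"
}

# Usage (paths are placeholders):
#   btrfs device stats /mnt/pool > /tmp/now.txt
#   errors_above_baseline /tmp/acked.txt /tmp/now.txt
```

Comparing against a baseline rather than against zero is what lets an acknowledged error count stay quiet while any new growth still raises an alert.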
Flags
None. Monitor has no flags – it reads from the braid config and state files.
What happens under the hood
- Checks if the pool is mounted. If not, exits 0 (nothing to monitor).
- Runs btrfs device stats on the pool mount point.
- Loads the acknowledged-stats baseline (acked-stats.json) from a previous braid ack. If the file is unreadable or unparseable, monitor fails closed – it latches a ComputationError rather than firing every acknowledged cause against an empty baseline.
- Self-heals stale ack state before computing alerts: prunes baseline entries for devices no longer in the pool, and clears the missing-acked flag for any device that was acknowledged missing but is now present again. If the baseline changed, the updated acked-stats.json is written immediately; a write failure (e.g. EROFS, ENOSPC) is itself a fail-closed ComputationError.
- Computes alert causes against the reconciled baseline: btrfs device errors above the baseline, missing/null-underlying devices, and the smartd alert flag.
- Merges the causes into the alert latch (alert-latch.json). The latch is sticky: once an alert fires, it stays active until braid ack clears it.
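The sticky behaviour of the final merge step amounts to a set union of the existing latch and the live causes. The sketch below illustrates that idea only. It assumes causes can be represented as newline-separated names, which is not the actual alert-latch.json format, and merge_latch is a hypothetical name.

```shell
# merge_latch EXISTING LIVE
# Hypothetical helper. Each argument is a newline-separated list of cause
# names. Prints the sorted union, so a cause that is already latched
# survives a cycle in which it is no longer live.
merge_latch() {
  printf '%s\n%s\n' "$1" "$2" | sed '/^$/d' | sort -u
}

# Usage:
#   merge_latch "SmartdAlert" "ComputationError"
```

Nothing in the merge removes an entry, which matches the description above: only braid ack clears the latch.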
Alert pipeline
braid monitor      --writes--> alert-latch.json --> braid status / braid tui (display)
(timer, every 5m)  --exit 1--> braid-alert.service (beeper + alertCommand)
smartd             --start-->  braid-alert.service (beeper)
                   --writes--> smartd-alert --> next braid monitor cycle (latches SmartdAlert)
On exit 1, the braid-monitor.service wrapper starts braid-alert.service (the beeper, plus any alertCommand). After that, two things stay active until you braid ack, each held by a different mechanism:
- The latch and exit 1 – held by monitor. Each cycle it writes the live causes to alert-latch.json, merging them into the existing latch, and re-exits 1 while any cause remains. braid status and the TUI read the same file for display.
- The beep – held by braid-alert.service itself, not the read-back. Once started it stays active on its own (the backoff beep loop when beep is enabled, or a RemainAfterExit oneshot when it’s off), so the wrapper’s per-cycle systemctl start is a no-op and a skipped cycle (offline or lock-contended exit 0) does not silence it. The service never reads alert-latch.json or the smartd-alert flag.
smartd is a second, independent trigger: on a SMART fault it starts braid-alert.service directly and writes the smartd-alert flag, which the next monitor cycle latches as a SmartdAlert cause.
The beep stops only when braid ack clears the latch and runs systemctl stop braid-alert.service.
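The per-cycle wrapper behaviour described in this section can be sketched as a small shell function. This is an illustration under assumptions, not the contents of braid-monitor.service: the variables MONITOR_CMD and START_ALERT_CMD and the function run_cycle are hypothetical stand-ins that make the flow easy to dry-run with substitute commands.

```shell
# Hypothetical sketch of one monitor cycle; not the real unit file.
# The two commands are overridable so the flow can be tried without braid.
# They are expanded unquoted on purpose, so each is split into words.
MONITOR_CMD=${MONITOR_CMD:-"braid monitor"}
START_ALERT_CMD=${START_ALERT_CMD:-"systemctl start braid-alert.service"}

run_cycle() {
  rc=0
  $MONITOR_CMD || rc=$?
  if [ "$rc" -eq 1 ]; then
    # Exit 1 means an alert is active. Starting a service that is already
    # active is a no-op, so repeating this every cycle is harmless.
    $START_ALERT_CMD
  fi
  # Exit 0 (healthy or skipped) and exit 2 (setup error) start nothing here.
  return "$rc"
}

# Dry run with stand-in commands:
#   MONITOR_CMD=false START_ALERT_CMD="echo would-start" run_cycle
```

Note that nothing in this flow stops the alert service; in the description above, that happens only through braid ack.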
Related commands
- ack – acknowledge alerts and silence the beeper
- doctor – one-time diagnostic; pass --beep to test the alert beep
- status – shows active alerts in the status output