First-Class Alerts for Disk Health
Context
Synology NAS boxes beep when a disk develops bad sectors — you hear it, SSH in, and deal with it. Without active alerting, a braid NAS user has no idea anything is wrong unless they happen to run braid status.
Decision
Alert as primary domain concept
braid has first-class Alerts. An Alert represents “something happened that needs human acknowledgment.” Beeping is one notification mechanism for an active alert. braid status is the primary surface for understanding alert details. braid ack acknowledges current alerts and silences notifications.
Shared alert computation
A single shared computation produces an AlertState consumed by all surfaces — braid monitor (exit code), braid status (banner + causes), TUI (banner + indicators). No surface re-encodes alert logic.
Alert causes
AlertCause is an explicit enum:
- BtrfsDeviceErrors { devid } – non-zero btrfs device stat counters above acked baseline, excluding alert-local missing devids
- MissingDevice { devid } – device missing from pool
- SmartdAlert – smartd SMART health warning
- ComputationError { detail } – probe or parse failed before a structured cause could be determined
The status banner is cause-neutral (“disk health issue detected”); cause details appear below it and in JSON output.
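As a rough illustration of the cause list and the cause-neutral banner, the enum could look like the sketch below. The field types (u64 devid, String detail), the detail_line helper, and the render_status function are assumptions made for this example, not the actual definitions in the braid codebase.

```rust
/// Illustrative mirror of the four causes listed above.
/// The `u64` devid type and the `String` detail type are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AlertCause {
    BtrfsDeviceErrors { devid: u64 },
    MissingDevice { devid: u64 },
    SmartdAlert,
    ComputationError { detail: String },
}

/// Cause-neutral banner text quoted from the ADR.
pub const BANNER: &str = "disk health issue detected";

impl AlertCause {
    /// Hypothetical helper: one detail line shown below the banner.
    pub fn detail_line(&self) -> String {
        match self {
            AlertCause::BtrfsDeviceErrors { devid } => {
                format!("devid {devid}: btrfs device stat counters above acked baseline")
            }
            AlertCause::MissingDevice { devid } => {
                format!("devid {devid}: device missing from pool")
            }
            AlertCause::SmartdAlert => "smartd SMART health warning".to_string(),
            AlertCause::ComputationError { detail } => format!("computation error: {detail}"),
        }
    }
}

/// Hypothetical renderer: the banner first, then one line per cause.
/// An empty cause list renders nothing.
pub fn render_status(causes: &[AlertCause]) -> Vec<String> {
    if causes.is_empty() {
        return Vec::new();
    }
    let mut lines = vec![BANNER.to_string()];
    lines.extend(causes.iter().map(AlertCause::detail_line));
    lines
}
```

The banner never names a cause; everything cause-specific lives in the lines that follow it.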
Two detection sources, one alert model
braid owns btrfs device stats + missing device detection. smartd owns SMART monitoring and writes a flag file (/var/lib/braid/smartd-alert) when triggered. The shared computation checks btrfs stats, missing devices, and smartd.
All five btrfs device stat counters trigger alerts
write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs. Any non-zero counter above the acked baseline triggers an alert for a present recognized devid. Devids in the alert-local missing set are excluded from BtrfsDeviceErrors and alert through MissingDevice instead.
Two kernel paths feed those counters: ordinary I/O and scrub. Scrub records
read, checksum, and generation failures by incrementing
BTRFS_DEV_STAT_READ_ERRS, BTRFS_DEV_STAT_CORRUPTION_ERRS, and
BTRFS_DEV_STAT_GENERATION_ERRS in
reference/linux/fs/btrfs/scrub.c:985-993. The monitor polls the same device
stats either way, so scrub-discovered uncorrectable errors reach the operator
through the same BtrfsDeviceErrors cause and beep as everyday I/O errors; a
separate scrub-status alert probe would be redundant with this pipeline.
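A minimal sketch of the per-counter comparison, assuming a plain struct with one u64 field per counter. DeviceStats and counters_above_baseline are hypothetical names, and the sketch deliberately ignores the stale-baseline case that the ADR handles as a separate backstop.

```rust
/// Hypothetical container for the five btrfs device stat counters.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct DeviceStats {
    pub write_io_errs: u64,
    pub read_io_errs: u64,
    pub flush_io_errs: u64,
    pub corruption_errs: u64,
    pub generation_errs: u64,
}

impl DeviceStats {
    /// Pairs every counter with its btrfs name, in a fixed order.
    fn named(&self) -> [(&'static str, u64); 5] {
        [
            ("write_io_errs", self.write_io_errs),
            ("read_io_errs", self.read_io_errs),
            ("flush_io_errs", self.flush_io_errs),
            ("corruption_errs", self.corruption_errs),
            ("generation_errs", self.generation_errs),
        ]
    }
}

/// Returns the names of all counters whose current value is above the
/// acked baseline. A non-empty result corresponds to a BtrfsDeviceErrors
/// cause for that devid.
pub fn counters_above_baseline(current: &DeviceStats, acked: &DeviceStats) -> Vec<&'static str> {
    let cur = current.named();
    let base = acked.named();
    let mut raised = Vec::new();
    for (i, (name, value)) in cur.iter().enumerate() {
        if *value > base[i].1 {
            raised.push(*name);
        }
    }
    raised
}
```

Because scrub and ordinary I/O increment the same counters, nothing in this comparison needs to know which kernel path produced the error.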
Latched alerts
Alerts persist until braid ack — even if the triggering condition disappears. This means “something happened that needs acknowledgment,” not “something is currently true.” This avoids cross-source bugs where one source clearing could hide another source’s alert, and matches Synology UX.
Ack snapshots gating inputs before probing
cmd_ack reads the alert latch, the smartd flag (smartd-alert), and the ack cleanup-pending sentinel (alert-cleanup-pending) once at function entry, before probe_pool_alerts. Every decision in that ack – the gate that decides whether to proceed, the cleanup-only retry branch, and the cleanup that removes alert files – references that single snapshot. If the sentinel is the only live signal, cmd_ack runs a cleanup-only retry branch before probe_pool_alerts so recovery does not depend on probe success and does not rewrite the acknowledged baseline. The alert probe is devid-keyed and intentionally does not depend on LUKS UUID identity or pool FSID. The pool lock at /run/braid-pool.lock already serializes monitor vs ack vs add/remove writers, but the smartd hook is intentionally unlocked, so a per-ack snapshot is the only mechanism that gives ack a coherent view of smartd state.
The smartd flag is cleared during cleanup when either the snapshot observed the flag active or the snapshot’s latch carried a SmartdAlert cause. The first arm covers the normal “flag present, ack silences it” case. The second arm is an explicit exception for the crash-recovery case where a prior cycle latched SmartdAlert but the flag was already absent at snapshot, such as a partially-applied earlier ack, manual state, or filesystem-level divergence. The user’s ack is aimed at the latched smartd source, so a flag that the smartd hook writes during the probe is part of that source and is cleared.
A flag that exists at cleanup time when the snapshot saw neither active smartd state nor a latched SmartdAlert cause arrived after the snapshot and is left in place: the next monitor cycle is responsible for latching it cleanly.
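The two snapshot-driven decisions above reduce to small boolean rules. The sketch below assumes the snapshot is four booleans; AckSnapshot and its field and method names are hypothetical, and the real cmd_ack may carry richer state.

```rust
/// Hypothetical view of what cmd_ack reads once at function entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AckSnapshot {
    /// The alert latch held at least one cause (or was corrupt).
    pub latch_active: bool,
    /// The latch carried a SmartdAlert cause.
    pub latch_has_smartd_alert: bool,
    /// The smartd-alert flag file existed.
    pub smartd_flag_active: bool,
    /// The alert-cleanup-pending sentinel existed.
    pub cleanup_pending: bool,
}

impl AckSnapshot {
    /// True when the sentinel is the only live signal, so ack should rerun
    /// cleanup without probing and without rewriting the acked baseline.
    pub fn cleanup_only_retry(&self) -> bool {
        self.cleanup_pending && !self.latch_active && !self.smartd_flag_active
    }

    /// True when cleanup should remove the smartd flag: either the flag was
    /// seen at snapshot time, or the latch already carried SmartdAlert.
    /// A flag that appears later with neither condition set is left alone.
    pub fn should_clear_smartd_flag(&self) -> bool {
        self.smartd_flag_active || self.latch_has_smartd_alert
    }
}
```

Both methods read only the snapshot, never the live filesystem, which is what keeps the ack coherent while the unlocked smartd hook may be writing.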
Ack state keyed by btrfs devid
Acked baselines are keyed by btrfs devid (acked-stats.json maps stringified devid to baseline) – no path or LUKS UUID mapping is required to associate a stats row with its baseline. The parser captures missing device devids from MISSING sentinel lines.
Membership cross-reference is performed at the alert-pipeline boundary, not at the baseline-keying level. AlertPoolState::recognized_devids (in cli/src/probe.rs) returns the union of present_devids, null_underlying, and missing_devids for the current cycle. Both compute_alert_state and snapshot_current filter btrfs device stats rows against that set before emitting causes or writing baselines. A stats row whose devid is outside the recognized set is treated as transient/stale identity: it cannot latch BtrfsDeviceErrors, and braid ack does not persist a baseline for it, which prevents a loop on the next monitor cycle’s reconcile_acked_stats prune.
Within the recognized set, compute_alert_state also skips rows whose devid is in the alert-local missing set (missing_devids plus null_underlying). Those rows alert through MissingDevice, not BtrfsDeviceErrors, regardless of the device string btrfs printed. snapshot_current still records recognized rows by devid before layering missing_acked = true from the missing set, so a returning member does not re-alert on stale counters already acknowledged while missing.
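The two-stage filter can be sketched as a routing function over a stats row's devid. The struct below reuses the field names from the text, but the BTreeSet<u64> types, the alert_local_missing helper, and the RowRoute enum are assumptions for illustration only.

```rust
use std::collections::BTreeSet;

/// Sketch of the per-cycle pool view; set types are assumed.
#[derive(Debug, Default)]
pub struct AlertPoolState {
    pub present_devids: BTreeSet<u64>,
    pub null_underlying: BTreeSet<u64>,
    pub missing_devids: BTreeSet<u64>,
}

impl AlertPoolState {
    /// Union of present, null-underlying, and missing devids.
    pub fn recognized_devids(&self) -> BTreeSet<u64> {
        self.present_devids
            .iter()
            .chain(&self.null_underlying)
            .chain(&self.missing_devids)
            .copied()
            .collect()
    }

    /// Hypothetical helper: the alert-local missing set.
    pub fn alert_local_missing(&self) -> BTreeSet<u64> {
        self.missing_devids.union(&self.null_underlying).copied().collect()
    }
}

/// How a btrfs device stats row is treated by the alert computation.
#[derive(Debug, PartialEq, Eq)]
pub enum RowRoute {
    /// Devid outside the recognized set: no cause, no baseline.
    Ignored,
    /// Devid is alert-local missing: alerts through MissingDevice.
    MissingDevice,
    /// Present, recognized devid: counters may raise BtrfsDeviceErrors.
    DeviceErrorsEligible,
}

pub fn route_stats_row(pool: &AlertPoolState, devid: u64) -> RowRoute {
    if !pool.recognized_devids().contains(&devid) {
        RowRoute::Ignored
    } else if pool.alert_local_missing().contains(&devid) {
        RowRoute::MissingDevice
    } else {
        RowRoute::DeviceErrorsEligible
    }
}
```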
Ack state separate from pool.json
Different concerns (identity vs acknowledgment), different write patterns, different risk profiles (precious vs disposable). Stored at /var/lib/braid/acked-stats.json.
Ack state is machine-local
On a new machine, acked state doesn’t exist — everything evaluates fresh.
braid monitor is a pure detector
Checks state and returns an exit code. Does not start/stop services. The systemd wrapper starts the beeper on exit 1.
Exit codes:
- 0 – ok, pool offline with no active alerts, or pool-lock-contended cycle (silently skipped; re-evaluated on the next timer tick)
- 1 – alert active (disk health issue OR indeterminate state latched as ComputationError, e.g. probe failure, parse failure, unmapped device)
- 2 – pre-cmd_monitor setup failure (e.g. pool-lock I/O, config load failure). Reserved for “could not even attempt to detect”; never emitted by cmd_monitor itself.
Fail closed: any failure inside cmd_monitor that leaves pool state indeterminate latches a ComputationError cause and reports exit 1, so the systemd wrapper starts the beeper. Exit 2 means the monitor never ran – a beep would be meaningless because there is no AlertState to report.
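The exit-code contract can be written as a total mapping from monitor outcome to code. MonitorOutcome, monitor_exit_code, and EXIT_SETUP_FAILURE are hypothetical names for this sketch; the point is that the function that stands in for cmd_monitor can only ever yield 0 or 1.

```rust
/// Hypothetical summary of what one monitor cycle concluded.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MonitorOutcome {
    Healthy,
    PoolOfflineNoAlerts,
    /// Pool lock was contended; the cycle is skipped until the next tick.
    LockContended,
    AlertActive,
    /// Probe or parse failure, latched as ComputationError (fail closed).
    Indeterminate { detail: String },
}

/// Exit 2 is reserved for setup failures that happen before the monitor
/// runs, so it lives outside the outcome mapping.
pub const EXIT_SETUP_FAILURE: i32 = 2;

/// Maps a completed monitor cycle to its exit code: 0 or 1, never 2.
pub fn monitor_exit_code(outcome: &MonitorOutcome) -> i32 {
    match outcome {
        MonitorOutcome::Healthy
        | MonitorOutcome::PoolOfflineNoAlerts
        | MonitorOutcome::LockContended => 0,
        MonitorOutcome::AlertActive | MonitorOutcome::Indeterminate { .. } => 1,
    }
}
```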
Alert-state mutators are serialized by /run/braid-pool.lock. Every command that writes acked-stats.json or alert-latch.json (monitor, ack, add, remove, remove-missing) acquires /run/braid-pool.lock in Rust dispatch (see ADR 026) before reading state or running probes. This is intentionally the same lock used by pool mutators: monitor and ack perform read-modify-write cycles around subprocess I/O, while add/remove/remove-missing prune acked baselines as membership changes. Sharing one lock keeps “baseline and latch clear” authoritative and prevents stale monitor snapshots from resurrecting acknowledged alerts.
Mount presence is read from /proc/self/mountinfo via mount_check::fstype_at_mount_via_fs, not from findmnt. A readable, well-formed mountinfo file with no entry for the configured mount point is legitimate PoolOffline and exits 0. Any mountinfo I/O failure, malformed line, or duplicate target entry is indeterminate state: it surfaces as ProbeError::MountInfo, latches ComputationError, exits 1, and starts the beeper.
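A simplified sketch of that classification over mountinfo text follows. It is not the real fstype_at_mount_via_fs: lookup_fstype, MountLookup, and MountInfoError are hypothetical, and the sketch does not decode octal escapes such as \040 in mount points. It relies only on the documented mountinfo layout, where the mount point is the fifth field and the filesystem type follows the "-" separator.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MountLookup {
    /// No entry for the target: corresponds to a legitimate PoolOffline.
    Absent,
    Found { fstype: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum MountInfoError {
    /// 1-based number of the line that could not be parsed.
    Malformed { line: usize },
    DuplicateTarget,
}

/// Scans every line, so a malformed line anywhere is reported even when
/// the target itself parsed cleanly.
pub fn lookup_fstype(mountinfo: &str, target: &str) -> Result<MountLookup, MountInfoError> {
    let mut found: Option<String> = None;
    for (idx, line) in mountinfo.lines().enumerate() {
        let malformed = MountInfoError::Malformed { line: idx + 1 };
        let fields: Vec<&str> = line.split(' ').collect();
        // Six fixed fields come first, then optional fields, then "-".
        let sep = match fields.iter().skip(6).position(|f| *f == "-") {
            Some(offset) => offset + 6,
            None => return Err(malformed),
        };
        let fstype = match fields.get(sep + 1) {
            Some(f) if !f.is_empty() => *f,
            _ => return Err(malformed),
        };
        if fields[4] == target {
            if found.is_some() {
                return Err(MountInfoError::DuplicateTarget);
            }
            found = Some(fstype.to_string());
        }
    }
    Ok(match found {
        Some(fstype) => MountLookup::Found { fstype },
        None => MountLookup::Absent,
    })
}
```

In the terms used above, Absent maps to exit 0, while either error variant would surface as ProbeError::MountInfo and exit 1.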
braid monitor also self-heals stale ack state: it resets missing_acked for devids that are present again after a drive replacement.
Periodic one-shot, not daemon
systemd timer + oneshot service. No mount condition on the timer — braid monitor handles pool-not-mounted gracefully (exit 0).
On by default
braid.monitor.enable defaults to true when braid.enable is true. beep/pcspkr failures are silently swallowed.
Audible doctor beep is opt-in
Plain braid doctor reports the alert-beep check as skipped after confirming
beep monitoring is configured. braid doctor --beep runs the canonical
braid-beep-probe wrapper so operators can test the real alert sound on
purpose. braid doctor --json always skips the audible probe, and --json
conflicts with --beep at parse time, so machine-readable output has no
audible side effects.
Latch as append/refresh log
The alert latch is an append/refresh log of all unacked causes from all sources. Each monitor cycle loads the existing latch, computes new causes, and merges. Previously-latched causes that aren’t re-detected are carried forward. Newly-detected causes replace their latched counterpart (same key = fresher evidence). This ensures all cause types persist until braid ack, even if the triggering condition resolves — fixing the invariant for all sources, not just journal.
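The merge rule can be sketched as follows. merge_causes and same_key are stand-ins written for this example (the ADR's merge_into_latch and same_cause_key may differ in signature and detail); the key rule assumed here is "same variant and same devid", with every ComputationError sharing one slot.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AlertCause {
    BtrfsDeviceErrors { devid: u64 },
    MissingDevice { devid: u64 },
    SmartdAlert,
    ComputationError { detail: String },
}

/// Assumed key rule: variant plus devid; all ComputationErrors collide.
fn same_key(a: &AlertCause, b: &AlertCause) -> bool {
    use AlertCause::*;
    match (a, b) {
        (BtrfsDeviceErrors { devid: x }, BtrfsDeviceErrors { devid: y }) => x == y,
        (MissingDevice { devid: x }, MissingDevice { devid: y }) => x == y,
        (SmartdAlert, SmartdAlert) => true,
        (ComputationError { .. }, ComputationError { .. }) => true,
        _ => false,
    }
}

/// Carries forward latched causes that were not re-detected, replaces
/// latched causes with fresher evidence for the same key, and appends
/// causes that are new this cycle.
pub fn merge_causes(latched: &[AlertCause], fresh: &[AlertCause]) -> Vec<AlertCause> {
    let mut merged = latched.to_vec();
    for cause in fresh {
        match merged.iter().position(|c| same_key(c, cause)) {
            Some(i) => merged[i] = cause.clone(),
            None => merged.push(cause.clone()),
        }
    }
    merged
}
```

Nothing in the merge removes a cause, so only braid ack can shrink the latch.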
Corrupt latch recovery
load_alert_latch returns Result<Option<AlertState>, LatchLoadError> so callers can distinguish three outcomes: file absent (Ok(None), normal – no active alerts), I/O failure (Err(Read)), and unparseable on-disk content (Err(Parse)). Each caller picks its own fail-closed policy:
- cmd_monitor is the only path that mutates the latch. On read/parse failure it quarantines the bad bytes by linking alert-latch.json to alert-latch.json.corrupt and then removing the live path, then writes a fresh latch containing a loud ComputationError cause whose detail names the failure. Quarantine uses hard_link + remove_file (not rename) so an already-existing sidecar is detected atomically by link(2)’s EEXIST; when that happens, the first sidecar is preserved as the highest-value forensic snapshot and the new corruption is surfaced only in the ComputationError detail. Any I/O failure during quarantine is folded into the same detail rather than silently dropped. The corruption signal is folded into a single ComputationError (not appended as a second cause), because merge_into_latch collapses every ComputationError into one slot via same_cause_key – appending two would silently drop one.
- cmd_status is the read-only surface: resolve_alert_state surfaces a corrupt latch as a ComputationError cause but never moves the file (status must not mutate state).
- cmd_ack treats a corrupt latch as an active alert for gating purposes – otherwise a genuinely unmounted ack would refuse with PoolNotMounted and the user would have no way to clear a corrupt file with the pool offline. Mounted ack and genuinely unmounted ack clean up both alert-latch.json and the .corrupt sidecar. A foreign fstype at the configured mount point is a probe error, not offline ack, and preserves the unreadable latch bytes.
This preserves “latched until ack” even when the on-disk state is unreadable: the operator sees a loud ComputationError, the bad bytes are preserved for forensics until an ack path that can safely clean them up, and ack succeeds for mounted or genuinely unmounted pools.
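The quarantine step lends itself to a short sketch using only std::fs. read_latch_bytes, quarantine_latch, and the Quarantine enum are hypothetical; parsing is left out, and removing the live file even when the sidecar already exists is an assumption based on the monitor writing a fresh latch afterwards.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Distinguishes "file absent" (Ok(None)) from a genuine read failure.
pub fn read_latch_bytes(path: &Path) -> io::Result<Option<Vec<u8>>> {
    match fs::read(path) {
        Ok(bytes) => Ok(Some(bytes)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Quarantine {
    /// The bad bytes are now reachable through the sidecar path.
    Linked,
    /// A sidecar already existed and was left untouched.
    SidecarPreserved,
}

/// Links the live latch to the sidecar, then removes the live path.
/// hard_link fails with AlreadyExists instead of overwriting, which is
/// what protects the first forensic snapshot.
pub fn quarantine_latch(live: &Path, sidecar: &Path) -> io::Result<Quarantine> {
    let outcome = match fs::hard_link(live, sidecar) {
        Ok(()) => Quarantine::Linked,
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Quarantine::SidecarPreserved,
        Err(e) => return Err(e),
    };
    fs::remove_file(live)?;
    Ok(outcome)
}
```

A rename would silently replace an existing sidecar, which is why the link-then-remove pair is the safer shape here.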
Cleanup ordering and retry-on-failure
Ack cleanup preserves three invariants. First, the beeper stop hook is attempted before any fallible cleanup operation; the hook is best-effort, so the invariant is that the stop attempt runs, not that sound was proven stopped. Second, destructive removals run in smartd-alert -> alert-latch.json -> alert-latch.json.corrupt order, so the corrupt sidecar leaves last and the forensic guarantee above is preserved across cleanup failures. Third, ack writes alert-cleanup-pending after the stop hook and before the first destructive step, then clears it only after the last destructive step succeeds.
CleanupFailed recovery has two cases. If creating alert-cleanup-pending itself fails, no destructive removal has run, so the original entry signals are byte-identical and the retry is driven by the normal ack path. If marker creation succeeded and a later step failed, the sentinel remains on disk. cmd_ack consults that sentinel before probing; when it is the only live signal, the hoisted cleanup-only branch reruns cleanup without probe_pool_alerts, without runner requests, and without rewriting acked-stats.json. Either path makes re-running braid ack after fixing the I/O fault genuinely idempotent.
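The ordering invariants can be modelled as a fixed step list driven by a caller-supplied action. Step, CLEANUP_ORDER, and run_cleanup are hypothetical names, and the closure stands in for the real filesystem and beeper operations; this is a model of the ordering, not the cleanup code itself.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    StopBeeper,
    WriteCleanupPending,
    RemoveSmartdAlert,
    RemoveLatch,
    RemoveCorruptSidecar,
    ClearCleanupPending,
}

/// Stop hook first, sentinel before any destructive step, corrupt
/// sidecar last among removals, sentinel cleared only at the very end.
pub const CLEANUP_ORDER: [Step; 6] = [
    Step::StopBeeper,
    Step::WriteCleanupPending,
    Step::RemoveSmartdAlert,
    Step::RemoveLatch,
    Step::RemoveCorruptSidecar,
    Step::ClearCleanupPending,
];

/// Runs the steps in order and returns the first step that failed.
/// The beeper stop is best-effort: it is always attempted, and its
/// failure does not abort cleanup.
pub fn run_cleanup<F>(mut apply: F) -> Result<(), Step>
where
    F: FnMut(Step) -> Result<(), String>,
{
    for step in CLEANUP_ORDER.iter().copied() {
        let result = apply(step);
        if step == Step::StopBeeper {
            continue;
        }
        if result.is_err() {
            return Err(step);
        }
    }
    Ok(())
}
```

If a removal fails, ClearCleanupPending never runs, so the sentinel stays on disk and a later ack can take the cleanup-only branch.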
Offline ack policy
braid ack works with the pool locked, but only when the pool is genuinely unmounted: /proc/self/mountinfo has no entry for the configured mount point. If the configured mount point is occupied by a non-btrfs filesystem, cmd_ack returns ProbeError::NotBtrfs; it must not clear alert-latch.json, remove smartd-alert, create or rewrite acked-stats.json, stop the beeper, or quarantine corrupt latch bytes.
For genuine offline ack, the persistence layer has an asymmetry by cause type:
- MissingDevice { devid } – offline ack reads the latch and applies missing_acked = true to that devid in acked-stats.json (insert-or-update; existing device_stats baselines are preserved). The next mounted monitor cycle suppresses the cause, and reconcile_acked_stats self-heals missing_acked back to false if the device returns.
- BtrfsDeviceErrors { devid } – offline ack refuses with an actionable error (“cannot ack btrfs device errors while pool is offline – unlock the pool first”). The counter baseline that suppresses re-firing is the current output of btrfs device stats, which requires a mounted pool. Refusing the whole ack (not partial-acking other causes) avoids leaving the operator in an “I acked but it still says ALERT” state.
- SmartdAlert – offline ack removes the smartd flag file (the authoritative trigger source); no acked-stats.json write is needed.
- ComputationError – offline ack removes the latch; the cause re-fires on the next monitor cycle only if the underlying computation still fails.
Coupled to the asymmetry: offline ack only loads acked-stats.json when at least one MissingDevice cause is latched, so an unrelated corrupt acked-stats.json cannot block an offline ack of a pure SmartdAlert or ComputationError latch. When acked-stats.json is loaded (a MissingDevice cause is being applied), the fail-closed load_acked_stats_fallible is used so corrupt files are propagated as I/O errors rather than silently overwritten – matching the policy in drop_ghost_acked_for_devids.
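The per-cause asymmetry and the conditional acked-stats.json load can be sketched as a planning function over the latched causes. OfflineAckPlan and plan_offline_ack are hypothetical, the enum is redeclared so the example stands alone, and the refusal text is quoted from the policy above.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AlertCause {
    BtrfsDeviceErrors { devid: u64 },
    MissingDevice { devid: u64 },
    SmartdAlert,
    ComputationError { detail: String },
}

/// What a genuine offline ack would do; the latch itself is removed in
/// every non-refused case, so it is not modelled as a field.
#[derive(Debug, PartialEq, Eq)]
pub struct OfflineAckPlan {
    /// Devids that get missing_acked = true in acked-stats.json.
    pub mark_missing_acked: Vec<u64>,
    pub remove_smartd_flag: bool,
    /// acked-stats.json is only loaded when a MissingDevice is latched.
    pub load_acked_stats: bool,
}

pub const OFFLINE_REFUSAL: &str =
    "cannot ack btrfs device errors while pool is offline – unlock the pool first";

/// Refuses the whole ack if any BtrfsDeviceErrors cause is latched;
/// otherwise collects the per-cause actions.
pub fn plan_offline_ack(latched: &[AlertCause]) -> Result<OfflineAckPlan, &'static str> {
    let mut plan = OfflineAckPlan {
        mark_missing_acked: Vec::new(),
        remove_smartd_flag: false,
        load_acked_stats: false,
    };
    for cause in latched {
        match cause {
            AlertCause::BtrfsDeviceErrors { .. } => return Err(OFFLINE_REFUSAL),
            AlertCause::MissingDevice { devid } => plan.mark_missing_acked.push(*devid),
            AlertCause::SmartdAlert => plan.remove_smartd_flag = true,
            AlertCause::ComputationError { .. } => {}
        }
    }
    plan.load_acked_stats = !plan.mark_missing_acked.is_empty();
    Ok(plan)
}
```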
Acked-stats hygiene across pool membership changes
btrfs allocates new devids as last_devid + 1 (kernel: fs/btrfs/volumes.c, find_next_devid), so a remove-then-add sequence reuses the removed devid only when that devid was the current maximum at remove time. Removing a non-max devid leaves a permanent gap. A stale acked-stats entry for a reused devid would otherwise carry the previous holder’s device_stats baseline (suppressing health alerts until counters exceed the ghost) or its missing_acked = true flag (suppressing missing-device alerts) onto the fresh disk.
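A toy model makes the reuse condition concrete. It assumes allocation is simply "highest existing devid plus one", with 1 for an empty pool; next_devid and remove_then_add are hypothetical helpers, not kernel or braid code.

```rust
use std::collections::BTreeSet;

/// Toy model of devid allocation: one past the current maximum,
/// or 1 when the pool has no devices (assumed starting point).
pub fn next_devid(members: &BTreeSet<u64>) -> u64 {
    members.iter().next_back().map_or(1, |max| max + 1)
}

/// Removes one devid, adds a fresh device, and returns the devid the
/// fresh device receives.
pub fn remove_then_add(members: &mut BTreeSet<u64>, removed: u64) -> u64 {
    members.remove(&removed);
    let assigned = next_devid(members);
    members.insert(assigned);
    assigned
}
```

With members {1, 2, 3}, removing 3 and adding a disk hands out 3 again, so a stale acked entry for devid 3 would attach to the new disk. Removing 2 instead hands out 4 and leaves 2 as a permanent gap.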
Invariant: a reused devid must never inherit the previous holder’s ack baseline.
Three layers enforce it:
- Add-time guard (correctness boundary): cmd_add clears acked-stats unconditionally on bootstrap and drops the assigned devid per-disk inside the live-pool add loop. cmd_recover, when finishing an interrupted add, mirrors both: bootstrap-recovery calls remove_acked_stats, and live-add recovery drops every journaled target’s devid (per-arm after a replayed pool_add_device, and via a final sweep when the target was already live at recovery entry – the committed-but-closed crash window). Cleanup failure here is command-fatal in both cmd_add and cmd_recover: the error names the stage and instructs the user to delete the file before relying on alerts.
- Remove-time prune (hygiene): cmd_remove and cmd_remove_missing drop the affected devid on success. cmd_recover mirrors the prune for committed removes only – the Remove guard may restore a target whose eviction did not complete, in which case its acked-stats entry is a legitimate baseline that must survive. Cleanup failure here is non-fatal (warning) – the next add for that devid will catch it via layer 1. The cmd_remove planner enriches the journaled pre_membership with the target’s live btrfs devid so recovery can resolve it after a discover-time pool.json.
- Monitor reconcile (defense-in-depth): cmd_monitor prunes orphan entries (devid no longer in pool.present_devids, pool.null_underlying, or pool.missing_devids) every cycle. This catches crash recovery and manual btrfs operations performed outside braid. It cannot detect ghost data once a devid is reused, so the add-time layer is the boundary for that case. The read itself uses load_acked_stats_fallible so a corrupt or unreadable acked-stats.json latches ComputationError instead of silently re-firing acked causes against an empty baseline, matching offline ack and drop_ghost_acked_for_devids. A save failure during reconcile latches the same ComputationError so a persistent FS write fault (EROFS, ENOSPC, or EACCES on acked-stats.json or its parent) surfaces via exit-1 beep rather than accumulating only in journald.
Backstop: independently of those three layers, the alert computation fails loud when the acked baseline is no longer comparable to the current counter stream. compute_alert_state treats an acked counter that exceeds the current btrfs device stats value as 0 and alerts on any nonzero current. btrfs device-stats counters are persistent and monotonic (reset only by -z, which braid never runs), so the only ways a current value can sit below the ack baseline are a reused devid that inherited a ghost baseline before add/recover cleanup dropped its acked entry (the committed-but-closed crash window above), or an operator resetting the live counters with btrfs device stats -z. The three layers aim to remove a stale baseline; this guard ensures that if one transiently survives, it cannot suppress a later nonzero counter.
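The backstop rule for a single counter fits in a few lines. counter_alerts is a hypothetical name; the sketch only encodes the comparison described above, where an acked value above the current value is treated as 0.

```rust
/// Decides whether one counter should raise an alert.
/// A baseline above the current value is not comparable to the live
/// counter stream, so it is treated as 0 (fail loud).
pub fn counter_alerts(current: u64, acked: u64) -> bool {
    let effective_baseline = if acked > current { 0 } else { acked };
    current > effective_baseline
}
```

For example, current = 2 against a ghost baseline of 5 alerts, whereas a naive current > acked comparison would stay silent until the counter passed 5. A current value of 0 never alerts, whatever the baseline.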
Rejected alternatives
- Daemon-based monitoring: more complex lifecycle management for no benefit over a timer + oneshot
- Storing alerts in a database: unnecessary complexity; file-based flag + JSON is sufficient
- Per-surface alert logic: each surface re-checking btrfs stats independently would lead to inconsistencies
- Counter-based thresholds (e.g., alert after N errors): any non-zero counter above baseline is worth investigating; thresholds delay detection
- Kernel journal scanning: originally implemented as a supplementary alert source scanning journalctl -k for “BTRFS error” messages. Removed because btrfs commits every 30 seconds, which increments device stats counters for any disk error within that window. The 5-minute monitor poll catches those counters reliably. Journal scanning was redundant with device stats and added significant complexity (cursor tracking, regex parsing, crash-safe cursor ordering, latch merge logic). Repro VMs in tests/repro/kernel-journal-* preserve the empirical evidence from the original investigation.