braid doctor

Runs diagnostic checks on your braid configuration, pool health, RAID profile consistency, LUKS headers, auto-suspend wake path, and alerting hardware. Reports issues and suggests fixes.

When to use it

After initial setup, to verify everything is wired correctly.
Periodically, to catch drift (missing disks, mixed RAID profiles, broken alert speaker).
When something seems wrong and you want a quick health summary.

Basic example

sudo braid doctor

Output:

[ok]   config file     /etc/braid/config.json exists and is valid JSON
[ok]   config schema   required fields present and valid
[ok]   config perms    /etc/braid/config.json permissions ok
[ok]   declared disks  all 3 declared disks present
[ok]   missing devs    no missing devices
[ok]   enospc risk     per-device unallocated space healthy
[ok]   foreign uuids   no foreign LUKS UUIDs in live pool
[ok]   data profiles   data profile: RAID1
[ok]   meta profiles   metadata profile: RAID1
[ok]   system profiles  system profile: RAID1
[ok]   meta pressure   metadata pressure within bounds
[ok]   paused balance  no paused balance
[ok]   smart selftest disk1  passed ~2 days ago
[ok]   smart selftest disk2  passed ~12 days ago
[ok]   smart selftest disk3  passed ~30 days ago
[skip] alert beep      skipped (pass --beep to play the audible alert test beep)
[skip] ups daemon      skipped (braid.ups not enabled)
[skip] braid-online    skipped (braid.ups not enabled)
[ok]   mountpoint seal  pool is mounted -- the live filesystem governs writes
[skip] wake-on-lan     skipped (braid.autoSuspend not enabled)

The SMART self-test check emits one row per pool drive. If a drive has no recent completed self-test, the row includes a paste-ready smartctl command:

[warn] smart selftest disk2  no completed SMART self-test recorded -- run: smartctl -t short /dev/disk/by-id/...

The hint uses the stable by-id path: braid’s own diagnostic read prefers the member’s live backing device, but a smartctl -t short you run later should use by-id, which survives reboots and controller reordering.

When the pool is mounted, the declared disks check also confirms each member is assembled into the live btrfs pool. A member that passes its LUKS identity checks but is missing from the live device set – e.g. after a degraded mount or an interrupted reconciliation – warns as offline rather than counting as present:

[warn] declared disks  1/3 disks have problems: 1 present but not in the live pool: disk2 (/dev/disk/by-id/...)

To test the real alert sound:

sudo braid doctor --beep

Machine-readable output

sudo braid doctor --json

Prints a JSON object with status (one of ok, warn, fail, skip) and a checks array. Each check has name, status, and message. Per-drive checks also include subject.

--json mode never plays the alert beep test. The check still appears in the report as skip. --json and --beep conflict; run a separate sudo braid doctor --beep when you want to test the audible alert path.
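
A script that consumes the report could look like the minimal Python sketch below. The sample report is illustrative only: it uses the field names described above (`status`, `checks`, `name`, `message`, `subject`), but the values and message wording are invented for the example and are not captured braid output. The helper name `summarize` is likewise hypothetical.

```python
import json
from collections import defaultdict

# Illustrative sample only: field names follow the description above,
# but the values are invented for this example, not captured braid output.
SAMPLE_REPORT = """
{
  "status": "warn",
  "checks": [
    {"name": "config_file", "status": "ok", "message": "config is valid JSON"},
    {"name": "smart_self_test", "status": "ok", "subject": "disk1", "message": "passed"},
    {"name": "smart_self_test", "status": "warn", "subject": "disk2", "message": "no completed self-test"},
    {"name": "beep_path", "status": "skip", "message": "skipped"}
  ]
}
"""


def summarize(report_text):
    """Return the overall status and a dict mapping each status to check labels."""
    report = json.loads(report_text)
    by_status = defaultdict(list)
    for check in report["checks"]:
        label = check["name"]
        # `subject` is only present on per-drive checks, so test for it first.
        if "subject" in check:
            label = f"{label}[{check['subject']}]"
        by_status[check["status"]].append(label)
    return report["status"], dict(by_status)


overall, grouped = summarize(SAMPLE_REPORT)
print(overall)
for status, labels in grouped.items():
    print(f"{status}: {', '.join(labels)}")
```

In practice the text passed to `summarize` would come from `sudo braid doctor --json` rather than the inline sample.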

What it checks

Check	What it does
`config_file`	Config exists and is valid JSON
`config_schema`	Required fields present and deserializable
`config_permissions`	Canonical `/etc/braid/config.json` is not world-writable and is owned by root; custom `--config` paths skip this check
`declared_disks`	Every UUID-keyed `pool.json` member is present, is a block device, has a readable LUKS header, its live LUKS UUID matches the `pool.json` key, and, when the pool is mounted, is assembled into the live btrfs pool. Warn if a member is missing, is not a block device, has an unreadable LUKS header or probe failure, is present and identity-verified but not assembled into the live pool (`offline`), or the pool is mounted but its live topology cannot be probed to verify assembly; Fail if a member’s live LUKS UUID does not match its `pool.json` key.
`pool_missing_devices`	No missing devices in the live pool – both btrfs-authoritative MISSING devices and hot-unplugged members whose LUKS mapper is still open (`null_underlying`). Warn lists each missing devid and routes remediation by sub-state: a btrfs-MISSING devid gets a `braid replace` recommendation (the optional `--missing-id` cross-check names only btrfs-MISSING devids); a hot-unplugged member is guided to re-plug + `braid lock`/`braid unlock` to recover (or `braid unlock --allow-degraded` when another member is still missing), or to `braid lock`/`braid unlock --allow-degraded` to promote it to MISSING then `braid replace` if the disk is gone.
`enospc_risk`	Warns when the pool is one disk loss away from insufficient RAID1 chunk-pair space. The per-device threshold scales with pool size: min(1 GiB, 10% of total device bytes), matching the kernel’s effective data chunk size. Uses the same shared `evaluate_enospc_risk` predicate that `braid status` reports and that `braid monitor` raises proactively as a non-beeping Warning.
`foreign_luks_uuid`	Fail when the live (mounted) pool contains a btrfs device whose LUKS UUID is not declared in `pool.json` (a foreign disk). The message pairs each foreign UUID and its mapper with a paste-ready `btrfs device remove /dev/mapper/<mapper> <mount>` then `cryptsetup close <mapper>` recipe – the observed mapper name and pool mount point are substituted in, and multiple foreign disks each get their own recipe. Skipped when the pool is not mounted.
`data_profile_mismatch`	Data block groups all use the same RAID profile
`metadata_profile_mismatch`	Metadata block groups all use the same RAID profile
`system_profile_mismatch`	System block groups all use the same RAID profile
`metadata_enospc_pressure`	Warns when metadata is near the next allocation threshold and fewer than two RAID1 devices have enough unallocated space for the next metadata chunk
`paused_balance`	Warns if a btrfs balance is paused on the mounted pool (e.g. a prior balance interrupted by reboot, manual pause, or kernel pause) and suggests resuming with `btrfs balance resume <mount>`.
`smart_self_test`	One result per pool drive: runs `smartctl --json -A -l selftest <device>` against each – `<device>` is the member’s live backing device (e.g. `/dev/sda`) when it is assembled into the mounted pool, otherwise its persisted by-id path (pool offline, probe failed, or that member not currently assembled – e.g. missing or hot-unplugged on a degraded mount) – then reports `Fail` on an active SMART self-test failure, `Warn` if no completed test in the last 90 powered-on days (or never), `Ok` otherwise, or `Skip` for NVMe/SCSI/unsupported drives. In `--json`, every per-drive result carries `name: "smart_self_test"` and a `subject` field naming the pool member; if pool membership is missing or empty, a single `Skip` result with `name: "smart_self_test"` is emitted; if pool membership is corrupt or unreadable, a single `Warn` result with the same `name` is emitted instead. In both fallbacks the `subject` field is omitted. Scripts should check whether `subject` is present before keying on it.
`beep_path`	PC speaker alert beep is configured; with `--beep`, the alert beep command succeeds
`ups_daemon`	With UPS enabled, `upsc` is available and can query the UPS daemon. A missing `upsc`, or one that fails to spawn, is a failure; an unreachable daemon or a non-zero `upsc` exit is a warning
`braid_online_active`	With UPS enabled and the pool mounted, `braid-online.service` is active so shutdown unmounts the pool. Standalone CLI installs (no NixOS module) skip this – there is no `braid-online.service` to verify.
`mountpoint_immutable`	On `systemd_lifecycle` installs, verifies the offline-pool mountpoint seal – the immutable attribute braid sets on the mount point while the pool is unmounted, so a stray write fails with `EPERM` instead of landing on the root filesystem and being hidden once the pool mounts (see seal-mountpoint). Warn when the pool is offline but the mount point is mutable (it re-seals on the next boot or `nixos-rebuild switch`; run `braid seal-mountpoint` to re-seal now); Fail when the pool is mounted but the mount point inode is immutable (a live pool root must never be sealed – a tripwire for a bug or external interference). Ok when the offline mount point is sealed, or when the pool is mounted (the live filesystem governs writes). Skip on standalone CLI installs (`systemd_lifecycle` unset, no NixOS module), where the boot-time seal this check verifies is not present, or when the mount-state / immutable-attribute probe is indeterminate.
`wake_on_lan`	With auto-suspend enabled, `ethtool <interface>` reports magic-packet wake support and active `Wake-on: g`; disabled, unsupported, missing, or unparseable WoL state is a failure
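
The `smart_self_test` row describes a four-way classification per drive. The Python sketch below restates that rule as a decision function. It is not braid's implementation: the function name, its inputs, the order in which the conditions are tested, and the choice of a strict "greater than" at the 90-day boundary are all assumptions made for illustration.

```python
# 90 powered-on days expressed in power-on hours.
NINETY_POWERED_ON_DAYS_HOURS = 90 * 24


def classify_self_test(supported, active_failure, hours_since_last_completed):
    """Return 'skip', 'fail', 'warn', or 'ok' for one drive.

    hours_since_last_completed is the number of power-on hours since the most
    recent completed self-test, or None if no completed test was ever recorded.
    """
    if not supported:
        return "skip"  # NVMe/SCSI/unsupported drives
    if active_failure:
        return "fail"  # an active SMART self-test failure
    if hours_since_last_completed is None:
        return "warn"  # never tested
    if hours_since_last_completed > NINETY_POWERED_ON_DAYS_HOURS:
        return "warn"  # last completed test is older than 90 powered-on days
    return "ok"


print(classify_self_test(True, False, 30 * 24))
```

The age is measured in powered-on time rather than wall-clock time, which is why the input is a power-on-hours difference and not a date.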

Flags

Flag	Effect
`--json`	Machine-readable JSON output; never plays the alert beep test
`--beep`	Play the audible alert test beep; conflicts with `--json`

Exit codes

0 – no check failed (every check reported ok, warn, or skip)
1 – at least one check failed
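
A wrapper script can branch on these codes. The following Python sketch is one way to do that. The function name `doctor_exit_status` and the verdict strings are hypothetical, and codes other than 0 and 1 are treated as unknown because this page does not document them. The demonstration line runs a stand-in command so the sketch works without braid installed.

```python
import subprocess
import sys


def doctor_exit_status(argv):
    """Run the given command and map its exit code to a short verdict."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    if result.returncode == 0:
        return "healthy"  # every check was ok, warn, or skip
    if result.returncode == 1:
        return "failed"  # at least one check failed
    return "unknown"  # any other code is not documented on this page


# Stand-in command that exits with code 1, used here instead of braid itself.
print(doctor_exit_status([sys.executable, "-c", "raise SystemExit(1)"]))
```

On a real system the argument would be something like `["sudo", "braid", "doctor"]`. Note that warnings still yield exit code 0, so a script that needs to react to warnings should read the `--json` report instead.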

What happens under the hood

Reads and validates /etc/braid/config.json.
Loads UUID-keyed pool.json and probes each declared disk via cryptsetup isLuks and cryptsetup luksUUID.
If the pool is mounted, queries btrfs filesystem df and btrfs device usage --raw to check RAID profile consistency and metadata allocation headroom, probes for missing devices, reconciles each live pool member’s LUKS UUID against pool.json to flag foreign devices, and runs btrfs balance status to detect paused balances.
For each declared disk, runs smartctl --json -A -l selftest <device> – the member’s live backing device when it is assembled into the mounted pool, otherwise its persisted by-id path (including a member that is missing or unassembled on a degraded but mounted pool) – and parses the self-test log to detect active failures and report the age of the most recent passing entry. See ADR-024 for why present members are probed by live path rather than by-id.
If the braid monitor NixOS module is configured, reports the alert beep check as skipped by default.
With --beep, plays a short test beep through the canonical beep wrapper.
If UPS support is enabled, checks upsc and the mounted-pool braid-online.service shutdown hook.
On systemd_lifecycle installs, probes whether the pool mount point is mounted and whether its inode carries the immutable attribute, then classifies the pair: offline+mutable -> Warn with a re-seal hint, mounted+immutable -> Fail, and the healthy pairs -> Ok. Standalone CLI installs (systemd_lifecycle unset, no NixOS module) skip it – the boot seal exists only under the module. See ADR-028.
If auto-suspend is enabled, runs ethtool <interface> to verify runtime Wake-on-LAN state.
Aggregates results and prints a summary.
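
The mountpoint seal step above classifies a pair of probe results. The Python sketch below spells out that pairing as a small decision function. It is an illustration rather than braid's code: the function name and boolean inputs are hypothetical, and treating any indeterminate probe (represented here as `None`) as a skip is an assumption about how the two probes combine.

```python
def classify_mountpoint_seal(systemd_lifecycle, mounted, immutable):
    """Return 'skip', 'fail', 'warn', or 'ok' for the mountpoint seal check.

    mounted and immutable are True or False, or None when the corresponding
    probe could not determine the state.
    """
    if not systemd_lifecycle:
        return "skip"  # standalone CLI install: no boot-time seal to verify
    if mounted is None or immutable is None:
        return "skip"  # assumption: any indeterminate probe skips the check
    if mounted and immutable:
        return "fail"  # a live pool root must never be sealed
    if not mounted and not immutable:
        return "warn"  # offline but unsealed: stray writes would succeed
    return "ok"  # offline+sealed, or mounted+mutable


print(classify_mountpoint_seal(True, False, False))
```

Only the offline and mutable pair produces a warning, which matches the re-seal hint described above.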

See also

status – live pool health, disk usage, scrub status
monitor – automated health check for alerting
seal-mountpoint – set or clear the offline mountpoint seal that the mountpoint_immutable check verifies
