Decision: Resilient by Default

Principle: Resilient by default

Context

The OS lives on an internal SSD. Data drives are separate. Nothing about the data drives — bad config, dead drive, unplugged cable — should prevent the system from booting. The data pool is an external resource, like a network mount. The module tries to bring it up, but if it fails, the box is still a working Linux machine you can SSH into and fix.

Options considered

  1. Hard dependencies — LUKS required, mount required. Any failure blocks boot. Simple but means a dead drive = unreachable NAS.
  2. Degraded toggle — add an option like braid.allowDegraded = true. Default to hard failure, opt in to resilience. Adds complexity and a wrong default.
  3. Resilient by default: nofail, wants everywhere. Zero cost when healthy, graceful in every failure case. No toggle. Degraded mounts require explicit opt-in (--allow-degraded or autoUnlock.allowDegraded) to prevent silent zero-redundancy operation.

Decision

Option 3. Resilience is the default, not an option.

Implementation

LUKS unlock is strictly stage-2 — braid-unlock or braid-auto-unlock opens LUKS and mounts the pool. The module does not generate boot.initrd.luks.devices, data-pool fileSystems entries, or LUKS device declarations. The pool is brought online entirely by the CLI at runtime.

Resilience mechanisms:

  • No boot-blocking mount units: The module generates no data-pool fileSystems or LUKS entries. The CLI (braid unlock) opens LUKS and mounts btrfs directly with a plain mount call, so nothing referencing data drives can block boot. Mounting outside systemd also sidesteps the SYSTEMD_READY=0 udev quirk (systemd/systemd#36886): a missing btrfs member can mark surviving devices not-ready and stall a systemd-initiated mount — the exact failure resilience-by-default exists to prevent. Related coverage: tests/repro/udev-missing-disk-{io,idle}.py exercise udev events when a member disappears from an already-mounted pool, characterizing disappearance signals rather than the SYSTEMD_READY=0 mount-gating path. (The one build-time fileSystems entry is the optional autoUnlock USB-key mount at /run/braid-key/mnt, marked noauto/nofail so it never blocks boot and references the key device, not the pool.)
  • Degraded mount: Requires explicit --allow-degraded (or autoUnlock.allowDegraded for unattended use) — braid refuses to silently mount with zero redundancy.
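
The runtime path described above (open LUKS, then a plain mount call outside systemd) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not braid's actual implementation: the function names, the mapper name, and the device paths are hypothetical, and the commands are only constructed, never executed. The only external facts relied on are the standard `cryptsetup open` invocation and the btrfs `degraded` mount option.

```python
from typing import List, Optional


def luks_open_argv(device: str, mapper_name: str,
                   key_file: Optional[str] = None) -> List[str]:
    """Build a `cryptsetup open` command for one LUKS member (not executed here)."""
    argv = ["cryptsetup", "open", "--type", "luks"]
    if key_file is not None:
        # Unattended unlock would read the key from a file instead of prompting.
        argv += ["--key-file", key_file]
    return argv + [device, mapper_name]


def btrfs_mount_argv(mapped_device: str, mountpoint: str,
                     allow_degraded: bool = False) -> List[str]:
    """Build a plain `mount` command; `-o degraded` appears only on explicit opt-in."""
    argv = ["mount", "-t", "btrfs"]
    if allow_degraded:
        argv += ["-o", "degraded"]
    return argv + [mapped_device, mountpoint]
```

Because nothing here is a systemd mount unit, a missing member cannot gate boot; at worst the command fails at runtime and the machine stays reachable over SSH.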

Three-tier failure model

| Scenario | What happens | User sees |
|---|---|---|
| All drives healthy | Normal boot | Everything works |
| One drive dead | braid unlock refuses by default; user must pass --allow-degraded or configure autoUnlock.allowDegraded | Pool stays locked until explicit opt-in |
| All drives dead / no pool.json | braid unlock fails (no devices to probe) | System boots, SSH works, no /mnt/storage |
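
The three tiers can be read as a small decision function. The sketch below is illustrative only: `Outcome` and `unlock_outcome` are hypothetical names, and it assumes any partially present pool counts as "degraded". Whether a degraded mount then actually succeeds depends on the pool's redundancy profile, which this sketch does not model.

```python
from enum import Enum


class Outcome(Enum):
    MOUNT = "mount normally"
    MOUNT_DEGRADED = "mount degraded (explicit opt-in)"
    REFUSE = "refuse: degraded pool, no opt-in"
    FAIL = "fail: no devices to probe"


def unlock_outcome(configured: int, present: int,
                   allow_degraded: bool = False) -> Outcome:
    """Map member counts to one of the three tiers.

    configured: members listed in pool.json (0 if pool.json is missing).
    present:    configured members that were actually found.
    """
    if configured <= 0 or present <= 0:
        # Tier 3: nothing to probe; the host still boots without the pool.
        return Outcome.FAIL
    if present >= configured:
        # Tier 1: every configured member is available.
        return Outcome.MOUNT
    # Tier 2: some members are missing; never degrade silently.
    return Outcome.MOUNT_DEGRADED if allow_degraded else Outcome.REFUSE
```

Note that the opt-in flag only matters in the middle tier: it cannot turn a total failure into a mount, and it changes nothing when the pool is healthy.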

Identity enforcement

braid unlock uses authoritative pool membership from pool.json and probes only those configured members. --allow-degraded only bypasses degraded-mount refusal; it does not change which disks are considered pool members.
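
A rough sketch of that membership rule follows. The pool.json layout used here (a top-level "members" list of objects with a "uuid" field) is an assumption made for illustration, as is the function name `split_members`; the real schema may differ. The point it shows is that membership comes only from pool.json, so a visible disk that is not configured is ignored, and no degraded flag is an input at all.

```python
import json
from typing import Iterable, List, Tuple


def split_members(pool_json_text: str,
                  visible: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (present, missing) configured members, in pool.json order.

    Assumed (hypothetical) schema: {"members": [{"uuid": "..."}, ...]}.
    Identifiers in `visible` that pool.json does not list are never considered.
    """
    configured = [m["uuid"] for m in json.loads(pool_json_text)["members"]]
    visible_set = set(visible)
    present = [u for u in configured if u in visible_set]
    missing = [u for u in configured if u not in visible_set]
    return present, missing
```

For example, with members "aaa" and "bbb" configured and disks "bbb" and "zzz" visible, the result is `(["bbb"], ["aaa"])`: "zzz" is not treated as a pool member, with or without --allow-degraded.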

Constraint

This is not configurable. There is no braid.resilient option. Every braid deployment gets resilient boot.

See

  • modules/braid/storage.nix: braid-online.service, braid-pool.target
  • tests/module/ — module tests validate boot with all drives healthy
  • archive/plans/test-boot-degraded.md — original plan and research (preserved in git history; last present at commit 9df91f9)