Decision: Resilient by Default
Principle: Resilient by default
Context
The OS lives on an internal SSD. Data drives are separate. Nothing about the data drives — bad config, dead drive, unplugged cable — should prevent the system from booting. The data pool is an external resource, like a network mount. The module tries to bring it up, but if it fails, the box is still a working Linux machine you can SSH into and fix.
Options considered
- Hard dependencies: LUKS required, mount required. Any failure blocks boot. Simple, but a dead drive means an unreachable NAS.
- Degraded toggle: add an option like braid.allowDegraded = true. Default to hard failure, opt in to resilience. Adds complexity and a wrong default.
- Resilient by default: nofail, wants everywhere. Zero cost when healthy, graceful in every failure case. No toggle. Degraded mounts require explicit opt-in (--allow-degraded or autoUnlock.allowDegraded) to prevent silent zero-redundancy operation.
Decision
Option 3, resilient by default. Resilience is built in, not something the user opts into.
Implementation
LUKS unlock is strictly stage-2 — braid-unlock or braid-auto-unlock opens LUKS and mounts the pool. The module does not generate boot.initrd.luks.devices, data-pool fileSystems entries, or LUKS device declarations. The pool is brought online entirely by the CLI at runtime.
Resilience mechanisms:
- No boot-blocking mount units: The module generates no data-pool fileSystems or LUKS entries. The CLI (braid unlock) opens LUKS and mounts btrfs directly with a plain mount call, so nothing referencing data drives can block boot. Mounting outside systemd also sidesteps the SYSTEMD_READY=0 udev quirk (systemd/systemd#36886): a missing btrfs member can mark surviving devices not-ready and stall a systemd-initiated mount, which is the exact failure resilience-by-default exists to prevent. Related coverage: tests/repro/udev-missing-disk-{io,idle}.py exercise udev events when a member disappears from an already-mounted pool, characterizing disappearance signals rather than the SYSTEMD_READY=0 mount-gating path. (The one build-time fileSystems entry is the optional autoUnlock USB-key mount at /run/braid-key/mnt, marked noauto/nofail so it never blocks boot and references the key device, not the pool.)
- Degraded mount: Requires explicit --allow-degraded (or autoUnlock.allowDegraded for unattended use); braid refuses to silently mount with zero redundancy.
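The plain mount call described above might be assembled roughly as follows. This is a minimal sketch, not braid's actual code: the function name build_mount_argv and the mapper path are hypothetical, and it assumes that the opt-in translates to the btrfs `degraded` mount option.

```python
def build_mount_argv(device: str,
                     mountpoint: str = "/mnt/storage",
                     allow_degraded: bool = False) -> list[str]:
    """Build the argv for a direct (non-systemd) btrfs mount.

    Hypothetical helper: braid's real CLI may construct this differently.
    """
    argv = ["mount", "-t", "btrfs"]
    if allow_degraded:
        # Assumption: explicit opt-in maps to the btrfs "degraded" option.
        argv += ["-o", "degraded"]
    argv += [device, mountpoint]
    return argv


# Hypothetical mapper name, for illustration only.
healthy = build_mount_argv("/dev/mapper/braid-0")
degraded = build_mount_argv("/dev/mapper/braid-0", allow_degraded=True)
```

Because the command is run by the CLI rather than generated as a mount unit, no systemd unit exists that could wait on a missing member.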
Three-tier failure model
| Scenario | What happens | User sees |
|---|---|---|
| All drives healthy | Normal boot | Everything works |
| One drive dead | braid unlock refuses by default; user must pass --allow-degraded or configure autoUnlock.allowDegraded | Pool stays locked until explicit opt-in |
| All drives dead / no pool.json | braid unlock fails (no devices to probe) | System boots, SSH works, no /mnt/storage |
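The table can be read as a small decision function. The sketch below is illustrative only: the names Outcome and decide are hypothetical, and it simplifies by treating any partial set of members the same way as the "one drive dead" row.

```python
from enum import Enum


class Outcome(Enum):
    MOUNT = "mount normally"
    MOUNT_DEGRADED = "mount degraded after explicit opt-in"
    REFUSE = "refuse; pool stays locked"
    FAIL = "fail; no devices to probe"


def decide(expected: int, present: int, allow_degraded: bool = False) -> Outcome:
    """Map drive availability to the three tiers of the failure model.

    expected: members listed in the pool configuration (0 if no pool.json).
    present:  members actually found on this boot.
    """
    if expected == 0 or present == 0:
        # Tier 3: nothing to unlock; the system still boots and SSH works.
        return Outcome.FAIL
    if present == expected:
        # Tier 1: all drives healthy.
        return Outcome.MOUNT
    # Tier 2: some members missing; refuse unless the user opted in.
    return Outcome.MOUNT_DEGRADED if allow_degraded else Outcome.REFUSE
```

In every branch the outcome affects only the pool; none of them feeds back into whether the machine boots.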
Identity enforcement
braid unlock uses authoritative pool membership from pool.json and probes only those configured members. --allow-degraded only bypasses degraded-mount refusal; it does not change which disks are considered pool members.
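A minimal sketch of that membership rule follows. The pool.json schema shown here (a top-level "members" list of device paths) is an assumption made for illustration, as are the function names; the real file layout is not described in this document.

```python
import json


def configured_members(pool_json_text: str) -> list[str]:
    """Return the member list from pool.json content (hypothetical schema)."""
    try:
        data = json.loads(pool_json_text)
    except json.JSONDecodeError:
        return []  # unreadable config: nothing to probe
    if not isinstance(data, dict):
        return []
    return list(data.get("members", []))


def members_to_probe(pool_json_text: str,
                     visible_devices: set[str],
                     allow_degraded: bool = False) -> list[str]:
    """Select which visible devices are probed.

    allow_degraded is accepted but intentionally ignored here: it only
    relaxes the degraded-mount refusal, never the membership list.
    """
    members = configured_members(pool_json_text)
    return [m for m in members if m in visible_devices]


# Hypothetical device paths, for illustration only.
config = '{"members": ["/dev/disk/by-id/a", "/dev/disk/by-id/b"]}'
visible = {"/dev/disk/by-id/a", "/dev/disk/by-id/stranger"}
strict = members_to_probe(config, visible)
relaxed = members_to_probe(config, visible, allow_degraded=True)
```

An unrelated disk that happens to be attached is never probed, with or without the degraded opt-in.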
Constraint
This is not configurable. There is no braid.resilient option. Every braid deployment gets resilient boot.
See
- modules/braid/storage.nix: braid-online.service, braid-pool.target
- tests/module/: module tests validate boot with all drives healthy
- archive/plans/test-boot-degraded.md: original plan and research (preserved in git history; last present at commit 9df91f9)