Active – Supersedes 002-config-first-workflow.md. Refined by 024-luks-uuid-identity.md.
Decision: Runtime Disk Membership
Principle: CLI-owned membership
Context
The original design declared disk membership in braid.disks (NixOS config). Adding a drive required editing Nix config, running nixos-rebuild switch, then running braid add <name>. This was wrong: disk membership is operational state (“which drives are in my pool right now”), not system architecture (“what services should run on this machine”). Requiring a rebuild to add a drive added ceremony and created a category error — NixOS config is for declarative system shape, not mutable runtime state.
Decision
Move disk membership to a CLI-owned runtime state file. The NixOS module provides infrastructure (mount point, services, toolchain). The CLI owns which disks are in the pool.
State model
/var/lib/braid/pool.json — CLI-owned membership keyed by LUKS UUID:
{
  "disks": {
    "11111111-1111-1111-1111-111111111111": {
      "name": "toshiba",
      "by_id": "/dev/disk/by-id/ata-TOSHIBA_...",
      "devid": 1,
      "added_at": "2026-03-27T12:00:00Z"
    }
  }
}
The map key is the member’s persistent identity. The name field is the operator-facing disk name used in commands, mapper names, and labels; it is not the identity. by_id is the hardware address used to find the disk before it is opened. devid is live btrfs state captured after membership commits and is only a fallback binding key when btrfs reports a missing or null-underlying device by devid alone. added_at is historical state – once set on a member, it is preserved across all subsequent writes (unlock, recover, replace, add, etc.). These fields replace the former disk-map.json advisory file.
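To make the field roles concrete, here is a minimal Rust sketch of the membership map and the "added_at is preserved once set" rule. DiskMember and PoolMembership are names listed for cli/src/membership.rs; the field types, the upsert helper, and its signature are assumptions made for illustration, not the actual implementation.

```rust
use std::collections::BTreeMap;

/// One pool member. Field types are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct DiskMember {
    pub name: String,             // operator-facing name, not the identity
    pub by_id: String,            // hardware address used before the disk is opened
    pub devid: Option<u64>,       // live btrfs state, captured after membership commits
    pub added_at: Option<String>, // historical timestamp, never overwritten once set
}

/// Membership keyed by LUKS UUID, which is the persistent identity.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PoolMembership {
    pub disks: BTreeMap<String, DiskMember>,
}

impl PoolMembership {
    /// Hypothetical helper: insert or update a member while keeping any
    /// timestamp that is already recorded for that LUKS UUID.
    pub fn upsert(&mut self, luks_uuid: &str, mut incoming: DiskMember, now: &str) {
        let kept = self.disks.get(luks_uuid).and_then(|m| m.added_at.clone());
        incoming.added_at = kept
            .or(incoming.added_at.take())
            .or_else(|| Some(now.to_string()));
        self.disks.insert(luks_uuid.to_string(), incoming);
    }
}
```

Under these assumptions, a later write that fills in devid for the same UUID keeps the original added_at value rather than stamping a new one.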
/etc/braid/config.json — machine config (no disk information):
{ "mount_point": "/mnt/storage" }
Standalone CLI installs may keep this minimal shape. Module-generated configs
also include pool_access_group and systemd_lifecycle:
{
"mount_point": "/mnt/storage",
"pool_access_group": "storage",
"systemd_lifecycle": true
}
/var/lib/braid/pending-op.json — pending-operation journal (transient, present only during mutations).
Mutation ordering
All mutating commands follow the same order: validate; write pending-op.json with pre/target membership snapshots; perform the irreversible btrfs membership change; write pool.json to reflect the committed live membership; advance the journal to a post-maintenance phase; perform any required post-mutation maintenance; clear the journal.
pool.json reflects committed btrfs membership, not necessarily completion of follow-up maintenance such as RAID1 rebalance or resize. While pending-op.json exists, braid recover is responsible for replaying or completing any owed post-mutation work before clearing the journal when the balance state is safe to interpret. If owed RAID1 replay finds a paused, running, or unknown btrfs balance, recover fails closed and preserves the journal for manual inspection. Recovery in a post-maintenance phase must not rerun the primary btrfs membership command (device add, device remove, or replace start).
For add, membership commits when btrfs device add returns success; the post-add RAID1 balance is follow-up maintenance. For remove, membership commits when btrfs device remove returns success; writing pool.json before that would be wrong because btrfs still owns the device. For remove-missing, membership commits when btrfs device remove <devid> against the missing devid returns success; the post-remove soft balance that restores RAID1 redundancy for chunks created during degraded operation is follow-up maintenance. For replace, membership commits when btrfs replace start -B completes; the post-replace resize, and (for missing-path replacements that clear the last missing device) the soft balance, are follow-up maintenance.
The journal provides crash safety: if braid crashes mid-operation, the journal triggers recovery mode on next invocation. If a crash lands after pool.json was written but before the post-maintenance phase rewrite, braid recover detects the committed live topology, rewrites the journal to the post phase, and then finishes only the owed maintenance unless owed RAID1 replay finds a paused, running, or unknown balance state.
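The recovery rules above can be sketched as a small decision function. This is an illustration under stated assumptions: the Phase, BalanceState, and Plan types, the plan_recovery function, and its boolean inputs are all hypothetical names, and the handling of a primary change that never committed is operation-specific and deliberately left out.

```rust
/// Journal phase. `PoolMutation` appears in the ADR; `PostMaintenance` is an assumed name.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase {
    PoolMutation,
    PostMaintenance,
}

/// Observed btrfs balance state (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BalanceState {
    Idle,
    Paused,
    Running,
    Unknown,
}

#[derive(Debug, PartialEq)]
pub enum Plan {
    /// The primary change is not visible in the live pool; not modelled here.
    PrimaryNotCommitted,
    /// Finish only the owed maintenance, then clear the journal.
    /// `advance_journal` is true when the journal is still in the mutation phase.
    FinishMaintenance { advance_journal: bool },
    /// Balance state cannot be interpreted safely: keep the journal.
    FailClosed,
}

pub fn plan_recovery(
    phase: Phase,
    live_matches_target: bool,
    owes_raid1_replay: bool,
    balance: BalanceState,
) -> Plan {
    // A post-maintenance journal means the primary command already committed,
    // so it must never be rerun.
    let committed = phase == Phase::PostMaintenance || live_matches_target;
    if !committed {
        return Plan::PrimaryNotCommitted;
    }
    if owes_raid1_replay && balance != BalanceState::Idle {
        return Plan::FailClosed;
    }
    Plan::FinishMaintenance {
        advance_journal: phase == Phase::PoolMutation,
    }
}
```

The sketch never returns a plan that reruns device add, device remove, or replace start once the change is committed, which is the invariant the journal phases exist to protect.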
Recovery mode
When pending-op.json exists, braid enters recovery mode. Membership, mount, and key-enrollment commands (add, remove, remove-missing, replace, unlock, enroll, discover --write) hard-fail; read-only diagnostic and cleanup surfaces (status, doctor, lock, bare discover) stay available. braid recover is the only command that clears the journal: it opens LUKS devices, mounts the pool (with --allow-degraded if needed), and rebuilds or repairs membership from the live btrfs pool topology – not from LUKS label scanning, which could include labeled-but-never-added disks.
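The command gating in recovery mode could look roughly like the following. check_no_pending_operation is the guard named for cli/src/preflight.rs, but the signature, the Command enum, and the error text here are assumptions for the sketch.

```rust
/// Hypothetical command enum covering the surfaces named above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Command {
    Add,
    Remove,
    RemoveMissing,
    Replace,
    Unlock,
    Enroll,
    Discover { write: bool },
    Status,
    Doctor,
    Lock,
    Recover,
}

/// Hard-fail membership, mount, and key-enrollment commands while a journal exists.
pub fn check_no_pending_operation(cmd: Command, journal_present: bool) -> Result<(), String> {
    if !journal_present {
        return Ok(());
    }
    match cmd {
        Command::Add
        | Command::Remove
        | Command::RemoveMissing
        | Command::Replace
        | Command::Unlock
        | Command::Enroll
        | Command::Discover { write: true } => Err(
            "pending operation journal present; run `braid recover` first".to_string(),
        ),
        // Read-only diagnostics, cleanup, and recover itself stay available.
        Command::Status
        | Command::Doctor
        | Command::Lock
        | Command::Discover { write: false }
        | Command::Recover => Ok(()),
    }
}
```

Listing every variant explicitly, instead of using a wildcard arm, means a newly added command fails to compile until someone decides which side of the guard it belongs on.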
State contract
- pool.json is authoritative. unlock requires it. unlock enriches pool.json metadata (devid, added_at, and current by-id observations where appropriate) after mount via live btrfs state, but never changes membership (disk set).
- If pool.json is missing or corrupt, unlock and the mutating membership commands fail with a clear error directing the user to braid add or braid discover --write. braid lock – the user-facing command, the braid-online.service ExecStop reentry, and braid lock --dry-run – tolerates a missing or corrupt pool.json: it warns and proceeds with empty membership. The per-candidate cryptsetup luksUUID probe in build_close_sets_* (cli/src/lock.rs) is the fail-closed guard, so cleanup remains complete and correct. No lock pathway hard-fails on an unloadable pool.json; dry-run folds the warning into its stdout preview while the real paths emit it to stderr (see ADR 026).
- If pool.json is readable but stale (a member fails to probe), unlock warns and proceeds with the members it can probe. It never rewrites pool.json.
- If a member’s UUID key doesn’t match the probed device’s LUKS UUID, unlock fatally errors. This catches swapped, reformatted, or corrupted drives before any LUKS open or mount is attempted.
- Only these commands write pool.json membership: add, remove, replace, remove-missing, discover --write, recover.
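The stale-versus-mismatch distinction for unlock can be illustrated with a short sketch. The Probe and UnlockPlan types and the plan_unlock function are hypothetical; the sketch assumes each member has already been probed and only shows how the results are classified before any LUKS open.

```rust
use std::collections::BTreeMap;

/// Result of probing one member's device (hypothetical type).
#[derive(Debug)]
pub enum Probe {
    /// The device could not be probed (stale entry).
    Unreachable,
    /// The device answered with this LUKS UUID.
    LuksUuid(String),
}

#[derive(Debug, Default)]
pub struct UnlockPlan {
    pub open: Vec<String>,     // UUID keys that are safe to open
    pub warnings: Vec<String>, // stale members that were skipped
}

/// Classify every member first, so a mismatch aborts before anything is opened.
pub fn plan_unlock(members: &BTreeMap<String, Probe>) -> Result<UnlockPlan, String> {
    let mut plan = UnlockPlan::default();
    for (expected, probe) in members {
        match probe {
            Probe::Unreachable => plan
                .warnings
                .push(format!("member {expected} did not probe; skipping")),
            Probe::LuksUuid(found) if found == expected => plan.open.push(expected.clone()),
            Probe::LuksUuid(found) => {
                return Err(format!(
                    "LUKS UUID mismatch: pool.json has {expected}, device reports {found}"
                ))
            }
        }
    }
    Ok(plan)
}
```

Note that the function only reads the membership map; consistent with the contract, nothing in this path writes membership back.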
Recovery
Recovery is always explicit, never implicit:
- braid recover opens LUKS devices and mounts the pool if needed. Mount membership is phase-specific: existing-pool add and remove-missing pool-mutation phases mount from the pre-operation membership, add/remove-missing post phases and replace post-maintenance recovery mount from the committed target membership, and replace pool-mutation, bootstrap-add pool-mutation (empty pre-operation snapshot), and plain remove recovery mount from the admission membership (pre-operation snapshot plus target-only members, which for replace covers an in-flight dev_replace). This is the only path out of recovery mode (journal present). It probes actual pool topology, not LUKS labels. Each live member’s by_id is resolved at recovery time by walking /dev/disk/by-id/ and matching the symlink whose canonical target equals the live device’s backing kernel path – by_id is never copied from the journal snapshot, which can be stale if hardware enumeration changed since the mutation started. If no by-id symlink resolves to a live pool member, recovery hard-fails with an actionable remediation message rather than persisting a guess. When rebuilding pool.json, recover preserves each member’s added_at from the current pool.json if present, else from the journal’s pre/target membership snapshot; only members with no prior timestamp get a fresh now_iso() stamp. by_id, the UUID key, and devid remain live-derived or journal-verified according to the recovery phase. When the pool is already mounted by an external process (circumventing braid unlock’s pending-op preflight) and the journal records Replace::PoolMutation, recovery refuses and directs the operator to braid lock; braid recover so a fresh mount session can be opened and the relock cycle can clear any kernel-resumed-dev_replace staleness. Replace post-maintenance recovery is allowed on an already-mounted pool because the primary replace has already committed.
- braid discover scans /dev/disk/by-id/* for LUKS devices with braid-* labels. Displays what it finds. With --write, persists to pool.json. This is for initial setup recovery (lost pool.json), not for crash recovery.
- The normal path to create pool.json is braid add.
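The by_id resolution that braid recover performs (walk the by-id directory, match on canonical target) might be sketched as follows. resolve_by_id is a hypothetical function, and picking the lexicographically first link when several point at the same device is an assumption of this sketch, not documented behaviour.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return a symlink under `by_id_dir` whose canonical target equals the
/// canonical path of `live_device`, or `None` if no link resolves to it.
pub fn resolve_by_id(by_id_dir: &Path, live_device: &Path) -> io::Result<Option<PathBuf>> {
    let wanted = fs::canonicalize(live_device)?;
    let mut matches: Vec<PathBuf> = Vec::new();
    for entry in fs::read_dir(by_id_dir)? {
        let link = entry?.path();
        // Dangling links cannot be canonicalized; skip them.
        if let Ok(target) = fs::canonicalize(&link) {
            if target == wanted {
                matches.push(link);
            }
        }
    }
    // Assumption: choose deterministically when several links match.
    matches.sort();
    Ok(matches.into_iter().next())
}
```

A caller following the rule above would treat `Ok(None)` as a hard failure with a remediation message instead of falling back to the by_id recorded in the journal snapshot.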
CLI syntax
braid add takes name=by_id positional pairs:
braid add toshiba=/dev/disk/by-id/ata-TOSHIBA wd=/dev/disk/by-id/ata-WDC
braid replace --new takes the same format:
braid replace --old toshiba --new seagate=/dev/disk/by-id/ata-Seagate_NEW
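Parsing a name=by_id pair is straightforward; a minimal sketch follows. parse_disk_arg is a hypothetical helper, and the two validation rules (non-empty name, path under /dev/disk/by-id/) are assumptions about what the real CLI might check.

```rust
/// Split one positional `name=by_id` argument into its two parts.
pub fn parse_disk_arg(arg: &str) -> Result<(String, String), String> {
    let (name, by_id) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected name=by_id, got {arg:?}"))?;
    if name.is_empty() {
        return Err(format!("disk name is empty in {arg:?}"));
    }
    // Assumed rule: only stable by-id paths are accepted.
    if !by_id.starts_with("/dev/disk/by-id/") {
        return Err(format!("{by_id:?} is not under /dev/disk/by-id/"));
    }
    Ok((name.to_string(), by_id.to_string()))
}
```

Because split_once splits at the first '=', any later '=' characters stay in the by_id part.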
Lifecycle model
The NixOS module no longer generates data-pool fileSystems, LUKS entries, or btrfs-device-scan. Instead:
- braid-online.service: lifecycle owner (ExecStop=braid lock, RemainAfterExit=yes). Started by Rust dispatch via mark_online after a successful unlock, add, or recover that leaves the pool mounted, gated on systemd_lifecycle = true in runtime config.
- braid-pool.target: wants unlock only, does not start braid-online directly.
- Consumer services bind to mnt-storage.mount (auto-generated by systemd from /proc/mounts).
Rejected alternatives
- Keep braid.disks but make it optional: half-measure that leaves two sources of truth. Users would be confused about which one matters.
- Auto-discover on unlock: makes unlock a mutation command. If discovery finds the wrong devices (e.g., a test disk with a braid-* label), the pool is corrupted silently. Explicit membership is safer.
- Store membership in btrfs metadata: btrfs doesn’t have a user-data field on devices. Would require a convention (e.g., subvolume with a JSON file), adding fragility and a chicken-and-egg problem for unlock.
Consequences
- Adding a drive is one command: braid add name=/dev/disk/by-id/.... No nixos-rebuild.
- pool.json must exist before unlock can run. First-time setup: braid add creates it.
- braid discover --write is the explicit recovery path for lost/corrupt pool.json.
- The NixOS module’s braid.disks option is removed entirely.
See
- cli/src/membership.rs: load/save/validate membership, DiskMember, PoolMembership, enrich_from_pool_state, foreign_luks_uuids (pure helper consumed by braid doctor’s foreign_luks_uuid check)
- cli/src/journal.rs: pending-operation journal (pre/target membership snapshots)
- cli/src/recover.rs: rebuild membership from live pool state
- cli/src/preflight.rs: check_no_pending_operation recovery mode guard
- cli/src/discover.rs: LUKS label scanning
- modules/braid/storage.nix: braid-online.service, no data-pool fileSystems
- modules/braid/options.nix: no braid.disks