btrfs dev_replace resume-on-mount and the recover relock cycle
Background
A btrfs replace interrupted mid-flight by an unclean crash leaves the
on-disk dev_replace_item in STARTED. On the next mount, the kernel sees
that state and resumes the replace from the on-disk cursor.
For braid, this matters during braid recover: the command may be the first
thing to mount the pool after the crash, so it is also the thing that triggers
the kernel’s resume-on-mount path.
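As a rough illustration of the on-disk state described above, the following Rust sketch models the replace states and picks out the one a crash leaves behind. The type and method names are hypothetical and are not taken from braid. The numeric values follow my reading of the kernel's uapi header and should be verified against the kernel you target.

```rust
/// Replace states as recorded in the on-disk dev_replace item.
/// ASSUMPTION: numeric values mirror the kernel uapi constants; verify them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplaceState {
    NeverStarted = 0,
    Started = 1,
    Finished = 2,
    Canceled = 3,
    Suspended = 4,
}

impl ReplaceState {
    /// Decode a raw state value. Unknown values yield None so that a
    /// caller can fail closed instead of guessing.
    pub fn from_raw(raw: u64) -> Option<Self> {
        match raw {
            0 => Some(Self::NeverStarted),
            1 => Some(Self::Started),
            2 => Some(Self::Finished),
            3 => Some(Self::Canceled),
            4 => Some(Self::Suspended),
            _ => None,
        }
    }

    /// True for the state this note is about: the item still says
    /// STARTED because the machine died while the replace was running.
    pub fn interrupted_mid_flight(self) -> bool {
        self == Self::Started
    }
}
```

A caller that reads the raw state from disk could use `from_raw` and then `interrupted_mid_flight` to decide whether the next mount is expected to trigger the kernel's resume path.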
Kernel resume-on-mount behavior
btrfs_resume_dev_replace_async runs as a detached kthread. umount does not
wait for that worker.
The worker commits the post-completion devid swap to disk correctly, but it
does not update the in-memory btrfs_fs_devices for the mount session that
triggered the resume. A probe taken from that session reads stale topology: a
phantom MISSING devid 0 plus both the source and target devices. In the
captured failure, that meant five device entries and braid status reporting
DEGRADED even though every disk was online.
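The stale-topology signature above can be expressed as a small predicate. This is a minimal sketch, not braid's actual probe code: `DeviceEntry`, its fields, and `looks_like_stale_replace_topology` are hypothetical names, and the check only encodes the two symptoms this note reports (a MISSING devid 0 entry and more entries than the pool has members).

```rust
/// One device entry as reported by a topology probe.
/// Hypothetical shape for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceEntry {
    pub devid: u64,
    pub missing: bool,
}

/// Returns true when a probe shows the signature described in this note:
/// a phantom MISSING devid 0 together with more entries than the pool
/// is expected to have (source and target both still listed).
pub fn looks_like_stale_replace_topology(
    entries: &[DeviceEntry],
    expected_members: usize,
) -> bool {
    let has_phantom = entries.iter().any(|d| d.devid == 0 && d.missing);
    has_phantom && entries.len() > expected_members
}
```

With five entries, one of them a MISSING devid 0, and an expected membership of three, the predicate returns true. A clean three-entry probe returns false. The devids and member count used in any such check are illustrative and would come from braid's own pool metadata.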
Why the LUKS close+reopen is load-bearing
The important empirical result is narrower than “remount after replace”:
umount + btrfs device scan --forget + remount is not enough if the dm devices
stay alive. That cycle can leave the cached fs_devices attached to the live dm
devices, so the next probe still sees the stale topology from the original
resume-triggering mount.
Only tearing down and recreating the dm devices forces the kernel to re-read the
chunk tree from disk and build a fresh fs_devices that reflects the
post-resume on-disk state.
What braid recover does
Recover splits this into two explicit work actions:
RecoverWorkAction::WaitForKernelReplace and
RecoverWorkAction::RemountCycle.
First, cli/src/recover.rs#wait_for_kernel_replace_to_finish polls
btrfs replace status until the kernel reports Finished or no replace is in
progress. The wait while the status is Running is intentionally unbounded:
interrupting the kernel worker would only recreate the interrupted-replace
state that recover exists to resolve. Suspended or unparseable output fails
closed and preserves the journal.
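The polling policy above can be sketched as follows. This is not the body of wait_for_kernel_replace_to_finish. `ReplaceStatus`, `wait_for_replace`, and the probe closure are hypothetical, and the sketch assumes the btrfs replace status output has already been parsed into one of these variants.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Parsed result of one status probe (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReplaceStatus {
    Running,
    Finished,
    NotInProgress,
    Suspended,
    Unparseable(String),
}

/// Polls `probe` until the replace is done.
/// - Finished / NotInProgress: success.
/// - Running: keep waiting, with no upper bound on the number of polls.
/// - Suspended / Unparseable: fail closed; the caller keeps its journal.
pub fn wait_for_replace<F>(mut probe: F, interval: Duration) -> Result<(), String>
where
    F: FnMut() -> ReplaceStatus,
{
    loop {
        match probe() {
            ReplaceStatus::Finished | ReplaceStatus::NotInProgress => return Ok(()),
            ReplaceStatus::Running => sleep(interval),
            ReplaceStatus::Suspended => {
                return Err("replace is suspended; refusing to continue".to_string())
            }
            ReplaceStatus::Unparseable(raw) => {
                return Err(format!("unparseable replace status: {}", raw))
            }
        }
    }
}
```

Passing the probe as a closure keeps the loop testable without a real filesystem: a test can feed it a scripted sequence of statuses, while production code would wrap the actual status command.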
Then cli/src/recover.rs#relock_and_remount runs the full relock cycle:
umount, btrfs device scan --forget, close the LUKS membership union, reopen
the pool through the standard plan_open_pool flow, and remount through the
standard executor. The second mount sees the completed on-disk replace with a
fresh fs_devices.
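The ordering of the relock cycle, and the reason a plain remount is not equivalent, can be modelled as data. This is a minimal sketch under the assumption that each step maps to one action. `RelockStep`, `relock_cycle`, and `recreates_dm_devices` are hypothetical names and do not reflect how relock_and_remount is actually structured.

```rust
/// Steps of the relock cycle, in the order this note describes them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelockStep {
    Umount,
    ScanForget,
    CloseLuks,
    ReopenPool,
    Mount,
}

/// The full cycle: unmount, forget scanned devices, close the LUKS
/// mappings, reopen the pool, mount again.
pub fn relock_cycle() -> [RelockStep; 5] {
    [
        RelockStep::Umount,
        RelockStep::ScanForget,
        RelockStep::CloseLuks,
        RelockStep::ReopenPool,
        RelockStep::Mount,
    ]
}

/// True only if the dm devices are closed and then reopened between the
/// unmount and the final mount. Per this note, that is the condition
/// under which the second mount builds a fresh fs_devices.
pub fn recreates_dm_devices(steps: &[RelockStep]) -> bool {
    let pos = |s: RelockStep| steps.iter().position(|x| *x == s);
    match (
        pos(RelockStep::Umount),
        pos(RelockStep::CloseLuks),
        pos(RelockStep::ReopenPool),
        pos(RelockStep::Mount),
    ) {
        (Some(u), Some(c), Some(r), Some(m)) => u < c && c < r && r < m,
        _ => false,
    }
}
```

The full cycle satisfies the predicate, while the shorter umount, scan --forget, mount sequence does not, which matches the empirical result recorded earlier in this note.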
Coverage
tests/repro/btrfs-replace-interrupted-mid-flight.py pins the unclean-kill
path end-to-end. It starts a real braid replace, kills the VM mid-flight,
boots again, runs braid recover, and asserts that the resumed replace drains,
pool.json swaps in the new disk, the old disk is evicted, and a later
braid lock; braid unlock cycle stays clean.
Path B: v6.19+ freeze/signal cancellation
The unclean-kill repro does not cover the v6.19+ freeze and signal cancellation
path inside the btrfs replace worker loop. An unclean kernel kill bypasses the
in-loop try_to_freeze and fatal_signal_pending checks entirely.
A sibling repro test will be needed once kernel >= 6.19 reaches NixOS stable. Its
sequencing depends on whether braid replace should inhibit suspend for the
operation’s duration; that policy question is orthogonal to the crash path this
note documents.