ENOSPC vs hang: reproducing btrfs device remove failures in VMs
Background
btrfs device remove missing has two failure modes when surviving devices
lack space for relocation. Both are bad, but the second is catastrophic.
Failure mode 1: instant ENOSPC
Conditions: surviving devices have zero (or near-zero) unallocated space.
btrfs can’t even begin relocating block groups. It fails immediately:
ERROR: error removing device 'missing': No space left on device
Filesystem stays healthy and writable. Annoying but recoverable.
How to reproduce in a VM: 3×512MiB disks, fill to 100% capacity, kill
one disk. btrfs device remove missing fails in under a second.
Failure mode 2: partial relocation → transaction abort → forced read-only
Conditions: surviving devices have SOME unallocated space (hundreds of MiB) but not enough to relocate ALL block groups from the dead device.
btrfs starts relocating, successfully moves some block groups (consuming the free space), then hits ENOSPC mid-transaction on a subsequent block group. The transaction abort forces the entire filesystem read-only:
BTRFS info: relocating block group 4761583616 flags data|raid1
BTRFS info: found 20 extents, stage: move data extents
BTRFS info: found 20 extents, stage: update data pointers
BTRFS info: relocating block group 3419406336 flags metadata|raid1
BTRFS: Transaction aborted (error -28)
BTRFS: error in __btrfs_free_extent: errno=-28 No space left
BTRFS info: forced readonly
The error reported to the user is “Read-only file system”; the underlying ENOSPC appears only in dmesg. The filesystem is effectively destroyed for the running system: it is forced read-only and needs a remount or a reboot to recover.
On real hardware with slow USB drives, btrfs doesn’t crash quickly — it
spends hours doing I/O, throttled by writeback queuing (wbt_wait), retrying
the same block groups before eventually aborting. In a VM with fast virtual
disks, the same sequence completes in ~40 seconds.
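Because the user-visible error differs between the two modes, a test has to look at both the command's stderr and the kernel log to tell them apart. The following is a minimal sketch of that classification, using only the message strings quoted above. The function and enum names are hypothetical, and it assumes `dmesg` holds only the kernel lines logged since the command started.

```python
from enum import Enum


class RemoveOutcome(Enum):
    SUCCESS = "success"
    INSTANT_ENOSPC = "failure mode 1"
    ABORTED_READONLY = "failure mode 2"
    UNKNOWN = "unknown"


def classify_remove_outcome(returncode: int, stderr: str, dmesg: str) -> RemoveOutcome:
    """Map the result of `btrfs device remove missing` to a failure mode."""
    # Check the kernel log first: in mode 2 the command only reports
    # "Read-only file system", so stderr alone would hide the ENOSPC.
    if "Transaction aborted (error -28)" in dmesg or "forced readonly" in dmesg:
        return RemoveOutcome.ABORTED_READONLY
    if returncode == 0:
        return RemoveOutcome.SUCCESS
    # Mode 1: the command itself reports ENOSPC and the filesystem stays writable.
    if "No space left on device" in stderr:
        return RemoveOutcome.INSTANT_ENOSPC
    return RemoveOutcome.UNKNOWN
```

Checking the kernel log before stderr is deliberate: a mode-2 abort is the more severe outcome and should never be misreported as a plain command failure.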
What makes the difference
The variable is whether btrfs can begin relocating:
| Free space on survivors | btrfs behavior | Outcome |
|---|---|---|
| ~0 | Can’t start → instant ENOSPC | Filesystem OK |
| Some but not enough | Starts, partially succeeds, then ENOSPC mid-transaction | Filesystem destroyed (forced read-only) |
| Enough | Completes relocation | Success |
The dangerous middle case is the one that happened in the real incident (3×8GiB USB drives, ~80% full, one died).
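The three rows of the table can be approximated with a small model. This is a deliberately coarse sketch, not btrfs's real allocator: it assumes two-copy RAID1, treats the bytes allocated on the dead device as the amount of mirrored space the survivors must provide, and uses a hypothetical `min_chunk` of 256MiB as the threshold below which relocation cannot start at all.

```python
from typing import Sequence

MIB = 1 << 20


def mirrored_capacity(unallocated: Sequence[int]) -> int:
    """Bytes of two-copy RAID1 data that fit in the given unallocated space."""
    if len(unallocated) < 2:
        return 0
    total = sum(unallocated)
    # Each byte needs a copy on two different devices, so the largest device
    # can only be paired against the space available on all the others.
    return min(total // 2, total - max(unallocated))


def predict_outcome(
    survivor_unallocated: Sequence[int],
    bytes_on_dead_device: int,
    min_chunk: int = 256 * MIB,  # assumed smallest useful block group
) -> str:
    """Return which row of the table a pool is expected to land in."""
    capacity = mirrored_capacity(survivor_unallocated)
    if capacity >= bytes_on_dead_device:
        return "success"
    if capacity < min_chunk:
        return "instant ENOSPC"
    return "partial relocation, then abort"
```

With two survivors the mirrored capacity is simply the smaller of the two unallocated values, which is why a single nearly full survivor is enough to push the pool into the instant-ENOSPC row.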
How braid avoids this
braid’s mutation preflight refuses these removals — before the pending-op
journal is written — whenever it can prove the survivors lack the space to
absorb the target’s allocations. The degraded failure-mode-2 path is fully
guarded: remove-missing and the 2→1 eviction are fail-closed, so an
operator using braid does not reach the catastrophic path above. The
healthy >=2-survivor case is intentionally warn-and-proceed on an
unprovable check, because it falls through to btrfs’s clean failure mode
1, never the mode-2 abort. Per path:
- remove-missing: the degraded failure-mode-2 scenario exactly. Computes RAID1 chunk-pair capacity on the survivors and refuses when it is below the chunks allocated on the missing device. Fail-closed: any probe or parse uncertainty also refuses (cli/src/preflight.rs::check_raid1_relocation_space, wired in cli/src/remove_missing.rs).
- remove evicting to a single survivor (2→1): RAID1 no longer applies, so braid instead checks the lone survivor can hold the post-conversion data + 2 × metadata + 2 × system (single + DUP profile). Fail-closed (cli/src/preflight.rs#check_single_survivor_capacity). Enforced at plan time and re-validated as a pre-journal gate in cli/src/remove.rs#RemovePlan::execute, closing the drift window where the pool keeps taking writes while the operator idles at the confirmation prompt; an over-committed survivor is then refused before the irreversible -f balance, still with no pending-op.json stranded.
- remove with >=2 survivors (healthy): same RAID1 relocation check, but warn-and-proceed on probe/parse uncertainty. A best-effort miss here falls through to btrfs device remove, which hits the clean failure mode 1 (instant ENOSPC), not the failure-mode-2 abort, so the filesystem stays intact.
- replace is not subject to this failure mode. btrfs replace rebuilds onto the new disk instead of relocating onto survivors; its preflight refuses a new disk smaller than the one being replaced (cli/src/preflight.rs::check_replace_target_capacity).
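The per-path policy above can be illustrated with a short sketch. This is not braid's Rust implementation: every function name, parameter and return shape here is hypothetical, and the greedy pairing (always take the two devices with the most unallocated space) is an assumption about how chunk pairs are counted. Only the refuse-versus-warn policy and the `data + 2 × metadata + 2 × system` formula come from the text.

```python
from typing import Optional, Sequence, Tuple


def raid1_chunk_pairs(unallocated: Sequence[int], chunk_size: int) -> int:
    """Count RAID1 chunk pairs that fit, greedily pairing the two devices
    with the most unallocated space (an assumed allocation order)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    free = list(unallocated)
    pairs = 0
    while True:
        free.sort(reverse=True)
        if len(free) < 2 or free[1] < chunk_size:
            return pairs
        free[0] -= chunk_size
        free[1] -= chunk_size
        pairs += 1


def check_raid1_relocation(
    survivor_unallocated: Optional[Sequence[int]],  # None means the probe failed
    chunks_on_target: Optional[int],                # None means the parse failed
    chunk_size: int,
    fail_closed: bool,
) -> Tuple[bool, str]:
    """Return (proceed, reason) for a removal that relocates onto survivors."""
    if survivor_unallocated is None or chunks_on_target is None:
        if fail_closed:
            return (False, "refuse: cannot prove survivors have enough space")
        return (True, "warn: space check inconclusive, proceeding")
    if raid1_chunk_pairs(survivor_unallocated, chunk_size) < chunks_on_target:
        return (False, "refuse: survivors cannot absorb the target's chunks")
    return (True, "ok")


def single_survivor_fits(capacity: int, data: int, metadata: int, system: int) -> bool:
    """2->1 eviction: data becomes single, metadata and system become DUP."""
    return data + 2 * metadata + 2 * system <= capacity
```

In this sketch remove-missing would call `check_raid1_relocation(..., fail_closed=True)` and the healthy >=2-survivor remove would pass `fail_closed=False`, so the only difference between the two paths is how an inconclusive probe is treated.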
braid status and braid doctor surface a proactive advisory
(cli/src/capacity.rs::enospc_risk_advisory) one disk-loss before a pool
enters this danger zone.
The policy and its rationale are owned by ADR 012’s “ENOSPC pre-flight
check” section (docs/design/decisions/012-intent-cli.md). See also
docs/commands/remove-missing.md and the braid status ENOSPC advisory
(docs/commands/status.md).
Reproducing the hang/crash in a VM
The tricky part is getting btrfs to land in the “some but not enough” zone. Two challenges:
1. btrfs allocates unevenly across devices
Writing 2GiB of data to a 3-device RAID1 pool doesn’t give you an even ~1.33GiB allocated per device (RAID1 stores two copies, so 2GiB of data is 4GiB raw). btrfs allocates block groups in pairs (for RAID1), and the pair selection isn’t perfectly balanced. In testing:
disk1: Unallocated 1.00MiB ← nearly full
disk2: Unallocated 288.88MiB ← some room
With one device at ~0 free, btrfs can’t relocate anything there → instant ENOSPC (failure mode 1). To get failure mode 2, BOTH survivors need meaningful free space.
2. Block group granularity
btrfs allocates space in block groups (256MiB on small devices, 1GiB on
large ones). A single dd write of 200MiB might or might not trigger a new
block group allocation. Writing in smaller chunks (50MiB) gives btrfs more
allocation decisions, improving the chance of even distribution.
Working recipe (what the test does)
- Use 4GiB disks: large enough for btrfs to create multiple data block groups per device, giving room for partial relocation.
- Adaptive fill with small chunks: write 50MiB at a time, check btrfs device usage --raw after each write, stop when the minimum unallocated across all online devices drops below 800MiB. This targets the sweet spot: both survivors have 300-800MiB free.
- Use --raw for parsing: btrfs device usage displays values in human units (MiB, GiB) depending on magnitude. --raw gives bytes, avoiding unit-parsing bugs.
- Kill disk3, mount degraded, attempt btrfs device remove missing: btrfs starts relocating, succeeds on one block group (~38s of I/O in VM), then crashes on the next with transaction abort.
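The adaptive fill in the recipe can be sketched as follows. This is an illustration rather than the test's actual code: the function names are hypothetical, and the parser assumes that `btrfs device usage --raw` prints a `<device>, ID: <n>` header per device followed by an `Unallocated:` line in bytes. The command and the write step are passed in as callables so the loop can be exercised without a real pool.

```python
import re
from typing import Callable, Dict

MIB = 1 << 20

# Assumed output shape: "/dev/vdb, ID: 1" then "   Unallocated:   <bytes>".
DEVICE_RE = re.compile(r"^(\S+), ID: (\d+)\s*$")
UNALLOCATED_RE = re.compile(r"^\s*Unallocated:\s+(\d+)\s*$")


def parse_unallocated(raw_output: str) -> Dict[str, int]:
    """Map each device path to its unallocated byte count."""
    result: Dict[str, int] = {}
    current = None
    for line in raw_output.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        value = UNALLOCATED_RE.match(line)
        if value and current is not None:
            result[current] = int(value.group(1))
    return result


def adaptive_fill(
    read_usage: Callable[[], str],       # returns `btrfs device usage --raw` output
    write_chunk: Callable[[int], None],  # writes one 50MiB chunk, given its index
    threshold: int = 800 * MIB,
    max_chunks: int = 1000,
) -> int:
    """Write chunks until the emptiest device drops below the threshold.

    Returns the number of chunks written.
    """
    for n in range(max_chunks):
        unallocated = parse_unallocated(read_usage())
        if not unallocated:
            # Parsing nothing must be an error, not a reason to stop quietly.
            raise RuntimeError("no Unallocated values parsed")
        if min(unallocated.values()) < threshold:
            return n
        write_chunk(n)
    raise RuntimeError("threshold not reached within max_chunks")
```

In a real run `read_usage` would wrap something like `subprocess.run(["btrfs", "device", "usage", "--raw", mountpoint], ...)` and `write_chunk` would write a 50MiB file; both are left to the caller here. Raising on an empty parse result is a design choice that keeps a parsing bug from looking like a finished fill.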
What didn’t work
- 512MiB disks filled to 100%: instant ENOSPC (failure mode 1). No free space for btrfs to even begin.
- 2GiB disks with 200MiB write chunks: uneven allocation left one survivor with 1MiB free → instant ENOSPC again.
- 2GiB disks with adaptive fill: same uneven allocation problem. Not enough total capacity for btrfs to distribute block groups evenly across 3 device pairs.
- Parsing btrfs device usage without --raw: values display as MiB or GiB depending on size. On fresh 4GiB disks, unallocated shows as GiB; a regex matching only MiB found zero values → fill loop stopped immediately.
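The unit-parsing failure in the last item is easy to reproduce in isolation. The snippet below is a reconstruction for illustration only: the MiB-only pattern is a guess at the kind of regex involved, not the original one, and the sample output lines are assumed shapes of the human-readable and `--raw` formats.

```python
import re

# Hypothetical MiB-only pattern of the kind that caused the bug.
MIB_ONLY = re.compile(r"Unallocated:\s+([\d.]+)MiB")
# Byte-count pattern for --raw output: digits only, up to the end of the line.
RAW_BYTES = re.compile(r"Unallocated:\s+(\d+)\s*$", re.MULTILINE)

# Assumed sample output for two mostly empty 4GiB disks.
human_output = """\
/dev/vdb, ID: 1
   Unallocated:               3.00GiB
/dev/vdc, ID: 2
   Unallocated:               3.00GiB
"""

raw_output = """\
/dev/vdb, ID: 1
   Unallocated:            3221225472
/dev/vdc, ID: 2
   Unallocated:            3221225472
"""

# The MiB-only pattern silently matches nothing once values are shown in GiB.
mib_matches = MIB_ONLY.findall(human_output)

# With --raw every value is a plain byte count, so one pattern covers all sizes.
raw_values = [int(v) for v in RAW_BYTES.findall(raw_output)]
```

Here `mib_matches` is an empty list, which is the "found zero values" condition described above, while `raw_values` holds one byte count per device.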
Test files
- tests/repro/btrfs-remove-enospc.nix/.py: failure mode 1 (instant ENOSPC, 3×512MiB)
- tests/repro/btrfs-remove-enospc-crash.nix/.py: failure mode 2 (partial relocation crash, 3×4GiB)
These are repro tests that document actual btrfs behavior, not TDD tests.
They assert the real outcomes: instant ENOSPC with surviving filesystem, or
transaction abort with forced read-only. They invoke raw btrfs device remove missing rather than braid precisely because braid’s preflight (see
“How braid avoids this” above) refuses the operation under these conditions —
reproducing the unguarded btrfs behavior requires bypassing it. They live in
tests/repro/ — a folder reserved for tests that reproduce real-world
scenarios for our records.