Testing notes
Test conventions and NixOS VM test framework reference for braid. The short three-bullet preamble contract (Intent / Why it exists / Scenario) lives in AGENTS.md at the repo root; everything else – the literal preamble form, the flake.nix registration rule, framework gotchas, and patterns – is here. For the lifecycle test suite see tests/module/systemd-lifecycle.py. For the Rust-level TUI view snapshot tests (insta-based, run via just test-rust), see tui-snapshots.md.
Conventions
Preamble: literal // line-comment form
Every test’s preamble is a contiguous block of // line comments directly above the test item. It has three parts:
- Intent — what behavior this test verifies (or tries to verify)
- Why it exists — what risk/regression this protects against
- Scenario — the real-world user/system story this models, especially the concrete bug or incident that inspired the test
// Intent: one-line statement of the behavior verified.
// Why it exists: the regression risk this protects against, ideally with
// reference to the incident or commit that prompted it.
// Scenario: the concrete real-world sequence the test models.
#[test]
fn the_test() { ... }
New VM tests must register in flake.nix
just test-vm and just test-all build whatever is registered under checks.<system> in flake.nix – there is no default per-test list in the justfile. When adding a new tests/cli/*.nix or tests/module/*.nix, also add a matching pkgs.testers.nixosTest (import ./tests/cli/<name>.nix { braid = linuxCrane.braid; }) entry to flake.nix. An unregistered test sits in the tree but never runs under nix flake check.
VM-test framework gotchas
just test-repro requires the full repro- prefix
just test-repro <name> and just test-vm <name> pass the test name verbatim to nix as the final attribute selector. The reproChecks flake output is built with filterAttrs, which selects the repro- entries without renaming them, so the attribute name passed to just test-repro must be exactly the name in flake.nix, prefix and all.
# correct
just test-repro repro-btrfs-replace-interrupted-mid-flight
# wrong -- fails with "flake ... does not provide attribute ... reproChecks.aarch64-darwin.btrfs-replace-interrupted-mid-flight"
just test-repro btrfs-replace-interrupted-mid-flight
The checks set used by just test-vm excludes the repro- entries, so test-vm test names carry no prefix (e.g. cli-recover-replace-completed).
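The naming rule can be modelled in a few lines of Python. This is only an illustration of the selection logic described above, not the Nix code in flake.nix. The single shared registry, the placeholder values, and the function names are all assumptions made for the sketch.

```python
# Illustration only: models how the two attribute sets described above could be
# derived from one registry. The real logic is Nix code in flake.nix.
ALL_CHECKS = {
    "repro-btrfs-replace-interrupted-mid-flight": "<derivation>",
    "cli-recover-replace-completed": "<derivation>",
}


def repro_checks(checks):
    """Keep only repro-* entries; names are left unchanged, prefix included."""
    return {name: drv for name, drv in checks.items() if name.startswith("repro-")}


def vm_checks(checks):
    """Drop repro-* entries; the remaining names never carried the prefix."""
    return {name: drv for name, drv in checks.items() if not name.startswith("repro-")}


def select(attrs, name):
    """Mimic verbatim attribute selection: no prefix is added or removed."""
    if name not in attrs:
        raise KeyError("does not provide attribute " + name)
    return attrs[name]
```

Under this model, select(repro_checks(ALL_CHECKS), "btrfs-replace-interrupted-mid-flight") raises, because the only key present includes the repro- prefix.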
NixOS test driver wraps every command with set -euo pipefail
The driver auto-prepends set -euo pipefail to every machine.succeed / machine.execute command before sending it to the VM. This is invisible from the test script but has real consequences for chained commands.
Symptom: A chain like ... ; wait $pid_loser ; echo $? > /tmp/exit-a ; ... silently aborts when wait returns non-zero. The exit-code file is never written, and the next subtest assertion fails with cat: /tmp/exit-a: No such file or directory – pointing at the wrong layer.
Idiom for capturing a non-zero exit without aborting:
ec_a=0 ; wait $pid_a || ec_a=$? ; echo $ec_a > /tmp/exit-a
The || consumes the non-zero into the variable, so errexit does not fire. Works for any command whose non-zero exit is expected (wait, grep, diff, etc.). This matters most in concurrent-process tests where one process is expected to exit non-zero (fail-fast lock contention, expected error paths).
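The difference between the failing chain and the idiom can be reproduced outside a VM. The sketch below assumes bash is on PATH and uses a hypothetical run_like_driver helper that prepends the same strict-mode prefix. It does not use the NixOS test driver itself, and (exit 3) stands in for a wait on a process that exited with status 3.

```python
import subprocess

# Assumption: mimics the prefix this document says the driver prepends.
DRIVER_PREFIX = "set -euo pipefail; "


def run_like_driver(command):
    """Run a command under bash with the strict-mode prefix (hypothetical helper)."""
    return subprocess.run(
        ["bash", "-c", DRIVER_PREFIX + command],
        capture_output=True,
        text=True,
    )


# (exit 3) stands in for `wait $pid` on a process that exited with status 3.
NAIVE = "(exit 3); echo exit=$?"
GUARDED = "ec=0; (exit 3) || ec=$?; echo exit=$ec"

if __name__ == "__main__":
    for label, cmd in (("naive", NAIVE), ("guarded", GUARDED)):
        result = run_like_driver(cmd)
        print(label, result.returncode, repr(result.stdout))
```

The naive chain stops at the failing command and never reaches echo. The guarded chain records the status and keeps going.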
Python f-strings without placeholders fail the build-time linter
NixOS VM test scripts are linted at build time. f-strings without {placeholder} variables (e.g. f"Missing foo in config") cause a build failure: f-string is missing placeholders.
In tests/**/*.py, never use f"..." without at least one {variable} inside. Write static messages as plain "literal" strings, and use "literal" + variable for assertion messages that include dynamic values.
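A minimal sketch of assertion messages that stay clear of the linter rule. The function names and the config shape are hypothetical and exist only to show the two message styles.

```python
def missing_key_message(key):
    # Dynamic value: plain concatenation, no f-string needed.
    return "Missing " + key + " in config"


def check_config(config, key):
    # Static message: a plain literal with no f prefix.
    assert isinstance(config, dict), "config must be a dict"
    # Dynamic message: built by the helper above.
    assert key in config, missing_key_message(key)
```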
Patterns
Regression test quality
Regression tests must fail when the bug is reintroduced. Test the layer where production failed, not a downstream parser or helper that only proves later code works when given correct input.
For error propagation, assert the typed variant and payload. Use exact rendered
strings only for tests whose purpose is to lock Display or user-facing
output. If a change reclassifies an error, production and tests should call the
same mapping helper; do not hand-build the target variant in the test.
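The shape of such a test, sketched in Python for brevity. braid's real errors are Rust types, so every name here (DeviceBusy, Unclassified, map_close_failure, close_device) and the exit code 5 is a hypothetical stand-in. The test asserts the type and payload, and derives its expectation from the same mapping helper the production path calls.

```python
class DeviceBusy(Exception):
    """Hypothetical typed error carrying the device as payload."""

    def __init__(self, device):
        super().__init__(device)
        self.device = device


class Unclassified(Exception):
    """Hypothetical fallback error carrying the raw exit code and stderr."""

    def __init__(self, exit_code, stderr):
        super().__init__(exit_code, stderr)
        self.exit_code = exit_code
        self.stderr = stderr


def map_close_failure(device, exit_code, stderr):
    """Single mapping helper shared by production code and tests."""
    if exit_code == 5:  # illustrative code only
        return DeviceBusy(device)
    return Unclassified(exit_code, stderr)


def close_device(device, run):
    """Production-style caller; `run` returns (exit_code, stderr)."""
    exit_code, stderr = run(device)
    if exit_code != 0:
        raise map_close_failure(device, exit_code, stderr)


def test_busy_close_propagates_typed_error():
    expected = map_close_failure("/dev/mapper/x", 5, "busy")
    try:
        close_device("/dev/mapper/x", lambda device: (5, "busy"))
    except DeviceBusy as err:
        # Assert variant and payload, not the rendered message.
        assert type(err) is type(expected)
        assert err.device == expected.device
    else:
        raise AssertionError("expected DeviceBusy")
```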
For user-visible CLI output or control-flow bugs, prefer a CLI/VM test that
drives the real command. If stdout vs stderr matters, capture them separately
with >stdout 2>stderr; merged streams do not pin routing. Render or preview
helpers that form a user-visible boundary need exact-output coverage for every
branch, including no-op branches.
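Why merged streams do not pin routing can be shown with a small stand-alone sketch. The child command is a hypothetical stand-in that writes one line to each stream. In a VM test the equivalent is the >stdout 2>stderr redirection mentioned above.

```python
import subprocess
import sys

# Hypothetical stand-in for the command under test:
# the result goes to stdout, the warning goes to stderr.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('pool: ok'); print('warning: degraded', file=sys.stderr)",
]


def run_split(argv):
    """Capture the two streams separately so routing can be asserted."""
    return subprocess.run(argv, capture_output=True, text=True)


def run_merged(argv):
    """Merge stderr into stdout; the capture cannot tell the streams apart."""
    return subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
```

With run_split, a test can assert that the warning is absent from stdout. With run_merged, that assertion cannot be written, so a regression that moves the warning to the wrong stream would pass.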
Keep repro tests focused. If adjacent behavior already has dedicated coverage, cite that test instead of bundling another phase into a repro whose failure would become ambiguous.
When a dead test has a name that points at a real user-visible contract, replace it with a real regression test by default. Deleting the dead test turns bad coverage into no coverage.
Live-tool behavior locks
When braid code is changed to depend on a specific external-tool behavior – a particular exit code, a particular output wording, a particular return-value path – mocked unit tests prove the classifier is correct given the assumed behavior, but they do NOT prove the tool still behaves that way. A nixpkgs bump that changed cryptsetup’s exit-code contract would silently misclassify in production while every mocked test still passed.
Whenever a plan introduces a classifier of the form exit_code == <N> or stderr.contains("<wording>") against an external tool, identify (or add) a live-tool repro/VM test that asserts the same code/wording directly. List that test in the plan’s verification section as a required gate. If the live-tool test would be non-trivial to add, pause and reconsider whether the classifier is actually robust.
This is the same family as braid’s parser-compatibility lanes (just test-parsers and just test-rust-unstable; see parser compatibility). Those lock the parser against tool-output drift; a behavior-lock test locks an exit-code or wording classifier against the same drift surface. Reference example: tests/repro/cryptsetup-close-mounted.py asserts exit_code == 5 for busy-close and exit_code == 4 for already-closed, locking the behavior that the retry classifier in cli/src/lock.rs depends on.
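A rough sketch of how a classifier and its behavior lock fit together. This is not the content of tests/repro/cryptsetup-close-mounted.py. The two exit codes are the ones quoted above, while classify_close, its labels, and assert_behavior_lock are hypothetical. A real lock test would pass the argv of the actual tool, run inside the VM.

```python
import subprocess

# Exit codes as quoted in this document for the cryptsetup close case.
BUSY_CLOSE = 5
ALREADY_CLOSED = 4


def classify_close(exit_code):
    """Hypothetical classifier; the labels are illustrative, not braid's."""
    if exit_code == 0:
        return "closed"
    if exit_code == BUSY_CLOSE:
        return "retry"
    if exit_code == ALREADY_CLOSED:
        return "already-closed"
    return "fatal"


def observed_exit_code(argv):
    """Run the real tool and report the exit code it actually returned."""
    return subprocess.run(argv, capture_output=True).returncode


def assert_behavior_lock(argv, expected_code):
    """Fail loudly if the tool's exit code drifts from the classifier's assumption."""
    actual = observed_exit_code(argv)
    assert actual == expected_code, (
        "tool exit code drifted: expected "
        + str(expected_code)
        + ", got "
        + str(actual)
    )
```

The mocked unit test exercises classify_close. The lock test calls assert_behavior_lock against the live tool with the same constants, so a change in the tool's contract fails the lock test instead of passing silently.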
VM and command test design
Before inventing VM setup for missing disks, degraded mounts, ENOSPC, hotplug,
or similar storage state, search tests/cli/, tests/repro/, and tests/hw/
for an existing pattern and reuse it where it fits.
Before proposing a VM test for a mutating command, search the same area for existing notes that say a shape is infeasible, and read sibling tests to learn which seams already exist.
For ordering invariants like “persist state before post-operation maintenance”, prefer a deterministic command-layer failure-injection test: allow the persistence step to succeed, force the next maintenance step to fail, then assert the persisted state is current and the journal still exists.
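The failure-injection shape described above, as a minimal sketch. The operation, the state store, and the journal are hypothetical in-memory stand-ins; only the test structure (persistence succeeds, maintenance is forced to fail, then both invariants are asserted) is the point.

```python
class MaintenanceFailed(Exception):
    """Injected failure for the post-operation maintenance step."""


def run_operation(state, journal, persist, maintain):
    """Hypothetical command: persist state first, then run maintenance."""
    journal.append("operation-started")
    persist(state)   # must complete before any maintenance runs
    maintain()       # may fail; the journal must survive the failure
    journal.clear()  # only reached when every step succeeded


def test_state_persisted_before_maintenance():
    state = {"generation": 1}
    journal = []

    def persist(store):
        store["generation"] = 2  # the persistence step is allowed to succeed

    def failing_maintain():
        raise MaintenanceFailed("injected")  # the next step is forced to fail

    try:
        run_operation(state, journal, persist, failing_maintain)
    except MaintenanceFailed:
        pass
    else:
        raise AssertionError("maintenance failure was swallowed")

    assert state["generation"] == 2          # persisted state is current
    assert journal == ["operation-started"]  # journal still exists
```

If the implementation were reordered to run maintenance before persisting, the first assertion would fail, which is what makes the test a lock on the ordering.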
When code touches kernel async workers, mount-session caches, or device-layer teardown, mocked unit tests are not enough. Run the relevant VM or repro test, inspect full logs when it fails, and repeat timing-sensitive repros enough to rule out a lucky pass.
For cmd_* boolean gates derived from multiple inputs, route both branches
through the same injected seam and test the matrix cells that distinguish the
intended gate from plausible wrong gates.
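A sketch of the matrix approach for a two-input gate. The command, its inputs, and the seam are hypothetical. The intended gate here is assumed to be "mounted and not degraded", and the comments name the wrong gate that each cell rules out.

```python
def cmd_finish(mounted, degraded, seam):
    """Hypothetical command: both branches go through the same injected seam."""
    if mounted and not degraded:
        seam("maintain")
    else:
        seam("skip")


def decision(mounted, degraded):
    """Drive the command with a recording seam and return what it chose."""
    calls = []
    cmd_finish(mounted, degraded, calls.append)
    assert len(calls) == 1  # exactly one branch fired
    return calls[0]


# (mounted, degraded) -> expected decision
MATRIX = {
    (True, False): "maintain",  # the positive cell
    (True, True): "skip",       # rules out the wrong gate `mounted`
    (False, False): "skip",     # rules out `not degraded` and `mounted or not degraded`
    (False, True): "skip",
}


def test_gate_matrix():
    for (mounted, degraded), expected in MATRIX.items():
        assert decision(mounted, degraded) == expected
```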
For one-off sequenced or stateful command-test behavior, prefer a file-local
runner or wrapper over widening the shared MockRunner. Reserve shared runner
API changes for behavior that many tests need.
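A minimal sketch of a file-local sequenced runner. It does not reflect the actual MockRunner API, which this document does not show. SequencedRunner and close_with_retry are hypothetical names that live only in the test file that needs them.

```python
class SequencedRunner:
    """File-local test double: returns scripted exit codes in order."""

    def __init__(self, results):
        self._results = list(results)
        self.calls = []

    def run(self, argv):
        self.calls.append(list(argv))
        if not self._results:
            raise AssertionError("unexpected extra command: " + " ".join(argv))
        return self._results.pop(0)


def close_with_retry(runner, device, attempts=2):
    """Hypothetical command under test: retry the close until it succeeds."""
    for _ in range(attempts):
        if runner.run(["close", device]) == 0:
            return True
    return False


def test_second_attempt_succeeds():
    runner = SequencedRunner([5, 0])  # first call fails, second succeeds
    assert close_with_retry(runner, "vault") is True
    assert runner.calls == [["close", "vault"], ["close", "vault"]]
```

Keeping the sequencing logic in the test file means the shared runner does not grow an API that only one test uses.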
When removing sleep wall-time from tests, inject a sleeper dependency. Do not
use #[cfg(test)] to zero a production timing constant whose value is part of
the behavior.
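Sleeper injection, sketched in Python rather than Rust for brevity. RETRY_DELAY_SECONDS and retry are hypothetical. The production default is the real time.sleep, and the test passes a recording sleeper, so the constant stays untouched and its value can still be asserted.

```python
import time

# Hypothetical production constant; its value is part of the behavior.
RETRY_DELAY_SECONDS = 0.5


def retry(action, attempts, sleep=time.sleep):
    """Call action until it returns True, sleeping between attempts."""
    for attempt in range(attempts):
        if action():
            return True
        if attempt + 1 < attempts:
            sleep(RETRY_DELAY_SECONDS)
    return False


def test_retry_sleeps_between_attempts_without_wall_time():
    slept = []
    outcomes = iter([False, False, True])
    # Inject a sleeper that records the requested delay instead of waiting.
    assert retry(lambda: next(outcomes), attempts=3, sleep=slept.append) is True
    assert slept == [RETRY_DELAY_SECONDS, RETRY_DELAY_SECONDS]
```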
Eval-time test isolation: disable, don’t stub
When an eval-time test (lib.evalModules in isolation) breaks because of a new NixOS option dependency, disable the unrelated feature in the test config rather than expanding the fake module surface with stubs.
Stubbing options (e.g. adding options.users) makes the test less isolated and can mask future accidental dependencies on unrelated NixOS top-level options. Disabling the feature that introduced the dependency keeps the test focused.
When fixing eval-time test failures caused by new module dependencies, first check if the dependency comes from a feature the test doesn’t need. If so, set that feature’s config to its “off” value (e.g. poolAccessGroup = null) instead of adding option stubs.