smartd alert conditions

Reference for what triggers smartd to call the notification script.

braid’s current smartd config

-a -o on -S on -m <nomailer> -M exec ${smartdAlertScript}

-a expands to: -H -f -t -l error -l selftest -l selfteststs -C 197 -U 198

-o on and -S on are non-monitoring config flags (enable offline testing and attribute autosave on the drive).

Wired in modules/braid/monitor.nix (search for smartdAlertScript).

SATA: conditions that fire the alert script

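smartd polls every 30 minutes by default (1800 seconds, adjustable with smartd's -i option). Each condition below has a SMARTD_FAILTYPE value that smartd passes to the script as an environment variable.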

SMARTD_FAILTYPE	Directive	Trigger
`Health`	`-H`	Overall SMART health status = FAILING
`Usage`	`-f`	Any Usage (Old_age) attribute value <= vendor threshold
`ErrorCount`	`-l error`	ATA error log count increased since last poll
`SelfTest`	`-l selftest`	New self-test failures detected
`CurrentPendingSector`	`-C 197`	Non-zero raw value on attr 197
`OfflineUncorrectableSector`	`-U 198`	Non-zero raw value on attr 198
`FailedHealthCheck`	`-H`	SMART health command itself failed
`FailedReadSmartData`		Could not read SMART attribute data
`FailedReadSmartErrorLog`		Could not read SMART error log
`FailedReadSmartSelfTestLog`		Could not read self-test log
`FailedOpenDevice`		`open()` failed – device disappeared
`Temperature`	`-W`	Temperature >= CRIT threshold (NOT in `-a`, must be added explicitly)
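
As an illustration of how a notification script might consume these values, here is a minimal Python sketch. It reads the environment variables smartd exports for `-M exec` scripts (SMARTD_FAILTYPE, SMARTD_DEVICE, SMARTD_MESSAGE) and separates drive-reported problems from failures of the monitoring itself. The category names and the `classify_failtype` and `summarize` helpers are hypothetical. This is not braid's actual alert script.

```python
import os

# Failure types that the table above ties to a condition reported by the drive.
DRIVE_FAILTYPES = {
    "Health", "Usage", "ErrorCount", "SelfTest",
    "CurrentPendingSector", "OfflineUncorrectableSector", "Temperature",
}


def classify_failtype(failtype):
    """Map a SMARTD_FAILTYPE value to a coarse, made-up category."""
    if failtype in DRIVE_FAILTYPES:
        return "drive-problem"
    if failtype.startswith("Failed"):
        # FailedHealthCheck, FailedReadSmartData, FailedOpenDevice, ...
        return "monitoring-failure"
    return "unknown"


def summarize(env=None):
    """Build a one-line summary from the variables smartd exports."""
    env = os.environ if env is None else env
    failtype = env.get("SMARTD_FAILTYPE", "")
    device = env.get("SMARTD_DEVICE", "<unknown device>")
    message = env.get("SMARTD_MESSAGE", "")
    return f"[{classify_failtype(failtype)}] {device}: {failtype}: {message}"
```

Keeping the two categories apart can be useful because a `Failed*` type usually means the probe could not run (for example, the device disappeared), which may call for a different response than a drive reporting bad sectors.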

SATA: what `-a` does NOT alert on

These are only logged to syslog, not sent to the script:

Reallocated_Sector_Ct (5) raw value increases – only alerted if the normalized value crosses the vendor threshold (via -f). To alert on raw value changes, add the directive -R 5! (the trailing ! is part of the directive).
Reported_Uncorrect (187), End-to-End_Error (184), Reallocated_Event_Count (196) – same: threshold breach only via -f, no raw-value alerts.
Temperature – not monitored at all without -W DIFF,INFO,CRIT.
Prefail/Usage attribute value changes – -t (= -p -u) logs these to syslog at LOG_INFO, but does not fire the script.
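
To make the gaps above concrete, the following Python sketch assembles a directive string that extends the current config with raw-value alerts (`-R ID!`) and temperature monitoring (`-W DIFF,INFO,CRIT`). The `build_directives` helper, the script path, and the temperature thresholds are illustrative assumptions, not values taken from braid's config.

```python
def build_directives(script_path, raw_alert_ids=(), temperature=None):
    """Compose a smartd directive string (hypothetical helper).

    raw_alert_ids: attribute IDs whose raw-value changes should fire the script.
    temperature:   optional (diff, info, crit) tuple for -W.
    """
    parts = ["-a", "-o", "on", "-S", "on"]
    for attr_id in raw_alert_ids:
        # The trailing "!" upgrades the report to LOG_CRIT and fires the script.
        parts += ["-R", f"{attr_id}!"]
    if temperature is not None:
        diff, info, crit = temperature
        parts += ["-W", f"{diff},{info},{crit}"]
    parts += ["-m", "<nomailer>", "-M", "exec", script_path]
    return " ".join(parts)
```

For example, `build_directives("/run/alert.sh", raw_alert_ids=(5,), temperature=(0, 45, 55))` returns `-a -o on -S on -R 5! -W 0,45,55 -m <nomailer> -M exec /run/alert.sh`. The 45/55 °C thresholds are placeholders and should be chosen per drive model.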

SATA: syslog-only directives (no script trigger)

Directive	What it monitors
`-p`	Prefail attribute value changes (LOG_INFO)
`-u`	Usage attribute value changes (LOG_INFO)
`-t`	All attribute changes (= `-p -u`)
`-r ID`	Report raw value alongside normalized (informational)
`-R ID` (without `!`)	Track raw value changes (LOG_INFO, no email)
`-R ID!` (with `!`)	Track raw value changes (LOG_CRIT + fires script)
`-l offlinests`	Offline Data Collection status changes (LOG_CRIT, no email)
`-l selfteststs`	Self-Test execution status changes (LOG_CRIT, no email)

NVMe: how `-a` works differently

NVMe has a standardized health model – no vendor-specific attribute IDs or thresholds. The ATA-only directives in braid's config line are silently ignored on NVMe devices: -C 197 and -U 198 (both part of -a), plus the separate -o on and -S on.

NVMe conditions that fire the alert script

SMARTD_FAILTYPE	Directive	Trigger
`Health`	`-H`	Critical Warning byte != 0 (any bit set)
`Usage`	`-f`	Percentage Used > 95% or Media and Data Integrity Errors increased
`ErrorCount`	`-l error`	Error Information Log Entries count increased (device-related errors only, since smartmontools 7.4)
`SelfTest`	`-l selftest`	New self-test failures (requires smartmontools 7.5+)
`FailedHealthCheck`	`-H`	SMART health command itself failed
`FailedReadSmartData`		Could not read SMART data
`FailedReadSmartErrorLog`		Could not read error log
`FailedReadSmartSelfTestLog`		Could not read self-test log
`FailedOpenDevice`		`open()` failed – device disappeared

The Critical Warning byte (`-H`)

A bitmask where any bit set fires the alert:

Bit	Meaning
0	Available spare fallen below threshold
1	Temperature above/below acceptable range
2	Reliability degraded (excessive writes beyond warranty)
3	Media placed in read-only mode
4	Volatile memory backup (power-loss protection capacitor) failed

As of smartmontools 7.5, -H MASK (hex) can ignore specific bits, e.g. -H 0xfb ignores bit 2 (reliability/warranty warning).
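
The bit table and the mask example can be checked with a short Python sketch. It assumes the mask is combined with the Critical Warning byte by bitwise AND, which is consistent with 0xfb ignoring bit 2. The `decode_critical_warning` helper is hypothetical and only mirrors the table above.

```python
# Bit meanings copied from the table above.
CRITICAL_WARNING_BITS = {
    0: "Available spare fallen below threshold",
    1: "Temperature above/below acceptable range",
    2: "Reliability degraded",
    3: "Media placed in read-only mode",
    4: "Volatile memory backup failed",
}


def decode_critical_warning(value, mask=0xFF):
    """Return the meanings of all set bits that survive the mask."""
    effective = value & mask  # assumed semantics: cleared mask bits are ignored
    return [
        meaning
        for bit, meaning in CRITICAL_WARNING_BITS.items()
        if effective & (1 << bit)
    ]
```

With a byte of 0x04 (only bit 2 set), `decode_critical_warning(0x04)` reports the reliability warning, while `decode_critical_warning(0x04, 0xfb)` returns an empty list, so no alert would fire under that mask.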

NVMe syslog-only tracking (no script trigger)

Directive	What it monitors
`-p`	Available Spare changes (LOG_INFO)
`-u`	Percentage Used and Media Errors changes (LOG_INFO)
`-t`	All of the above (= `-p -u`)
`-l selfteststs`	Self-test execution status changes (LOG_CRIT, no email)

NVMe monitoring is more straightforward because the spec defines exactly what “unhealthy” means, whereas SATA relies on vendor-specific attribute definitions and generously set thresholds. The same -a config line works for both – smartd adapts per device type.

SATA attributes worth monitoring

Based on real Seagate SATA output.

Reliable indicators (unambiguous, no vendor-encoding issues)

ID	Name	Notes
5	`Reallocated_Sector_Ct`	Sectors remapped due to read errors
184	`End-to-End_Error`	Internal data path integrity failure. Non-zero = serious.
187	`Reported_Uncorrect`	Uncorrectable errors reported to host. Non-zero = data loss occurred.
196	`Reallocated_Event_Count`	Remap operations (complements attr 5). Non-zero = active reallocation.
197	`Current_Pending_Sector`	Sectors waiting to be remapped
198	`Offline_Uncorrectable`	Sectors unreadable during offline test

Useful but with caveats

ID	Name	Notes
10	`Spin_Retry_Count`	Failed spin-up. Non-zero = mechanical trouble.
188	`Command_Timeout`	High values = dying drive, but some timeouts normal during power events.

Avoid using raw values for comparison

ID	Name	Why
1	`Raw_Read_Error_Rate`	Seagate packs a composite value into the 48-bit raw field (commonly interpreted as error count in the upper 16 bits and total operations in the lower 32 bits). The raw number is meaningless for threshold comparison. Other vendors vary too.
7	`Seek_Error_Rate`	Same Seagate composite encoding.

Not disk errors

ID	Name	Why
191	`G-Sense_Error_Rate`	Shock sensor. Low values normal for a moved drive.
193	`Load_Cycle_Count`	Wear indicator, not an error.
199	`UDMA_CRC_Error_Count`	Almost always a cable/connection problem, not the drive.

Relationship to braid’s live SMART classifier (`SmartEvidence`)

braid runs its own live SMART probe: parse_smartctl (in cli/src/parse/smartctl.rs) builds a SmartEvidence from smartctl -H -A --json output. On ATA drives it reads the raw values of three attributes: Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. On NVMe drives it reads the NVMe health-information log. This verdict now feeds both braid status and the TUI – the same per-disk probe surfaces in status output (the smart JSON object and the SMART: text line) and in the TUI disk-detail panel.

This is complementary to smartd, not a replacement: smartd handles real-time alerts (with its own set of checks), while braid’s classifier gives at-a-glance diagnostic status. Critically, the live classifier is diagnostic only – a degraded SmartEvidence never raises an AlertCause. smartd remains the sole SMART alert source (it writes the smartd-alert flag that drives AlertCause::SmartdAlert); see ADR 014 and ADR 030. The two SMART signals don’t need to be identical but should cover the same ground between them.
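
For readers unfamiliar with the smartctl JSON layout, the probe described above can be approximated in a few lines of Python. This is only a sketch. braid's real parser is the Rust parse_smartctl, and the "degraded" rule below (health check failed, or any watched raw value above zero) is an assumption made for illustration. The field names (`smart_status.passed`, `ata_smart_attributes.table[].raw.value`) follow smartctl's `--json` output for ATA drives.

```python
import json

# The three ATA attributes named above.
WATCHED_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def watched_raw_values(smartctl_json):
    """Extract raw values of the watched attributes from smartctl JSON text."""
    report = json.loads(smartctl_json)
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["name"]: attr["raw"]["value"]
        for attr in table
        if attr.get("name") in WATCHED_ATTRIBUTES
    }


def looks_degraded(smartctl_json):
    """Assumed rule: failed health check or any non-zero watched raw value."""
    report = json.loads(smartctl_json)
    passed = report.get("smart_status", {}).get("passed", True)
    raw_values = watched_raw_values(smartctl_json)
    return (not passed) or any(value > 0 for value in raw_values.values())
```

Note that a check like this compares raw values against zero, whereas smartd's -f compares normalized values against vendor thresholds, which is one reason the two signals can disagree.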
