smartd alert conditions

Reference for what triggers smartd to call the notification script.

braid’s current smartd config

-a -o on -S on -m <nomailer> -M exec ${smartdAlertScript}

-a expands to: -H -f -t -l error -l selftest -l selfteststs -C 197 -U 198

-o on and -S on are non-monitoring config flags (enable offline testing and attribute autosave on the drive).

Wired in modules/braid/monitor.nix (search for smartdAlertScript).
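
To make the effective directive set explicit, the expansion of -a described above can be sketched in a few lines of Python. This is only an illustration: the function name expand_directives and the script path are hypothetical placeholders, and smartd performs this expansion internally.

```python
# Directives that -a stands for, as listed in this section.
ALL_EXPANSION = [
    "-H", "-f", "-t",
    "-l", "error",
    "-l", "selftest",
    "-l", "selfteststs",
    "-C", "197",
    "-U", "198",
]


def expand_directives(line: str) -> list[str]:
    """Return the tokens of a directive line with every -a expanded."""
    expanded = []
    for token in line.split():
        if token == "-a":
            expanded.extend(ALL_EXPANSION)
        else:
            expanded.append(token)
    return expanded


# The script path below is a placeholder for ${smartdAlertScript}.
config = "-a -o on -S on -m <nomailer> -M exec /path/to/alert-script"
print(" ".join(expand_directives(config)))
```

Reading the expanded line makes it easier to see which directives are alerting (-H, -f, -l error, -l selftest, -C, -U) and which only log (-t, -l selfteststs).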

SATA: conditions that fire the alert script

smartd polls every 30 minutes (its default interval). Each condition has a SMARTD_FAILTYPE value, which smartd passes to the script as an environment variable.

| SMARTD_FAILTYPE | Directive | Trigger |
|---|---|---|
| Health | -H | Overall SMART health status = FAILING |
| Usage | -f | Any Usage (Old_age) attribute value <= vendor threshold |
| ErrorCount | -l error | ATA error log count increased since last poll |
| SelfTest | -l selftest | New self-test failures detected |
| CurrentPendingSector | -C 197 | Non-zero raw value on attr 197 |
| OfflineUncorrectableSector | -U 198 | Non-zero raw value on attr 198 |
| FailedHealthCheck | -H | SMART health command itself failed |
| FailedReadSmartData | | Could not read SMART attribute data |
| FailedReadSmartErrorLog | | Could not read SMART error log |
| FailedReadSmartSelfTestLog | | Could not read self-test log |
| FailedOpenDevice | | open() failed – device disappeared |
| Temperature | -W | Temperature >= CRIT threshold (NOT in -a, must be added explicitly) |
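
As a rough illustration of how a notification script might consume these values, the sketch below reads the SMARTD_FAILTYPE, SMARTD_DEVICE and SMARTD_MESSAGE environment variables that smartd sets for -M exec scripts. The grouping into "disk-fault" and "monitoring-fault" and the function names are assumptions made for this example; this is not braid's actual smartdAlertScript.

```python
import os

# FAILTYPE values from the table above, grouped by a hypothetical severity.
DISK_FAULT = {
    "Health", "Usage", "ErrorCount", "SelfTest",
    "CurrentPendingSector", "OfflineUncorrectableSector", "Temperature",
}
MONITORING_FAULT = {
    "FailedHealthCheck", "FailedReadSmartData", "FailedReadSmartErrorLog",
    "FailedReadSmartSelfTestLog", "FailedOpenDevice",
}


def classify(failtype: str) -> str:
    """Map a SMARTD_FAILTYPE value to an illustrative category."""
    if failtype in DISK_FAULT:
        return "disk-fault"
    if failtype in MONITORING_FAULT:
        return "monitoring-fault"
    return "unknown"


def summarize(env) -> str:
    """Build a one-line summary from the variables smartd exports."""
    failtype = env.get("SMARTD_FAILTYPE", "")
    device = env.get("SMARTD_DEVICE", "unknown device")
    message = env.get("SMARTD_MESSAGE", "")
    return f"[{classify(failtype)}] {device}: {failtype}: {message}"


if __name__ == "__main__":
    print(summarize(os.environ))
```

Separating "the disk reported a problem" from "smartd could not talk to the disk" is one possible design choice; both still warrant attention, since a device that disappears may itself be failing.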

SATA: what -a does NOT alert on

These are only logged to syslog, not sent to the script:

  • Reallocated_Sector_Ct (5) raw value increases – only alerted if value crosses the vendor threshold (via -f). To alert on raw value changes, add -R 5!.
  • Reported_Uncorrect (187), End-to-End_Error (184), Reallocated_Event_Count (196) – same: threshold breach only via -f, no raw-value alerts.
  • Temperature – not monitored at all without -W DIFF,INFO,CRIT.
  • Prefail/Usage attribute value changes – -t (= -p -u) logs these to syslog at LOG_INFO, but does not fire the script.

SATA: syslog-only directives (no script trigger)

| Directive | What it monitors |
|---|---|
| -p | Prefail attribute value changes (LOG_INFO) |
| -u | Usage attribute value changes (LOG_INFO) |
| -t | All attribute changes (= -p -u) |
| -r ID | Report raw value alongside normalized (informational) |
| -R ID (without !) | Track raw value changes (LOG_INFO, no email) |
| -R ID! (with !) | Track raw value changes (LOG_CRIT + fires script) |
| -l offlinests | Offline Data Collection status changes (LOG_CRIT, no email) |
| -l selfteststs | Self-Test execution status changes (LOG_CRIT, no email) |

NVMe: how -a works differently

NVMe has a standardized health model – no vendor-specific attribute IDs or thresholds. The ATA-only directives in this config (-C 197 and -U 198 from -a, plus -o on and -S on) are silently ignored.

NVMe conditions that fire the alert script

| SMARTD_FAILTYPE | Directive | Trigger |
|---|---|---|
| Health | -H | Critical Warning byte != 0 (any bit set) |
| Usage | -f | Percentage Used > 95% or Media and Data Integrity Errors increased |
| ErrorCount | -l error | Error Information Log Entries count increased (device-related errors only, since smartmontools 7.4) |
| SelfTest | -l selftest | New self-test failures (requires smartmontools 7.5+) |
| FailedHealthCheck | -H | SMART health command itself failed |
| FailedReadSmartData | | Could not read SMART data |
| FailedReadSmartErrorLog | | Could not read error log |
| FailedReadSmartSelfTestLog | | Could not read self-test log |
| FailedOpenDevice | | open() failed – device disappeared |
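
The Usage row combines a fixed threshold with a change-since-last-poll check. A minimal sketch of that logic, assuming two successive snapshots keyed like the nvme_smart_health_information_log object in smartctl --json output (percentage_used, media_errors), might look like the following. The function name and the dict-based snapshots are hypothetical; smartd implements its own check internally.

```python
def nvme_usage_reasons(prev: dict, curr: dict, used_limit: int = 95) -> list[str]:
    """Return the reasons a Usage alert would fire, per the table above."""
    reasons = []
    # Threshold check: Percentage Used strictly above the limit.
    if curr["percentage_used"] > used_limit:
        reasons.append(f"Percentage Used {curr['percentage_used']}% > {used_limit}%")
    # Delta check: the media error counter grew since the previous poll.
    if curr["media_errors"] > prev["media_errors"]:
        reasons.append(
            f"Media and Data Integrity Errors increased "
            f"from {prev['media_errors']} to {curr['media_errors']}"
        )
    return reasons


previous = {"percentage_used": 90, "media_errors": 0}
current = {"percentage_used": 96, "media_errors": 2}
for reason in nvme_usage_reasons(previous, current):
    print(reason)
```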

The Critical Warning byte (-H)

A bitmask where any bit set fires the alert:

| Bit | Meaning |
|---|---|
| 0 | Available spare fallen below threshold |
| 1 | Temperature above/below acceptable range |
| 2 | Reliability degraded (excessive writes beyond warranty) |
| 3 | Media placed in read-only mode |
| 4 | Volatile memory backup (power-loss protection capacitor) failed |

As of smartmontools 7.5, -H MASK (hex) can ignore specific bits, e.g. -H 0xfb ignores bit 2 (reliability/warranty warning).
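
The bitmask and the effect of a mask such as 0xfb can be illustrated with a short decoder. This is a sketch for reading the byte by hand (for example from the critical_warning field of smartctl --json output), not smartd's implementation; the function name active_warnings is hypothetical.

```python
# Bit meanings from the table above.
CRITICAL_WARNING_BITS = {
    0: "Available spare fallen below threshold",
    1: "Temperature above/below acceptable range",
    2: "Reliability degraded",
    3: "Media placed in read-only mode",
    4: "Volatile memory backup failed",
}


def active_warnings(critical_warning: int, mask: int = 0xFF) -> list[str]:
    """Decode the Critical Warning byte, keeping only bits set in mask."""
    masked = critical_warning & mask
    return [
        meaning
        for bit, meaning in CRITICAL_WARNING_BITS.items()
        if masked & (1 << bit)
    ]


# 0x05 has bits 0 and 2 set.
print(active_warnings(0x05))
# With mask 0xfb, bit 2 is ignored and only bit 0 remains.
print(active_warnings(0x05, mask=0xFB))
```

A byte of 0x04 (only the reliability bit) decodes to an empty list under mask 0xfb, which corresponds to no Health alert being raised for that bit.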

NVMe syslog-only tracking (no script trigger)

| Directive | What it monitors |
|---|---|
| -p | Available Spare changes (LOG_INFO) |
| -u | Percentage Used and Media Errors changes (LOG_INFO) |
| -t | All of the above (= -p -u) |
| -l selfteststs | Self-test execution status changes (LOG_CRIT, no email) |

NVMe vs SATA summary

NVMe monitoring is more straightforward because the spec defines exactly what “unhealthy” means, whereas SATA relies on vendor-specific attribute definitions and generously-set thresholds. The same -a config line works for both – smartd adapts per device type.

SATA attributes worth monitoring

Based on real Seagate SATA output.

Reliable indicators (unambiguous, no vendor-encoding issues)

| ID | Name | Notes |
|---|---|---|
| 5 | Reallocated_Sector_Ct | Sectors remapped due to read errors |
| 184 | End-to-End_Error | Internal data path integrity failure. Non-zero = serious. |
| 187 | Reported_Uncorrect | Uncorrectable errors reported to host. Non-zero = data loss occurred. |
| 196 | Reallocated_Event_Count | Remap operations (complements attr 5). Non-zero = active reallocation. |
| 197 | Current_Pending_Sector | Sectors waiting to be remapped |
| 198 | Offline_Uncorrectable | Sectors unreadable during offline test |

Useful but with caveats

| ID | Name | Notes |
|---|---|---|
| 10 | Spin_Retry_Count | Failed spin-up. Non-zero = mechanical trouble. |
| 188 | Command_Timeout | High values = dying drive, but some timeouts normal during power events. |

Avoid using raw values for comparison

| ID | Name | Why |
|---|---|---|
| 1 | Raw_Read_Error_Rate | Seagate packs composite value (errors in lower bits, total ops in upper). Raw number is meaningless for threshold comparison. Other vendors vary too. |
| 7 | Seek_Error_Rate | Same Seagate composite encoding. |

Not disk errors

| ID | Name | Why |
|---|---|---|
| 191 | G-Sense_Error_Rate | Shock sensor. Low values normal for a moved drive. |
| 193 | Load_Cycle_Count | Wear indicator, not an error. |
| 199 | UDMA_CRC_Error_Count | Almost always a cable/connection problem, not the drive. |

Relationship to braid’s live SMART classifier (SmartEvidence)

braid runs its own live SMART probe: parse_smartctl (in cli/src/parse/smartctl.rs) builds a SmartEvidence from smartctl -H -A --json output, reading the raw values of 3 ATA attributes: Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable (plus the NVMe health-information log on NVMe drives). This verdict now feeds both braid status and the TUI – the same per-disk probe surfaces in status output (the smart JSON object and the SMART: text line) and in the TUI disk-detail panel.

This is complementary to smartd, not a replacement: smartd handles real-time alerts (with its own set of checks), while braid’s classifier gives at-a-glance diagnostic status. Critically, the live classifier is diagnostic only – a degraded SmartEvidence never raises an AlertCause. smartd remains the sole SMART alert source (it writes the smartd-alert flag that drives AlertCause::SmartdAlert); see ADR 014 and ADR 030. The two SMART signals don’t need to be identical but should cover the same ground between them.
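
For orientation, reading the three raw values out of smartctl -H -A --json output can be sketched as below. The JSON paths used (smart_status.passed and ata_smart_attributes.table[].raw.value) follow smartctl's JSON output; everything else, including the function names and the "any non-zero raw value is degraded" rule, is an assumption for illustration and not braid's parse_smartctl or its actual SmartEvidence logic.

```python
import json

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def read_watched_raw_values(smartctl_json: str) -> dict[str, int]:
    """Extract raw values of the watched ATA attributes from smartctl JSON."""
    report = json.loads(smartctl_json)
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["name"]: attr["raw"]["value"]
        for attr in table
        if attr.get("name") in WATCHED
    }


def is_degraded(smartctl_json: str) -> bool:
    """Hypothetical rule: failed health check or any non-zero watched raw value."""
    report = json.loads(smartctl_json)
    passed = report.get("smart_status", {}).get("passed", True)
    raw_values = read_watched_raw_values(smartctl_json)
    return (not passed) or any(value > 0 for value in raw_values.values())


# Hand-written sample shaped like smartctl output, not real drive data.
sample = json.dumps({
    "smart_status": {"passed": True},
    "ata_smart_attributes": {"table": [
        {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
        {"id": 9, "name": "Power_On_Hours", "raw": {"value": 1234}},
        {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}},
        {"id": 198, "name": "Offline_Uncorrectable", "raw": {"value": 0}},
    ]},
})
print(read_watched_raw_values(sample))
print(is_degraded(sample))
```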