To ensure serviceability, the driver must be able to take the following actions:
Detect faulty devices and report the fault
Remove a device as supported by the Solaris hot-plug model
Add a new device as supported by the Solaris hot-plug model
Perform periodic health checks to detect latent faults
Periodic Health Checks
A latent fault is one that does not reveal itself until some other
action occurs. For example, a hardware failure in a device that serves as
a cold standby could remain undetected until a fault occurs on the master
device. At that point, the system contains two defective devices and might
be unable to continue operating.
Latent faults that remain undetected typically cause system failure eventually. Without latent fault
checking, the overall availability of a redundant system is jeopardized. To avoid this
situation, a device driver must detect latent faults and report them in the
same way as other faults.
You should provide the driver with a mechanism for making periodic health checks
on the device. In a fault-tolerant configuration where the device can be the
secondary or failover device, early detection of a failure in the secondary device
is essential so that the secondary device can be repaired or replaced before the
primary device fails.
Periodic health checks can be used to perform the following activities:
Check any register or memory location on the device whose value might have been altered since the last poll.
Features of a device that typically exhibit deterministic behavior include heartbeat semaphores, device timers (for example, a local lbolt counter maintained by downloaded firmware), and event counters. Reading an updated, predictable value from such a feature gives some confidence that the device is operating correctly.
Timestamp outgoing requests such as transmit blocks or commands that are issued by the driver.
The periodic health check can then look for requests that have been outstanding longer than expected without completing.
Initiate an action on the device that should be completed before the next scheduled check.
If the action makes the device raise an interrupt, this check is an ideal way to confirm that the device's circuitry can deliver interrupts.
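The three activities above can be combined into a single check routine. The following is a minimal user-space sketch, not a real driver: the device is modeled as a plain C structure, and every name (`hc_device`, `hc_check`, `hc_intr`, the `HC_*` fault bits) is hypothetical. In an actual Solaris driver, the routine would typically run from a callback that re-arms itself with `timeout(9F)`, the tick count could come from `ddi_get_lbolt(9F)`, register reads would go through the DDI access routines, and any detected fault would be reported through the driver's normal fault-reporting path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a device; not a real DDI structure. */
#define HC_MAX_REQS 8

struct hc_request {
    bool     in_use;      /* slot holds an outstanding request */
    uint64_t issued_at;   /* tick count when the driver issued it */
};

struct hc_device {
    volatile uint32_t heartbeat;      /* advanced by device firmware */
    volatile bool     test_intr_seen; /* set by the interrupt handler */
    uint32_t last_heartbeat;          /* value seen at the previous check */
    bool     test_intr_pending;       /* a test interrupt was requested */
    struct hc_request reqs[HC_MAX_REQS];
};

/* Bit mask of faults that one health-check pass can detect. */
enum hc_fault {
    HC_OK                = 0,
    HC_HEARTBEAT_STALLED = 1 << 0,
    HC_REQUEST_STUCK     = 1 << 1,
    HC_NO_INTERRUPT      = 1 << 2
};

/* Called from the (modeled) interrupt handler when the test interrupt arrives. */
void hc_intr(struct hc_device *dev)
{
    dev->test_intr_seen = true;
}

/*
 * One pass of the periodic health check. 'now' is the current tick count
 * and 'max_age' is the longest a request may stay outstanding, in ticks.
 * Returns a bit mask of detected faults (HC_OK if none).
 */
unsigned hc_check(struct hc_device *dev, uint64_t now, uint64_t max_age)
{
    unsigned faults = HC_OK;

    assert(dev != NULL);

    /* 1. A heartbeat that has not changed since the last poll is suspect. */
    uint32_t hb = dev->heartbeat;
    if (hb == dev->last_heartbeat)
        faults |= HC_HEARTBEAT_STALLED;
    dev->last_heartbeat = hb;

    /* 2. Flag timestamped requests that have been outstanding too long. */
    for (size_t i = 0; i < HC_MAX_REQS; i++) {
        const struct hc_request *r = &dev->reqs[i];
        if (r->in_use && now - r->issued_at > max_age)
            faults |= HC_REQUEST_STUCK;
    }

    /* 3. The test interrupt requested on the previous pass should have arrived. */
    if (dev->test_intr_pending && !dev->test_intr_seen)
        faults |= HC_NO_INTERRUPT;

    /*
     * Request a fresh test interrupt to verify on the next pass. A real
     * driver would write a device register here; this model only records
     * that the interrupt is expected.
     */
    dev->test_intr_seen = false;
    dev->test_intr_pending = true;

    return faults;
}
```

Note that `last_heartbeat` starts at zero in this model, so a device whose heartbeat still reads zero is flagged on the first pass; a driver would normally seed that value when it attaches to the device.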