OpenSolaris 2008 - Driver Hardening Test Harness

Driver Hardening Test Harness

The driver hardening test harness tests that the I/O fault services and defensive programming requirements have been correctly fulfilled. Hardened device drivers are resilient to potential hardware faults. You must test the resilience of device drivers as part of the driver development process. This type of testing requires that the driver handle a wide range of typical hardware faults in a controlled and repeatable way. The driver hardening test harness enables you to simulate such hardware faults in software.

The driver hardening test harness is a Solaris device driver development tool. The test harness injects a wide range of simulated hardware faults when the driver under development accesses its hardware. This section describes how to configure the test harness, create error-injection specifications (referred to as errdefs), and execute the tests on your device driver.

The test harness intercepts calls from the driver to various DDI routines, then corrupts the result of the calls as if the hardware had caused the corruption. In addition, the harness allows for corruption of accesses to specific registers as well as definition of more random types of corruption.

The test harness can generate test scripts automatically by tracing all register accesses as well as direct memory access (DMA) and interrupt usage during the running of a specified workload. A script is generated that reruns that workload while injecting a set of faults into each access.

The driver tester should remove duplicate test cases from the generated scripts.

The test harness is implemented as a device driver called bofi, which stands for bus_ops fault injection, and two user-level utilities, th_define(1M) and th_manage(1M).

The test harness does the following tasks:

Validates compliant use of Solaris DDI services
Facilitates controlled corruption of programmed I/O (PIO) and DMA requests and interference with interrupts, thus simulating faults that occur in the hardware managed by the driver
Facilitates simulation of failures in the data path between the CPU and the device, which are reported from parent nexus drivers
Monitors a driver's access during a specified workload and generates fault-injection scripts

Fault Injection

The driver hardening test harness intercepts and, when requested, corrupts each access a driver makes to its hardware. This section provides information you should understand to create faults to test the resilience of your driver.

Solaris devices are managed inside a tree-like structure called the device tree (devinfo tree). Each node of the devinfo tree stores information that relates to a particular instance of a device in the system. Each leaf node corresponds to a device driver, while all other nodes are called nexus nodes. Typically, a nexus represents a bus. A bus node isolates leaf drivers from bus dependencies, which enables architecturally independent drivers to be produced.
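To make the leaf/nexus distinction concrete, the following C sketch models a tiny devinfo tree. The struct devnode type, its fields, and the helpers is_nexus() and find_nexus() are hypothetical; the real kernel node type, dev_info_t, is opaque to drivers. The sketch only shows how a leaf node reaches its hardware through ancestor nexus nodes, such as a PCI bus.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified devinfo node. The real kernel type is
 * dev_info_t, whose layout is private to the DDI framework. */
struct devnode {
    const char *name;          /* driver (leaf) or bus (nexus) name */
    int instance;
    struct devnode *parent;    /* NULL for the root of the tree */
    int nchildren;             /* 0 for a leaf node */
};

/* Following the definition above: a node without children is a leaf
 * (device driver); every other node is a nexus node. */
static int is_nexus(const struct devnode *n)
{
    return n->nchildren > 0;
}

/* Return the nearest ancestor nexus named bus, or NULL if there is none.
 * A leaf driver reaches its hardware only through such ancestors. */
static const struct devnode *find_nexus(const struct devnode *leaf,
                                        const char *bus)
{
    assert(leaf != NULL && bus != NULL);
    for (const struct devnode *p = leaf->parent; p != NULL; p = p->parent)
        if (is_nexus(p) && strcmp(p->name, bus) == 0)
            return p;
    return NULL;
}
```

For example, with a root node whose child is a pci nexus that in turn holds an xyznetdrv leaf, find_nexus() on the leaf returns the pci node when asked for "pci" and NULL when asked for any other bus name.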

Many of the DDI functions, particularly the data access functions, result in upcalls to the bus nexus drivers. When a leaf driver accesses its hardware, it passes a handle to an access routine. The bus nexus understands how to manipulate the handle and fulfill the request. A DDI-compliant driver only accesses hardware through use of these DDI access routines. The test harness intercepts these upcalls before they reach the specified bus nexus. If the data access matches the criteria specified by the driver tester, the access is corrupted. If the data access does not match the criteria, it is given to the bus nexus to handle in the usual way.

A driver obtains an access handle by using the ddi_regs_map_setup(9F) function:

ddi_regs_map_setup(dip, rset, &ma, offset, size, &accattr, &handle)

The arguments specify which “offboard” memory is to be mapped. The driver must use the returned handle when it references the mapped I/O addresses, because handles are meant to isolate drivers from the details of bus hierarchies. Therefore, do not directly use the returned mapped address, ma. Direct use of the mapped address bypasses the data access functions, which defeats both current and future uses of the data access mechanism, including interception by this test harness.

For programmed I/O, the suite of data access functions is:

I/O to Host:

ddi_getX(handle, ma)
ddi_rep_getX(handle, buf, ma, repcnt, flag)

Host to I/O:

ddi_putX(handle, ma, value)
ddi_rep_putX(handle, buf, ma, repcnt, flag)

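X is the bus transfer size in bits: 8, 16, 32, or 64. repcnt is the number of X-bit items that ddi_rep_getX() and ddi_rep_putX() transfer, and flag selects whether the device address is incremented after each item (DDI_DEV_AUTOINCR) or stays the same (DDI_DEV_NO_AUTOINCR).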
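The difference between the transfer width X and the repetition count repcnt is easiest to see in a simulation. The following C sketch models a 32-bit repeated read; sim_rep_get32() and the SIM_DEV_* constants are hypothetical stand-ins for ddi_rep_get32(9F) and its DDI_DEV_AUTOINCR and DDI_DEV_NO_AUTOINCR flags, and a plain array stands in for mapped device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for DDI_DEV_NO_AUTOINCR and DDI_DEV_AUTOINCR;
 * the values are illustrative only. */
enum { SIM_DEV_NO_AUTOINCR = 0, SIM_DEV_AUTOINCR = 1 };

/* Simulated repeated read: copy repcnt 32-bit items (so repcnt * 4 bytes
 * in total) from device address dev into host buffer buf. With AUTOINCR
 * the device address advances after each item; with NO_AUTOINCR every
 * item is read from the same register, for example a FIFO data port. */
static void sim_rep_get32(uint32_t *buf, const volatile uint32_t *dev,
                          size_t repcnt, int flag)
{
    assert(buf != NULL && dev != NULL);
    for (size_t i = 0; i < repcnt; i++) {
        buf[i] = *dev;
        if (flag == SIM_DEV_AUTOINCR)
            dev++;
    }
}
```

With AUTOINCR, a repcnt of 3 reads three consecutive registers; with NO_AUTOINCR, all three items come from the first register.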

DMA has a similar, yet richer, set of data access functions.

Setting Up the Test Harness

The driver hardening test harness is part of the Solaris Developer Cluster. If you have not installed this Solaris cluster, you must manually install the test harness packages appropriate for your platform.

Installing the Test Harness

To install the test harness packages (SUNWftduu and SUNWftdur), use the pkgadd(1M) command.

As superuser, go to the directory in which the packages are located and type:

# pkgadd -d . SUNWftduu SUNWftdur

Configuring the Test Harness

The test harness behavior is controlled by boot-time properties that are set in the /kernel/drv/bofi.conf configuration file. After the test harness is installed, set these properties to configure the harness to interact with your driver. When the configuration is complete, reboot the system to load the harness driver.

When the harness is first installed, enable the harness to intercept the DDI accesses made by your driver by setting these properties:

bofi-nexus: Bus nexus type, such as the PCI bus
bofi-to-test: Name of the driver under test

For example, to test a PCI bus network driver called xyznetdrv, set the following property values:

bofi-nexus="pci";
bofi-to-test="xyznetdrv";

Other properties control how the harness checks the driver's use of the Solaris DDI data access mechanisms, both for PIO reads and writes to peripherals and for DMA transfers to and from peripherals.

bofi-range-check: When this property is set, the test harness checks the consistency of the arguments that are passed to PIO data access functions.
bofi-ddi-check: When this property is set, the test harness verifies that the mapped address that is returned by ddi_regs_map_setup(9F) is not used outside of the context of the data access functions.
bofi-sync-check: When this property is set, the test harness verifies correct usage of DMA functions and ensures that the driver makes compliant use of ddi_dma_sync(9F).
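Why a missing ddi_dma_sync(9F) call is a real bug, even when the driver appears to work, can be shown with a simplified model in which the device and the CPU each see their own copy of a DMA buffer. The struct dma_buf type, sim_dma_sync(), and the SIM_SYNC_* directions are hypothetical; on real systems the two views differ because of caches or I/O buffering rather than explicit copies, and this sketch does not describe how bofi implements its sync check.

```c
#include <assert.h>
#include <string.h>

#define BUFSZ 8

/* Hypothetical DMA buffer with two views: the copy the device writes
 * into and the copy the driver (CPU) reads. Data moves between the
 * views only when the driver calls sim_dma_sync(). */
struct dma_buf {
    unsigned char dev_view[BUFSZ];
    unsigned char cpu_view[BUFSZ];
};

/* Modeled on the DDI_DMA_SYNC_FORCPU and DDI_DMA_SYNC_FORDEV directions. */
enum sync_dir { SIM_SYNC_FORCPU, SIM_SYNC_FORDEV };

static void sim_dma_sync(struct dma_buf *b, enum sync_dir dir)
{
    assert(b != NULL);
    if (dir == SIM_SYNC_FORCPU)
        memcpy(b->cpu_view, b->dev_view, BUFSZ);  /* device -> CPU */
    else
        memcpy(b->dev_view, b->cpu_view, BUFSZ);  /* CPU -> device */
}
```

Until the driver syncs for the CPU, it reads stale data. On hardware where the views happen to stay coherent, the same bug may go unnoticed, which is why a checking option is useful.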

Testing the Driver

This section describes how to create and inject faults by using the th_define(1M) and th_manage(1M) commands.

Creating Faults

The th_define utility provides an interface to the bofi device driver for defining errdefs. An errdef corresponds to a specification for how to corrupt a device driver's accesses to its hardware. The th_define command-line arguments determine the precise nature of the fault to be injected. If the supplied arguments define a consistent errdef, the th_define process stores the errdef with the bofi driver. The process suspends itself until the criteria given by the errdef are satisfied. In practice, the suspension ends when the access counts go to zero (0).

Injecting Faults

The test harness operates at the level of data accesses. A data access has the following characteristics:

Type of hardware being accessed (driver name)
Instance of the hardware being accessed (driver instance)
Register set being tested
Subset of the register set that is targeted
Direction of the transfer (read or write)
Type of access (PIO or DMA)
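An intercepted access, and the part of an errdef that selects accesses, might be modeled as follows. The structs, the ACC_* bit values, and errdef_matches() are hypothetical; in particular, this sketch treats any overlap with the offset window as qualifying, which may differ from the harness's actual rule.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit values for access types; not bofi's real encoding. */
enum acc_type {
    ACC_PIO_R = 1 << 0, ACC_PIO_W = 1 << 1,
    ACC_DMA_R = 1 << 2, ACC_DMA_W = 1 << 3
};

/* One intercepted data access, described by the characteristics above. */
struct access {
    const char *driver;     /* type of hardware (driver name) */
    int instance;           /* driver instance */
    int rnumber;            /* register set */
    size_t offset, len;     /* subset of the register set touched */
    enum acc_type type;     /* PIO or DMA, read or write */
};

/* Selection part of an errdef (compare -n, -i, -r, -l, -a). */
struct errdef_sel {
    const char *driver;
    int instance, rnumber;
    size_t offset, len;
    unsigned acc_mask;      /* OR of acc_type bits */
};

/* An access qualifies if it targets the same driver instance and
 * register set, overlaps [offset, offset + len), and has a selected type. */
static int errdef_matches(const struct errdef_sel *e, const struct access *a)
{
    assert(e != NULL && a != NULL);
    if (strcmp(e->driver, a->driver) != 0 || e->instance != a->instance ||
        e->rnumber != a->rnumber)
        return 0;
    if (a->offset >= e->offset + e->len || e->offset >= a->offset + a->len)
        return 0;
    return (e->acc_mask & a->type) != 0;
}
```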

The test harness intercepts data accesses and injects appropriate faults into the driver. An errdef, specified by the th_define(1M) command, encodes the following information:

The driver instance and register set being tested (-n name, -i instance, and -r reg_number).
The subset of the register set eligible for corruption. This subset is indicated by providing an offset into the register set and a length from that offset (-l offset [len]).
The kind of access to be intercepted: log, pio, dma, pio_r, pio_w, dma_r, dma_w, intr (-a acc_types).
How many accesses should be faulted (-c count [failcount]).
The kind of corruption that should be applied to a qualifying access (-o operator [operand]).
- Replace datum with a fixed value (EQUAL)
- Perform a bitwise operation on the datum (AND, OR, XOR)
- Ignore the transfer, for host-to-I/O accesses only (NO_TRANSFER)
- Lose, delay, or inject spurious interrupts (LOSE, DELAY, EXTRA)

Use the -a acc_chk option to simulate framework faults in an errdef.

Fault-Injection Process

The process of injecting a fault involves two phases:

Use the th_define(1M) command to create errdefs.
The th_define command creates errdefs by passing test definitions to the bofi driver, which stores the definitions so that they can be accessed by using the th_manage(1M) command.
Create a workload, then use the th_manage command to activate and manage the errdef.
The th_manage command is a user interface to the various ioctls that are recognized by the bofi harness driver. The th_manage command operates at the level of driver names and instances and includes these commands: get_handles to list access handles, start to activate errdefs, and stop to deactivate errdefs.
Activating an errdef causes qualifying data accesses to be faulted. The th_manage utility also supports these commands: broadcast to provide the current state of the errdef and clear_errors to clear the errdef.
See the th_define(1M) and th_manage(1M) man pages for more information.

Test Harness Warnings

You can configure the test harness to handle warning messages in the following ways:

Write warning messages to the console
Write warning messages to the console and then panic the system

Use the second method to help pinpoint the root cause of a problem, because the system stops at the point where the problem is detected.

When the bofi-range-check property value is set to warn, the harness prints the following messages (or panics if set to panic) when it detects a range violation of a DDI function by your driver:

ddi_getX() out of range addr %x not in %x
ddi_putX() out of range addr %x not in %x
ddi_rep_getX() out of range addr %x not in %x
ddi_rep_putX() out of range addr %x not in %x

X is 8, 16, 32, or 64.
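A range check of the kind that bofi-range-check enables could look like the following sketch. The function names, the check_mode and check_result enums, and the simplified message are hypothetical; in particular, a real panic setting halts the system, whereas this sketch returns ACCESS_PANIC so that it can be tested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum check_mode { CHECK_WARN, CHECK_PANIC };   /* cf. warn and panic */
enum check_result { ACCESS_OK, ACCESS_WARNED, ACCESS_PANIC };

/* An access of width bytes at addr is in range only if it lies entirely
 * inside the mapped region [base, base + size). */
static int range_ok(size_t base, size_t size, size_t addr, size_t width)
{
    return addr >= base && width <= size && addr - base <= size - width;
}

/* Check one access made by the data access function named fn.
 * Out-of-range accesses produce a simplified warning on stderr; in
 * panic mode a real harness would stop the system here. */
static enum check_result check_access(enum check_mode mode, const char *fn,
                                      size_t base, size_t size,
                                      size_t addr, size_t width)
{
    assert(fn != NULL);
    if (range_ok(base, size, addr, width))
        return ACCESS_OK;
    fprintf(stderr, "%s out of range addr %zx not in [%zx, %zx)\n",
            fn, addr, base, base + size);
    return mode == CHECK_PANIC ? ACCESS_PANIC : ACCESS_WARNED;
}
```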

When the harness has been requested to insert over 1000 extra interrupts, the following message is printed if the driver does not detect interrupt jabber:

undetected interrupt jabber - %s %d

Using Scripts to Automate the Test Process

You can create fault-injection test scripts by using the logging access type of the th_define(1M) utility:

# th_define -n name -i instance -a log [-e fixup_script]

The th_define command takes the instance offline and brings it back online. Then th_define runs the workload that is described by the fixup_script and logs I/O accesses that are made by the driver instance.

The fixup_script is called twice with the set of optional arguments. The script is called once just before the instance is taken offline, and it is called again after the instance has been brought online.

The following variables are passed into the environment of the called executable:

DRIVER_PATH: Device path of the instance
DRIVER_INSTANCE: Instance number of the driver
DRIVER_UNCONFIGURE: Set to 1 when the instance is about to be taken offline
DRIVER_CONFIGURE: Set to 1 when the instance has just been brought online
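Because the fixup_script can be any executable, it need not be a shell script. The following C sketch shows only how such a program might decide which call it is handling from the environment variables above; the enum and function names are hypothetical, and the actual configuration and workload steps are left out.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum fixup_phase { PHASE_NONE, PHASE_UNCONFIGURE, PHASE_CONFIGURE };

/* Nonzero if the environment variable name is set to exactly "1". */
static int env_is_one(const char *name)
{
    assert(name != NULL);
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Decide which of the two calls this is. */
static enum fixup_phase fixup_phase(void)
{
    if (env_is_one("DRIVER_CONFIGURE"))
        return PHASE_CONFIGURE;     /* instance has just been brought online */
    if (env_is_one("DRIVER_UNCONFIGURE"))
        return PHASE_UNCONFIGURE;   /* instance is about to be taken offline */
    return PHASE_NONE;
}
```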

Typically, the fixup_script ensures that the device under test is in a suitable state to be taken offline (unconfigured) or in a suitable state for error injection (for example, configured, error free, and servicing a workload). The following script is a minimal script for a network driver:

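#!/bin/ksh
# Minimal fixup_script for a network driver. ifworkload stands for
# your own command that starts or stops the test workload.
driver=xyznetdrv
ifnum=$driver$DRIVER_INSTANCE

if [[ $DRIVER_CONFIGURE = 1 ]]; then
   # Instance has just been brought online: configure it, start the workload
   ifconfig $ifnum plumb
   ifconfig $ifnum ...
   ifworkload start $ifnum
elif [[ $DRIVER_UNCONFIGURE = 1 ]]; then
   # Instance is about to be taken offline: stop the workload, unconfigure it
   ifworkload stop $ifnum
   ifconfig $ifnum down
   ifconfig $ifnum unplumb
fi
exit $?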

Note - The ifworkload command should initiate the workload as a background task. The fault injection occurs after the fixup_script configures the driver under test and brings it online (DRIVER_CONFIGURE is set to 1).

If the -e fixup_script option is present, it must be the last option on the command line. If the -e option is not present, a default script is used. The default script repeatedly attempts to bring the device under test offline and online. Thus the workload consists of the driver's attach() and detach() paths.

The resulting log is converted into a set of executable scripts that are suitable for running unassisted fault-injection tests. These scripts are created in a subdirectory of the current directory with the name driver.test.id. The scripts inject faults, one at a time, into the driver while running the workload that is described by the fixup_script.

The driver tester has substantial control over the errdefs that are produced by the test automation process. See the th_define(1M) man page.

If the tester chooses a suitable range of workloads for the test scripts, the harness gives good coverage of the hardening aspects of the driver. However, to achieve full coverage, the tester might need to create additional test cases manually. Add these cases to the test scripts. To ensure that testing completes in a timely manner, you might need to manually delete duplicate test cases.
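Removing exact duplicates is the simplest part of that cleanup. The following C sketch sorts a hypothetical summary of each generated test case and drops identical entries; struct test_case and its integer encodings are assumptions rather than the format of the generated scripts, and deciding which non-identical cases are redundant still requires judgment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary of one generated test case: the errdef fields
 * that decide which access is faulted and how. */
struct test_case {
    int rnumber;
    unsigned long offset, len;
    int acc_type;             /* some encoding of pio_r, pio_w, ... */
    int op;                   /* some encoding of EQUAL, AND, OR, XOR, ... */
    unsigned long operand;
};

/* Order test cases field by field so identical ones become adjacent. */
static int cmp_case(const void *pa, const void *pb)
{
    const struct test_case *a = pa, *b = pb;
#define CMP(f) if (a->f != b->f) return a->f < b->f ? -1 : 1
    CMP(rnumber); CMP(offset); CMP(len); CMP(acc_type); CMP(op); CMP(operand);
#undef CMP
    return 0;
}

/* Sort the cases and drop exact duplicates in place; returns the new count. */
static size_t dedupe_cases(struct test_case *c, size_t n)
{
    assert(c != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(c, n, sizeof *c, cmp_case);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (cmp_case(&c[out - 1], &c[i]) != 0)
            c[out++] = c[i];
    return out;
}
```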

Automated Test Process

The following process describes automated testing:

Identify the aspects of the driver to be tested.
Test all aspects of the driver that interact with the hardware:
- Attach and detach
- Plumb and unplumb under a stack
- Normal data transfer
- Documented debug modes
A separate workload script (fixup_script) must be generated for each mode of use.
For each mode of use, prepare an executable program (fixup_script) that configures and unconfigures the device, and creates and terminates a workload.
Run the th_define(1M) command with the errdefs, together with an access type of -a log.
Wait for the logs to fill.
The logs contain a dump of the bofi driver's internal buffers. This data is included at the front of the script.
Because it can take from a few seconds to several minutes to create the logs, use the th_manage broadcast command to check the progress.
Change to the created test directory and run the master test script.
The master script runs each generated test script in sequence. Separate test scripts are generated per register set.
Store the results for analysis.
Successful test results, such as success (corruption reported) and success (corruption undetected), show that the driver under test is behaving properly. The results are reported as failure (no service impact reported) if the harness detects that the driver has failed to report the service impact after reporting a fault, or if the driver fails to detect that an access or DMA handle has been marked as faulted.
It is acceptable for a few “test not triggered” failures to appear in the output. However, many such failures indicate that the test is not working properly. These failures can appear when the driver does not access the same registers as it did when the test scripts were generated.
Run the test on multiple instances of the driver concurrently to test the multithreading of error paths.
For example, each th_define command creates a separate directory that contains test scripts and a master script:
```
# th_define -n xyznetdrv -i 0 -a log -e script
# th_define -n xyznetdrv -i 1 -a log -e script
```
After the directories are created, run the master scripts in parallel.

Note - The generated scripts produce only simulated fault injections that are based on what was logged during the time the logging errdef was active. When you define a workload, ensure that the required results are logged. Also analyze the resulting logs and fault-injection specifications, and verify that the resulting test scripts provide the hardware access coverage that you require.