openSolaris 2008 - UFS Parameters - Solaris Tunable Parameters Reference Manual

`bufhwm` and `bufhwm_pct`

Description

Defines the maximum amount of memory for caching I/O buffers. The buffers are used for writing file system metadata (superblocks, inodes, indirect blocks, and directories). Buffers are allocated as needed until the amount of memory (in Kbytes) to be allocated exceeds bufhwm. At this point, metadata is purged from the buffer cache until enough buffers are reclaimed to satisfy the request.

For historical reasons, bufhwm does not require the ufs: prefix.

Data Type

Signed integer

Default

2 percent of physical memory

Range

80 Kbytes to 20 percent of physical memory, or 2 TB, whichever is less. Consequently, bufhwm_pct can be between 1 and 20.

Units

bufhwm: Kbytes

bufhwm_pct: percent of physical memory

Dynamic?

No. bufhwm and bufhwm_pct are only evaluated at system initialization to compute hash bucket sizes. The limit in bytes calculated from these parameters is then stored in a data structure that adjusts this value as buffers are allocated and deallocated.

Attempting to adjust this value without following the locking protocol on a running system can lead to incorrect operation.

Modifying bufhwm or bufhwm_pct at runtime has no effect.

Validation

If bufhwm is less than its lower limit of 80 Kbytes or greater than its upper limit (the lesser of 20 percent of physical memory, 2 TB, or one quarter (1/4) of the maximum amount of kernel heap), it is reset to the upper limit. The following message appears on the system console and in the /var/adm/messages file if an invalid value is attempted:

"binit: bufhwm (value attempted) out of range 
(range start..range end). Using N as default."

“Value attempted” refers to the value specified in the /etc/system file or by using a kernel debugger. N is the value computed by the system based on available system memory.

Likewise, if bufhwm_pct is set to a value that is outside the allowed range of 1 percent to 20 percent, it is reset to the default of 2 percent. The following message then appears on the system console and in the /var/adm/messages file:

"binit: bufhwm_pct(value attempted) out of range(0..20).
       Using 2 as default."

If both bufhwm and bufhwm_pct are set to non-zero values, bufhwm takes precedence.
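The validation rules above can be summarized in a short sketch. This Python is an illustrative model only, not the kernel's code: the function name, its arguments (physical memory and maximum kernel heap, both in Kbytes), and the integer rounding are assumptions made for the example.

```python
MIN_BUFHWM_KB = 80              # documented lower limit
TWO_TB_KB = 2 * 1024 ** 3       # 2 TB expressed in Kbytes
DEFAULT_PCT = 2                 # documented default for bufhwm_pct


def effective_bufhwm_kb(physmem_kb, kernel_heap_kb, bufhwm=0, bufhwm_pct=0):
    """Model of how bufhwm and bufhwm_pct resolve to a limit in Kbytes.

    Hypothetical helper: a value of 0 means "not set".
    """
    # bufhwm_pct outside 1..20 falls back to the 2 percent default.
    pct = bufhwm_pct if bufhwm_pct != 0 else DEFAULT_PCT
    if not 1 <= pct <= 20:
        pct = DEFAULT_PCT

    # Upper limit: the lesser of 20 percent of physical memory, 2 TB,
    # and one quarter of the maximum kernel heap.
    upper_kb = min(physmem_kb * 20 // 100, TWO_TB_KB, kernel_heap_kb // 4)

    # A non-zero bufhwm takes precedence over bufhwm_pct.
    value_kb = bufhwm if bufhwm != 0 else physmem_kb * pct // 100

    # Out-of-range values are reset to the upper limit.
    if value_kb < MIN_BUFHWM_KB or value_kb > upper_kb:
        value_kb = upper_kb
    return value_kb
```

For example, calling `effective_bufhwm_kb(524288, 4194304, bufhwm=20000, bufhwm_pct=10)` returns the bufhwm value unchanged, because bufhwm is in range and overrides bufhwm_pct.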

When to Change

Because buffers are only allocated as they are needed, the overhead from the default setting is the required allocation of control structures for the buffer hash headers. These structures consume 52 bytes per potential buffer on a 32-bit kernel and 96 bytes per potential buffer on a 64-bit kernel.

On a 512-Mbyte 64-bit kernel, the number of hash chains calculates to 10316 / 32 == 322, which scales up to next power of 2, 512. Therefore, the hash headers consume 512 x 96 bytes, or 48 Kbytes. The hash header allocations assume that buffers are 32 Kbytes.
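The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes that a chain count that is already a power of 2 is left unchanged.

```python
def hash_header_bytes(bufhwm_kb, header_bytes=96, assumed_buffer_kb=32):
    """Estimate hash header overhead for a given bufhwm (in Kbytes).

    header_bytes is 96 on a 64-bit kernel and 52 on a 32-bit kernel.
    """
    # Number of hash chains, assuming 32-Kbyte buffers.
    chains = max(bufhwm_kb // assumed_buffer_kb, 1)
    # Scale up to the next power of 2 (assumption: exact powers stay as is).
    chains_pow2 = 1 << (chains - 1).bit_length()
    return chains, chains_pow2, chains_pow2 * header_bytes


chains, scaled, total = hash_header_bytes(10316)
print(chains, scaled, total)   # 322 512 49152
```

The result of 49152 bytes matches the 48 Kbytes quoted above for the 512-Mbyte 64-bit system.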

The amount of memory that has not been allocated in the buffer pool can be found by looking at the bfreelist structure in the kernel with a kernel debugger. The field of interest in the structure is b_bufsize, which is the possible remaining memory in bytes. The following example examines this field with the buf macro by using the mdb command:

# mdb -k
Loading modules: [ unix krtld genunix ip nfs ipc ]
> bfreelist::print "struct buf" b_bufsize
b_bufsize = 0x225800

The default value for bufhwm on this system, with 6 Gbytes of memory, is 122277. You cannot determine the number of header structures used because the actual buffer size requested is usually larger than 1 Kbyte. However, some space might be profitably reclaimed from control structure allocation for this system.

The same structure on a 512-Mbyte system shows that only 4 Kbytes of 10144 Kbytes has not been allocated. Examining the biostats kstat with kstat -n biostats shows that the system also had a reasonable ratio of buffer_cache_hits to buffer_cache_lookups. As such, the default setting is reasonable for that system.
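The two checks described here, the unallocated share of the buffer pool and the buffer cache hit ratio, are simple ratios. The following Python sketch shows one way to compute them from values copied by hand out of mdb and kstat. The function names are hypothetical, and the sketch does not read kernel data itself.

```python
def unallocated_share(b_bufsize_bytes, bufhwm_kb):
    """Return (unallocated Kbytes, fraction of bufhwm still unallocated).

    b_bufsize_bytes is the b_bufsize field of bfreelist, in bytes.
    bufhwm_kb is the effective bufhwm value, in Kbytes.
    """
    free_kb = b_bufsize_bytes / 1024
    return free_kb, free_kb / bufhwm_kb


def hit_ratio(buffer_cache_hits, buffer_cache_lookups):
    """Ratio of buffer_cache_hits to buffer_cache_lookups from biostats."""
    if buffer_cache_lookups == 0:
        return 0.0
    return buffer_cache_hits / buffer_cache_lookups


# Values from the 6-Gbyte example: b_bufsize = 0x225800, bufhwm = 122277.
free_kb, share = unallocated_share(0x225800, 122277)
print(f"{free_kb:.0f} Kbytes unallocated ({share:.2%} of bufhwm)")
```

What counts as a reasonable hit ratio depends on the workload, so the sketch reports the ratio without applying a threshold.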

Commitment Level

Unstable

Change History

For information, see bufhwm (Solaris 9 Releases).

`ndquot`

Description

Defines the number of quota structures for the UFS file system that should be allocated. Relevant only if quotas are enabled on one or more UFS file systems. For historical reasons, the ufs: prefix is not needed.

Data Type

Signed integer

Default

((maxusers x 40) / 4) + max_nprocs

Range

0 to MAXINT

Units

Quota structures

Dynamic?

Validation

None. Excessively large values hang the system.

When to Change

When the default number of quota structures is not enough. This situation is indicated by the following message displayed on the console or written in the message log:

dquot table full

Commitment Level

Unstable

`ufs_ninode`

Description

Specifies the number of inodes to be held in memory. Inodes are cached globally for UFS, not on a per-file system basis.

A key parameter in this situation is ufs_ninode. This parameter is used to compute two key limits that affect the handling of inode caching. A high watermark of ufs_ninode / 2 and a low watermark of ufs_ninode / 4 are computed.

When the system is done with an inode, one of two things can happen:

The file referred to by the inode is no longer on the system so the inode is deleted. After it is deleted, the space goes back into the inode cache for use by another inode (which is read from disk or created for a new file).
The file still exists but is no longer referenced by a running process. The inode is then placed on the idle queue. Any referenced pages are still in memory.

When inodes are idled, the kernel defers the idling process to a later time. If a file system is a logging file system, the kernel also defers deletion of inodes. Two kernel threads handle this deferred processing. Each thread is responsible for one of the queues.

When the deferred processing is done, the system drops the inode onto either a delete queue or an idle queue, each of which has a thread that can run to process it. When the inode is placed on the queue, the queue occupancy is checked against the low watermark. If the queue occupancy exceeds the low watermark, the thread associated with the queue is awakened. After the thread is awakened, it runs through the queue, forces any pages associated with each inode out to disk, and frees the inode. The thread stops when it has removed 50 percent of the inodes that were on the queue at the time it was awakened.

A second mechanism is in place if the idle thread is unable to keep up with the load. When the system needs to find a vnode, it goes through the ufs_vget routine. The first thing vget does is check the length of the idle queue. If the length is above the high watermark, then it takes two inodes off the idle queue and “idles” them (flushes pages and frees inodes). vget does this before it gets an inode for its own use.

The system does attempt to optimize by placing inodes with no in-core pages at the head of the idle list and inodes with pages at the end of the idle list. However, the system does no other ordering of the list. Inodes are always removed from the front of the idle queue.

The only time that inodes are removed from the queues as a whole is when a synchronization, unmount, or remount occurs.
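The watermark and idle queue behavior described above can be modeled in a few lines. The following Python is a simplified, single-threaded sketch, not the UFS implementation: the class and method names are hypothetical, the background thread is modeled as a synchronous call, and the `thread_keeps_up` flag is an assumption added to demonstrate the vget fallback path.

```python
from collections import deque


class IdleQueueSketch:
    """Toy model of the UFS inode idle queue and its two watermarks."""

    def __init__(self, ufs_ninode, thread_keeps_up=True):
        self.high_watermark = ufs_ninode // 2
        self.low_watermark = ufs_ninode // 4
        self.thread_keeps_up = thread_keeps_up
        self.queue = deque()
        self.thread_idles = 0
        self.vget_idles = 0

    def idle(self, inode, has_pages):
        # Inodes with no in-core pages go to the head, others to the end.
        if has_pages:
            self.queue.append(inode)
        else:
            self.queue.appendleft(inode)
        # Occupancy above the low watermark wakes the idle thread.
        if self.thread_keeps_up and len(self.queue) > self.low_watermark:
            self._run_idle_thread()

    def _run_idle_thread(self):
        # The thread removes 50 percent of the inodes present at wake-up.
        for _ in range(len(self.queue) // 2):
            self.queue.popleft()
            self.thread_idles += 1

    def vget(self):
        # Above the high watermark, the caller idles two inodes itself
        # before getting an inode for its own use.
        if len(self.queue) > self.high_watermark:
            for _ in range(min(2, len(self.queue))):
                self.queue.popleft()
                self.vget_idles += 1
```

With `ufs_ninode=16`, the low watermark is 4 and the high watermark is 8, so the fifth idled inode wakes the thread, which then frees two inodes from the front of the queue.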

For historical reasons, this parameter does not require the ufs: prefix.

Data Type

Signed integer

Default

ncsize

Range

0 to MAXINT

Units

Inodes

Dynamic?

Yes

Validation

If ufs_ninode is less than or equal to zero, the value is set to ncsize.

When to Change

When the default number of inodes is not enough. If the maxsize reached field as reported by kstat -n inode_cache is larger than the maxsize field in the kstat, the value of ufs_ninode might be too small. Excessive inode idling can also be a problem.

You can identify excessive inode idling by using kstat -n inode_cache to look at the inode_cache kstat. Thread idles are inodes idled by the background threads, while vget idles are inodes idled by the requesting process before it uses an inode.
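The checks described above can be expressed as a small helper. The following Python sketch assumes that the four inode_cache fields named in this section have been copied by hand into a dictionary. The function name and the sample numbers are hypothetical, and no threshold for "excessive" idling is implied.

```python
def review_inode_cache(stats):
    """Summarize inode_cache kstat fields relevant to ufs_ninode tuning."""
    thread_idles = stats["thread idles"]
    vget_idles = stats["vget idles"]
    total_idles = thread_idles + vget_idles
    return {
        # maxsize reached larger than maxsize suggests ufs_ninode
        # might be too small.
        "ufs_ninode_may_be_too_small":
            stats["maxsize reached"] > stats["maxsize"],
        # Share of idles done by requesting processes rather than by
        # the background threads.
        "vget_idle_share":
            vget_idles / total_idles if total_idles else 0.0,
    }


# Made-up sample values, for illustration only.
sample = {"maxsize": 1000, "maxsize reached": 1500,
          "thread idles": 300, "vget idles": 100}
print(review_inode_cache(sample))
```

A growing share of vget idles indicates that requesting processes are doing work that the background threads would otherwise handle.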

Commitment Level

Unstable

`ufs_WRITES`

Description: If ufs_WRITES is non-zero, the number of bytes outstanding for writes on a file is checked. That number is compared against ufs_HW to determine whether the write should be issued or deferred until only ufs_LW bytes are outstanding. The total number of bytes outstanding is tracked on a per-file basis so that if the limit is passed for one file, writes to other files are not affected.
Data Type: Signed integer
Default: 1 (enabled)
Range: 0 (disabled) or 1 (enabled)
Units: Toggle (on/off)
Dynamic?: Yes
Validation: None
When to Change: When you want UFS write throttling turned off entirely. If sufficient I/O capacity does not exist, disabling this parameter can result in long service queues for disks.
Commitment Level: Unstable

`ufs_LW` and `ufs_HW`

Description

ufs_HW specifies the barrier value for the number of bytes outstanding on a single file. If the number of bytes outstanding is greater than this value and ufs_WRITES is set, then the write is deferred. The write is deferred by putting the thread issuing the write to sleep on a condition variable.

ufs_LW is the barrier for the number of bytes outstanding on a single file below which the condition variable on which other processes are sleeping is toggled. When a write completes and the number of bytes outstanding is less than ufs_LW, the condition variable is toggled, which causes all threads waiting on the variable to awaken and try to issue their writes.
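The interaction between the two barriers can be sketched with a condition variable. The following Python is a user-level model of the behavior described above, not the kernel code: the class and method names are hypothetical, and one instance stands for the per-file state.

```python
import threading


class FileWriteThrottle:
    """Toy model of per-file UFS write throttling."""

    def __init__(self, ufs_lw=8 * 1024 * 1024, ufs_hw=16 * 1024 * 1024,
                 ufs_writes=1):
        self.ufs_lw = ufs_lw
        self.ufs_hw = ufs_hw
        self.ufs_writes = ufs_writes
        self.outstanding = 0                 # bytes outstanding on this file
        self.cond = threading.Condition()

    def start_write(self, nbytes):
        with self.cond:
            if self.ufs_writes:
                # Defer the write while more than ufs_HW bytes are
                # outstanding by sleeping on the condition variable.
                while self.outstanding > self.ufs_hw:
                    self.cond.wait()
            self.outstanding += nbytes

    def write_done(self, nbytes):
        with self.cond:
            self.outstanding -= nbytes
            # Below ufs_LW, wake every waiter so each can retry its write.
            if self.outstanding < self.ufs_lw:
                self.cond.notify_all()
```

Because every waiter is awakened at once and each rechecks the ufs_HW condition, the distance between ufs_LW and ufs_HW determines how many awakened threads can actually proceed.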

Data Type

Signed integer

Default

8 x 1024 x 1024 for ufs_LW and 16 x 1024 x 1024 for ufs_HW

Range

0 to MAXINT

Units

Bytes

Dynamic?

Yes

Validation

None

Implicit

ufs_LW and ufs_HW have meaning only if ufs_WRITES is not equal to zero. ufs_HW and ufs_LW should be changed together to avoid needless churning when processes awaken and find that either they cannot issue a write (when ufs_LW and ufs_HW are too close) or they might have waited longer than necessary (when ufs_LW and ufs_HW are too far apart).

When to Change

Consider changing these values when file systems consist of striped volumes. The aggregate bandwidth available can easily exceed the current value of ufs_HW. Unfortunately, this parameter is not a per-file system setting.

You might also consider changing this parameter when ufs_throttles is a non-trivial number. Currently, ufs_throttles can only be accessed with a kernel debugger.

Commitment Level

Unstable

`freebehind`

Description: Enables the freebehind algorithm. When this algorithm is enabled, the system bypasses the file system cache on newly read blocks when sequential I/O is detected during times of heavy memory use.
Data Type: Boolean
Default: 1 (enabled)
Range: 0 (disabled) or 1 (enabled)
Dynamic?: Yes
Validation: None
When to Change: The freebehind algorithm can be triggered too easily. If no significant sequential file system activity is expected, disabling freebehind makes sure that all files, no matter how large, will be candidates for retention in the file system page cache. For finer-grained tuning, see smallfile.
Commitment Level: Unstable

`smallfile`

Description

Determines the size threshold above which files are candidates for no cache retention under the freebehind algorithm.

Large memory systems contain enough memory to cache thousands of 10-Mbyte files without making severe memory demands. However, this situation is highly application dependent.

The goal of the smallfile and freebehind parameters is to reuse cached information, without causing memory shortfalls by caching too much.
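How freebehind and smallfile combine can be shown as a single predicate. The following Python sketch follows only the conditions stated in this section. The function name and the two boolean inputs (whether sequential I/O was detected and whether memory use is heavy) are assumptions, and the kernel's actual test may differ in detail.

```python
def bypasses_cache(file_size, sequential_io, heavy_memory_use,
                   freebehind=1, smallfile=32768):
    """Return True if newly read blocks would not be retained in the cache."""
    if not freebehind:
        # With freebehind disabled, every file is a candidate for retention.
        return False
    # Only files larger than smallfile are candidates for no retention,
    # and only for sequential I/O during times of heavy memory use.
    return file_size > smallfile and sequential_io and heavy_memory_use


# A 10-Mbyte file read sequentially under memory pressure:
print(bypasses_cache(10 * 1024 * 1024, True, True))                       # True
# The same read after raising smallfile above the file size:
print(bypasses_cache(10 * 1024 * 1024, True, True, smallfile=16 * 1024 * 1024))  # False
```

Raising smallfile narrows the set of files that freebehind affects, while setting freebehind to 0 removes the behavior for all files.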

Data Type

Signed integer

Default

32,768

Range

0 to 2,147,483,647

Dynamic?

Yes

Validation

None

When to Change

Increase smallfile if an application does sequential reads on medium-sized files and can most likely benefit from buffering, and the system is not otherwise under pressure for free memory. Medium-sized files are 32 Kbytes to 2 Gbytes in size.

Commitment Level

Unstable
