NOTE: CentOS Enterprise Linux is built from the Red Hat Enterprise Linux source code. Other than logo and name changes, CentOS Enterprise Linux is compatible with the equivalent Red Hat version. This document applies equally to both Red Hat and CentOS Enterprise Linux.
As stated earlier, the resources present in every system are CPU
power, bandwidth, memory, and storage. At first glance, it would
seem that monitoring need only consist of examining these four
things.
Unfortunately, it is not that simple. For example, consider a
disk drive. What things might you want to know about its
performance?
-
How much free space is available?
-
How many I/O operations on average does it perform each
second?
-
How long on average does it take each I/O operation to be
completed?
-
How many of those I/O operations are reads? How many are
writes?
-
What is the average amount of data read/written with each
I/O?
There are more ways of studying disk drive performance; these
points have only scratched the surface. The main concept to keep in
mind is that there are many different types of data for each
resource.
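Several of the disk questions above can be answered from two
readings of cumulative I/O counters taken some seconds apart. The
following is a minimal sketch, assuming a Linux system that exposes
those counters in /proc/diskstats with the usual field order and
512-byte sector units; the names DiskSample, parse_diskstats, and
disk_metrics are hypothetical and chosen only for illustration. Free
space is a separate measurement and is not covered here.

```python
from dataclasses import dataclass

# Assumption: /proc/diskstats reports sectors in 512-byte units.
SECTOR_BYTES = 512


@dataclass
class DiskSample:
    reads: int            # reads completed since boot
    sectors_read: int
    read_ms: int          # milliseconds spent reading
    writes: int           # writes completed since boot
    sectors_written: int
    write_ms: int         # milliseconds spent writing


def parse_diskstats(text: str, device: str) -> DiskSample:
    """Pick one device's counters out of /proc/diskstats-style text."""
    for line in text.splitlines():
        f = line.split()
        # Fields: major minor name reads merged sectors ms writes ...
        if len(f) >= 11 and f[2] == device:
            return DiskSample(int(f[3]), int(f[5]), int(f[6]),
                              int(f[7]), int(f[9]), int(f[10]))
    raise KeyError(device)


def disk_metrics(old: DiskSample, new: DiskSample, seconds: float) -> dict:
    """Turn two cumulative samples into per-interval figures."""
    reads = new.reads - old.reads
    writes = new.writes - old.writes
    ios = reads + writes
    busy_ms = (new.read_ms - old.read_ms) + (new.write_ms - old.write_ms)
    sectors = ((new.sectors_read - old.sectors_read)
               + (new.sectors_written - old.sectors_written))
    return {
        "iops": ios / seconds,
        "avg_ms_per_io": busy_ms / ios if ios else 0.0,
        "read_fraction": reads / ios if ios else 0.0,
        "avg_bytes_per_io": sectors * SECTOR_BYTES / ios if ios else 0.0,
    }
```

On a live system the text would come from reading /proc/diskstats
twice, with a pause between the two reads, and passing the elapsed
time as seconds.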
The following sections explore the types of utilization
information that would be helpful for each of the major resource
types.
In its most basic form, monitoring CPU power can be no more
difficult than determining if CPU utilization ever reaches 100%. If
CPU utilization stays below 100%, no matter what the system is
doing, there is additional processing power available for more
work.
However, it is a rare system that does not reach 100% CPU
utilization at least some of the time. At that point it is
important to examine more detailed CPU utilization data. By doing
so, it becomes possible to start determining where the majority of
your processing power is being consumed. Here are some of the more
popular CPU utilization statistics:
- User Versus System
-
The percentage of time spent performing user-level processing
versus system-level processing can point out whether a system's
load is primarily due to running applications or due to operating
system overhead. High user-level percentages tend to be good
(assuming users are not experiencing unsatisfactory performance),
while high system-level percentages tend to point toward problems
that will require further investigation.
- Context Switches
-
A context switch happens when the CPU stops running one process
and starts running another. Because each context switch requires
the operating system to take control of the CPU, excessive context
switches and high levels of system-level CPU consumption tend to go
together.
- Interrupts
-
As the name implies, interrupts are situations where the
processing being performed by the CPU is abruptly changed.
Interrupts generally occur due to hardware activity (such as an I/O
device completing an I/O operation) or due to software (such as
software interrupts that control application processing). Because
interrupts must be serviced at a system level, high interrupt rates
lead to higher system-level CPU consumption.
- Runnable Processes
-
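A process may be in different states. For example, it may be
waiting for an I/O operation to complete, or waiting for the memory
management subsystem to handle a page fault.
In these cases, the process has no need for the CPU.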
However, eventually the process state changes, and the process
becomes runnable. As the name implies, a runnable process is one
that is capable of getting work done as soon as it is scheduled to
receive CPU time. However, if more than one process is runnable at
any given time, all but one of the
runnable processes must wait for their turn at the CPU. By
monitoring the number of runnable processes, it is possible to
determine how CPU-bound your system is.
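The four statistics above can all be derived from one source on
Linux. The following is a minimal sketch, assuming the /proc/stat
layout (an aggregate "cpu" line in clock ticks, plus "intr", "ctxt",
and "procs_running" lines); the function names are hypothetical.
Note that procs_running counts processes that are running as well as
those waiting for a CPU, and that time spent servicing interrupts is
reported in its own fields and is not folded into the system
percentage here.

```python
def parse_proc_stat(text: str) -> dict:
    """Extract a few cumulative counters from /proc/stat-style text."""
    out = {}
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "cpu":
            # Aggregate CPU line. Only the first eight fields are summed;
            # the later guest fields are already counted in user and nice.
            ticks = [int(x) for x in f[1:9]]
            out["user"] = ticks[0] + ticks[1]    # user + nice
            out["system"] = ticks[2]
            out["total"] = sum(ticks)
        elif f[0] == "intr":
            out["interrupts"] = int(f[1])        # first value is the total
        elif f[0] == "ctxt":
            out["context_switches"] = int(f[1])
        elif f[0] == "procs_running":
            out["runnable"] = int(f[1])          # a gauge, not a counter
    return out


def cpu_summary(old: dict, new: dict, seconds: float) -> dict:
    """Compare two samples taken `seconds` apart."""
    ticks = new["total"] - old["total"]

    def pct(key: str) -> float:
        return 100.0 * (new[key] - old[key]) / ticks if ticks else 0.0

    def rate(key: str) -> float:
        return (new[key] - old[key]) / seconds

    return {
        "user_pct": pct("user"),
        "system_pct": pct("system"),
        "context_switches_per_s": rate("context_switches"),
        "interrupts_per_s": rate("interrupts"),
        "runnable": new["runnable"],
    }
```

A high system_pct together with a high context switch or interrupt
rate is the pattern the descriptions above suggest looking for.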
Other performance metrics that reflect an impact on CPU
utilization tend to include different services the operating system
provides to processes. They may include statistics on memory
management, I/O processing, and so on. These statistics also reveal
that, when system performance is monitored, there are no boundaries
between the different statistics. In other words, CPU utilization
statistics may end up pointing to a problem in the I/O subsystem,
or memory utilization statistics may reveal an application design
flaw.
Therefore, when monitoring system performance, it is not
possible to examine any one statistic in complete isolation; only
by examining the overall picture is it possible to extract
meaningful information from any performance statistics you
gather.
Monitoring bandwidth is more difficult than monitoring the other
resources described here. This is because performance statistics
tend to be device-based, while most of the places where bandwidth
is important tend to be the buses that connect devices. In those
instances where more than one device shares a common bus, you might
see reasonable statistics for each device, but the aggregate load
those devices place on the bus would be much greater.
Another challenge to monitoring bandwidth is that there can be
circumstances where statistics for the devices themselves may not
be available. This is particularly true for system expansion buses
and datapaths. However, even though 100% accurate
bandwidth-related statistics may not always be available, there is
often enough information to make some level of analysis possible,
particularly when related statistics are taken into account.
Some of the more common bandwidth-related statistics are:
- Bytes received/sent
-
Network interface statistics provide an indication of the
bandwidth utilization of one of the more visible buses: the
network.
- Interface counts and rates
-
These network-related statistics can give indications of
excessive collisions, transmit and receive errors, and more.
Through the use of these statistics (particularly if the statistics
are available for more than one system on your network), it is
possible to perform a modicum of network troubleshooting even
before the more common network diagnostic tools are used.
- Transfers per Second
-
Normally collected for block I/O devices, such as disk and
high-performance tape drives, this statistic is a good way of
determining whether a particular device's bandwidth limit is being
reached. Due to their electromechanical nature, disk and tape
drives can only perform so many I/O operations every second; their
performance degrades rapidly as this limit is reached.
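The network-related statistics above can be sketched as follows,
assuming the Linux /proc/net/dev layout (two header lines, then one
line per interface with eight receive fields followed by eight
transmit fields). The function names and the dictionary keys are
hypothetical. The sketch reports receive and transmit rates
separately and leaves any comparison against the link speed to the
reader.

```python
def parse_net_dev(text: str) -> dict:
    """Return per-interface counters from /proc/net/dev-style text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        f = [int(x) for x in data.split()]
        if len(f) < 16:
            continue
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_errs": f[2],
            "tx_bytes": f[8], "tx_errs": f[10],
            "collisions": f[13],
        }
    return stats


def interface_rates(old: dict, new: dict, seconds: float) -> dict:
    """Compare two samples of one interface taken `seconds` apart."""
    return {
        "rx_bits_per_s": 8 * (new["rx_bytes"] - old["rx_bytes"]) / seconds,
        "tx_bits_per_s": 8 * (new["tx_bytes"] - old["tx_bytes"]) / seconds,
        "new_errors": (new["rx_errs"] - old["rx_errs"])
                      + (new["tx_errs"] - old["tx_errs"]),
        "new_collisions": new["collisions"] - old["collisions"],
    }
```

Collecting the same figures from more than one system on the network
makes it easier to tell a single faulty interface from a shared
problem.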
If there is one area where a wealth of performance statistics
can be found, it is in the area of monitoring memory utilization.
Due to the inherent complexity of today's demand-paged virtual
memory operating systems, memory utilization statistics are many
and varied. It is here that the majority of a system
administrator's work with resource management takes place.
The following statistics represent a cursory overview of
commonly-found memory management statistics:
- Page Ins/Page Outs
-
These statistics make it possible to gauge the flow of pages
from system memory to attached mass storage devices (usually disk
drives). High rates for both of these statistics can mean that the
system is short of physical memory and is thrashing, or spending more system resources on
moving pages into and out of memory than on actually running
applications.
- Active/Inactive Pages
-
These statistics show how heavily memory-resident pages are
used. A lack of inactive pages can point toward a shortage of
physical memory.
- Free, Shared, Buffered, and Cached Pages
-
These statistics provide additional detail over the more
simplistic active/inactive page statistics. By using these
statistics, it is possible to determine the overall mix of memory
utilization.
- Swap Ins/Swap Outs
-
These statistics show the system's overall swapping behavior.
Excessive rates here can point to physical memory shortages.
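As an illustration of the statistics above, the following minimal
sketch assumes a Linux system, where paging and swapping counters
appear in /proc/vmstat (pgpgin, pgpgout, pswpin, pswpout) and the
memory mix appears in /proc/meminfo (MemTotal, MemFree, Buffers,
Cached, Active, Inactive). The function names are hypothetical, and
the rates are expressed in whatever units the kernel uses for each
counter.

```python
PAGING_KEYS = ("pgpgin", "pgpgout", "pswpin", "pswpout")
MIX_KEYS = ("MemFree", "Buffers", "Cached", "Active", "Inactive")


def parse_counters(text: str) -> dict:
    """Parse 'name value' or 'name: value kB' lines into a dict."""
    out = {}
    for line in text.splitlines():
        f = line.replace(":", " ").split()
        if len(f) >= 2 and f[1].isdigit():
            out[f[0]] = int(f[1])
    return out


def paging_rates(old: dict, new: dict, seconds: float) -> dict:
    """Per-second change of the paging and swapping counters."""
    return {k: (new[k] - old[k]) / seconds for k in PAGING_KEYS}


def memory_mix(meminfo: dict) -> dict:
    """Each category as a percentage of total physical memory."""
    total = meminfo["MemTotal"]
    return {k: 100.0 * meminfo[k] / total for k in MIX_KEYS}
```

Sustained high values from paging_rates, combined with a small
Inactive share in memory_mix, match the shortage symptoms described
above.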
Successfully monitoring memory utilization requires a good
understanding of how demand-paged virtual memory operating systems
work. While such a subject alone could take up an entire book, the
basic concepts are discussed in Chapter 4, Physical and Virtual
Memory. This chapter, along with time spent actually monitoring a
system, gives you the necessary building blocks to learn more about
this subject.
Monitoring storage normally takes place at two different
levels: monitoring for sufficient disk space, and monitoring for
storage-related performance problems.
The reason for this is that it is possible to have dire problems
in one area and no problems whatsoever in the other. For example,
it is possible to cause a disk drive to run out of disk space
without once causing any kind of performance-related problems.
Likewise, it is possible to have a disk drive that has 99% free
space, yet is being pushed past its limits in terms of
performance.
However, it is more likely that the average system experiences
varying degrees of resource shortages in both areas. Because of
this, it is also likely that, to some extent, problems in one area
impact the other. Most often this type of interaction takes the
form of poorer and poorer I/O performance as a disk drive nears 0%
free space. In cases of extreme I/O loads, however, it might also be
possible to slow I/O throughput to such a level that applications
no longer run properly.
In any case, the following statistics are useful for monitoring
storage:
- Free Space
-
Free space is probably the one resource all system
administrators watch closely; it would be a rare administrator who
never checks on free space (or has some automated way of doing
so).
- File System-Related Statistics
-
These statistics (such as number of files/directories, average
file size, etc.) provide additional detail over a single free space
percentage. As such, these statistics make it possible for system
administrators to configure the system to give the best
performance, as the I/O load imposed by a file system full of many
small files is not the same as that imposed by a file system filled
with a single massive file.
- Transfers per Second
-
This statistic is a good way of determining whether a particular
device's bandwidth limitations are being reached.
- Reads/Writes per Second
-
A slightly more detailed breakdown of transfers per second,
these statistics allow the system administrator to more fully
understand the nature of the I/O loads a storage device is
experiencing. This can be critical, as some storage technologies
have widely different performance characteristics for read versus
write operations.
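Free space and the simpler file system-related statistics can be
gathered in one call. The following is a minimal sketch, assuming a
Unix-like system where Python's os.statvfs is available; the
function names summarize and filesystem_report are hypothetical. The
average size is only a rough estimate (used bytes divided by inodes
in use), and some file systems report zero for the inode totals, in
which case the sketch returns None for the inode-based figures.

```python
import os


def summarize(blocks: int, bfree: int, bavail: int,
              files: int, ffree: int, frsize: int) -> dict:
    """Derive free-space and file-count figures from statvfs values."""
    used_bytes = (blocks - bfree) * frsize
    inodes_used = files - ffree if files else None
    return {
        # Space available to unprivileged users, as a percentage.
        "free_pct": 100.0 * bavail / blocks if blocks else 0.0,
        "inodes_used": inodes_used,
        # Rough average only: includes directories and metadata.
        "avg_bytes_per_inode": (used_bytes / inodes_used
                                if inodes_used else None),
    }


def filesystem_report(path: str) -> dict:
    """Summarize the file system that holds `path`."""
    st = os.statvfs(path)
    return summarize(st.f_blocks, st.f_bfree, st.f_bavail,
                     st.f_files, st.f_ffree, st.f_frsize)
```

Calling filesystem_report("/") from a scheduled job is one possible
way to automate the free space check mentioned above; a small
avg_bytes_per_inode hints at a file system full of many small files.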