NOTE: CentOS Enterprise Linux is built from the Red Hat Enterprise Linux source code. Other than logo and name changes, CentOS Enterprise Linux is compatible with the equivalent Red Hat version. This document applies equally to both Red Hat and CentOS Enterprise Linux.
Chapter 8. Planning for Disaster
Disaster planning is a subject that is easy for a system
administrator to forget — it is not pleasant, and it always
seems that there is something else more pressing to do. However,
letting disaster planning slide is one of the worst things a system
administrator can do.
Although it is often the dramatic disasters (such as a fire,
flood, or storm) that first come to mind, the more mundane problems
(such as construction workers cutting cables or even an overflowing
sink) can be just as disruptive. Therefore, the definition of a
disaster that a system administrator should keep in mind is any
unplanned event that disrupts the normal operation of the
organization.
While it would be impossible to list all the different types of
disasters that could strike, this section examines the leading
factors that are part of each type of disaster so that any possible
exposure can be examined not in terms of its likelihood, but in
terms of the factors that could lead to disaster.
In general, there are four different factors that can trigger a
disaster. These factors are:
Hardware failures
Software failures
Environmental failures
Human errors
Hardware failures are easy to understand — the hardware
fails, and work grinds to a halt. What is more difficult to
understand is the nature of the failures and how your exposure to
them can be minimized. Here are some approaches that you can
use:
At its simplest, exposure due to hardware failures can be
reduced by having spare hardware available. Of course, this
approach assumes two things:
Someone on-site has the necessary skills to diagnose the
problem, identify the failing hardware, and replace it.
A replacement for the failing hardware is available.
These issues are covered in more detail in the following
sections.
Depending on your past experience and the hardware involved,
having the necessary skills might be a non-issue. However, if you
have not worked with hardware before, you might consider looking
into local community colleges for introductory courses on PC
repair. While such a course is not in and of itself sufficient to
prepare you for tackling problems with an enterprise-level server,
it is a good way to learn the basics (proper handling of tools and
components, basic diagnostic procedures, and so on).
Before taking the approach of first fixing it yourself, make
sure that the hardware in question:
Is not still under warranty
Is not covered by a service or maintenance contract
If you attempt repairs on hardware that is covered by a warranty
and/or service contract, you are likely violating the terms of
these agreements and jeopardizing your continued coverage.
However, even with minimal skills, it might be possible to
effectively diagnose and replace failing hardware — if you
choose your stock of replacement hardware properly.
The question of what to stock illustrates the multi-faceted
nature of anything related to disaster recovery. When considering
what hardware to stock, here are some of the issues you should
keep in mind:
Maximum allowable downtime
The skill required to make the repair
Budget available for spares
Storage space required for spares
Other hardware that could utilize the same spares
Each of these issues has a bearing on the types of spares that
should be stocked. For example, stocking complete systems would
tend to minimize downtime and require minimal skills to install but
would be much more expensive than having a spare CPU and RAM module
on a shelf. However, this expense might be worthwhile if your
organization has several dozen identical servers that could benefit
from a single spare system.
No matter what the final decision, one question inevitably
follows: how much of the chosen spare hardware should be stocked?
The question of spare stock levels is also multi-faceted. Here
the main issues are:
Maximum allowable downtime
Projected rate of failure
Estimated time to replenish stock
Budget available for spares
Storage space required for spares
Other hardware that could utilize the same spares
At one extreme, for a system that can afford to be down a
maximum of two days, and a spare that might be used once a year and
could be replenished in a day, it would make sense to carry only
one spare (and maybe even none, if you were confident of your
ability to secure a spare within 24 hours).
At the other end of the spectrum, a system that can afford to be
down no more than a few minutes, and a spare that might be used
once a month (and could take several weeks to replenish) might mean
that a half dozen spares (or more) should be on the shelf.
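These two examples can be generalized with a little arithmetic.
The following Python sketch is a minimal illustration, assuming
failures arrive independently at a steady average rate (a Poisson
process); the function name, the risk thresholds, and the modeling
approach itself are illustrative assumptions rather than an
established inventory-management standard.

    import math

    def spares_to_stock(failures_per_year, replenish_days, stockout_risk):
        """Smallest shelf count s such that the chance of more than s
        failures during one replenishment window stays below
        stockout_risk, assuming failures follow a Poisson process."""
        lam = failures_per_year * replenish_days / 365.0
        s, cumulative = 0, math.exp(-lam)            # P(0 failures)
        while 1.0 - cumulative > stockout_risk:
            s += 1
            cumulative += math.exp(-lam) * lam ** s / math.factorial(s)
        return s

    # Down two days at most, one failure a year, one-day replenishment:
    print(spares_to_stock(1, 1, 0.05))        # -> 0 (maybe even none)
    # Monthly failures, three-week replenishment, near-zero tolerance:
    print(spares_to_stock(12, 21, 0.0001))    # -> 5, about half a dozen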
When is a spare not a spare? When it is hardware that is in
day-to-day use but is also available to serve as a spare for a
higher-priority system should the need arise. This approach has
the obvious benefit that the spare hardware is in regular use and
is therefore known to be operational.
There are, however, downsides to this approach:
Normal production on the lower-priority system is interrupted
While it serves as a spare, there is no spare left should the
lower-priority hardware itself fail
Given these constraints, the use of another production system as
a spare may work, but the success of this approach hinges on the
system's specific workload and the impact the system's absence has
on overall data center operations.
Service contracts make the issue of hardware failures someone
else's problem. All that is necessary for you to do is to confirm
that a failure has, in fact, occurred and that it does not appear
to have a software-related cause. You then make a telephone call,
and someone shows up to make things right again.
It seems so simple. But as with most things in life, there is
more to it than meets the eye. Here are some things that you must
consider when looking at a service contract:
Hours of coverage
Response time
Parts availability
Available budget
Hardware to be covered
We explore each of these details more closely in the following
sections.
Different service contracts are available to meet different
needs; one of the big variables between different contracts relates
to the hours of coverage. Unless you are willing to pay a premium
for the privilege, you cannot just call any time and expect to see
a technician at your door a short time later.
Instead, depending on your contract, you might find that you
cannot even phone the service company until a specific day/time, or
if you can, they will not dispatch a technician until the day/time
specified for your contract.
Most hours of coverage are defined in terms of the hours and the
days during which a technician may be dispatched. Some of the more
common hours of coverage are:
Monday through Friday, 09:00 to 17:00
Monday through Friday, 12/18/24 hours each day (with the start
and stop times mutually agreed upon)
Monday through Saturday (or Monday through Sunday), with the same
times as above
As you might expect, the cost of a contract increases with the
hours of coverage. In general, extending the coverage Monday
through Friday tends to cost less than adding Saturday and Sunday
coverage.
But even here there is a possibility of reducing costs if you
are willing to do some of the work.
If your situation does not require anything more than the
availability of a technician during standard business hours and you
have sufficient experience to be able to determine what is broken,
you might consider looking at depot
service. Known by many names (including walk-in service and drop-off
service), manufacturers may have service depots where
technicians work on hardware brought in by customers.
Depot service has the benefit of being as fast as you are. You
do not have to wait for a technician to become available and show
up at your facility. Depot technicians do not go out on customer
calls, meaning that there will be someone to work on your hardware
as soon as you can get it to the depot.
Because depot service is done at a central location, there is a
good chance that any required parts will be available. This can
eliminate the need for an overnight shipment or waiting for a part
to be driven several hundred miles from another office that just
happened to have that part in stock.
There are some trade-offs, however. The most obvious is that you
cannot choose the hours of service — you get service when the
depot is open. Another aspect to this is that the technicians do
not work past their quitting time, so if your system failed at
16:30 on a Friday and you got the system to the depot by 17:00, it
will not be worked on until the technicians arrive at work the
following Monday morning.
Another trade-off is that depot service depends on having a
depot nearby. If your organization is located in a metropolitan
area, this is likely not going to be a problem. However,
organizations in more rural locations may find that a depot is a
long drive away.
If considering depot service, take a moment and consider the
mechanics of actually getting the hardware to the depot. Will you
be using a company vehicle or your own? If your own, does your
vehicle have the necessary space and load capacity? What about
insurance? Will more than one person be necessary to load and
unload the hardware?
Although these are rather mundane concerns, they should be
addressed before making the decision to use depot service.
In addition to the hours of coverage, many service agreements
specify a level of response time. In other words, when you call
requesting service, how long will it be before a technician
arrives? As you might imagine, a faster response time equates to a
more expensive service agreement.
There are limits to the response times that are available. For
instance, the travel time from the manufacturer's office to your
facility has a large bearing on the response times that are
possible. Response times in the four-hour range
are usually considered among the quicker offerings. Slower response
times can range from eight hours (which effectively becomes "next
day" service for a standard business hours agreement), to 24 hours.
As with every other aspect of a service agreement, even these times
are negotiable — for the right price.
Although it is not a common occurrence, you should be aware that
service agreements with response time clauses can sometimes stretch
a manufacturer's service organization beyond its ability to
respond. It is not unheard of for a very busy service organization
to send somebody — anybody — on
a short response-time service call just to meet their response time
commitment. This person appears to diagnose the problem, calling
"the office" to have someone bring "the right part."
In fact, they are just waiting until someone who is actually
capable of handling the call arrives.
While it might be understandable to see this happen under
extraordinary circumstances (such as power problems that have
damaged systems throughout their service area), if this is a
consistent method of operation you should contact the service
manager and demand an explanation.
If your response time needs are stringent (and your budget
correspondingly large), there is one approach that can cut your
response times even further — to zero.
Given the appropriate situation (you are one of the biggest
customers in the area), sufficient need (downtime of any magnitude is unacceptable), and financial
resources (if you have to ask for the price, you probably cannot
afford it), you might be a candidate for a full-time, on-site
technician. The benefits of having a technician always standing by
are obvious.
As you might expect, this option can be very expensive, particularly if you require an
on-site technician 24x7. But if this approach is appropriate for
your organization, you should keep a number of points in mind in
order to gain the most benefit.
First, on-site technicians need many of the resources of a
regular employee, such as a workspace, telephone, appropriate
access cards and/or keys, and so on.
On-site technicians are not very helpful if they do not have the
proper parts. Therefore, make sure that secure storage is set aside
for the technician's spare parts. In addition, make sure that the
technician keeps a stock of parts appropriate for your
configuration and that those parts are not routinely "cannibalized"
by other technicians for their customers.
Obviously, the availability of parts plays a large role in
limiting your organization's exposure to hardware failures. In the
context of a service agreement, the availability of parts takes on
another dimension, as the availability of parts applies not only to
your organization, but to any other customer in the manufacturer's
territory that might need those parts as well. Another organization
that has purchased more of the manufacturer's hardware than you
might get preferential treatment when it comes to getting parts
(and technicians, for that matter).
Unfortunately, there is little that can be done in such
circumstances, short of working out the problem with the service
manager.
As outlined above, service contracts vary in price according to
the nature of the services being provided. Keep in mind that the
costs associated with a service contract are a recurring expense;
each time the contract is due to expire you must negotiate a new
contract and pay again.
Here is an area where you might be able to help keep costs to a
minimum. Consider for a moment that you have negotiated a service
agreement that has an on-site technician 24x7, on-site spares
— you name it. Every single piece of hardware you have
purchased from this vendor is covered, including the PC that the
company receptionist uses for non-critical tasks.
Does that PC really need to have someone
on-site 24x7? Even if the PC is vital to the receptionist's job,
the receptionist only works from 09:00 to 17:00; it is highly
unlikely that:
The PC will be in use from 17:00 to 09:00 the next morning (not
to mention weekends)
A failure of this PC will be noticed, except between 09:00 and
17:00
Therefore, paying on the chance that this PC might need to be
serviced in the middle of a Saturday night is a waste of money.
The thing to do is to split up the service agreement such that
non-critical hardware is grouped separately from more critical
hardware. In this way, costs can be kept as low as possible.
If you have twenty identically-configured servers that are
critical to your organization, you might be tempted to have a
high-level service agreement written for only one or two, with the
rest covered by a much less expensive agreement. Then, the
reasoning goes, no matter which one of the servers fails on a
weekend, you will say that it is the one
eligible for high-level service.
Do not do this. Not only is it dishonest, but most manufacturers
keep track of such things by using serial numbers. Even if you
figure out a way around such checks, you will spend far more after
being discovered than you would by being honest and paying for the
service you really need.
Software failures can result in extended downtimes. For example,
owners of a certain brand of computer systems noted for their
high-availability features recently experienced this firsthand. A
bug in the time handling code of the computer's operating system
resulted in each customer's systems crashing at a certain time of a
certain day. While this particular situation is a more spectacular
example of a software failure in action, other software-related
failures may be less dramatic, but still as devastating.
Software failures can strike in one of two areas:
The operating system
Applications
Each type of failure has its own specific impact and is explored
in more detail in the following sections.
In this type of failure, the operating system is responsible for
the disruption in service. Operating system failures come from two
areas:
Crashes
Hangs
The main thing to keep in mind about operating system failures
is that they take out everything that the computer was running at
the time of the failure. As such, operating system failures can be
devastating to production.
Crashes occur when the operating system experiences an error
condition from which it cannot recover. The reasons for crashes can
range from an inability to handle an underlying hardware problem to
a bug in the kernel-level code comprising the operating system.
When an operating system crashes, the system must be rebooted in
order to continue production.
When the operating system stops handling system events, the
system grinds to a halt. This is known as a hang. Hangs can be caused by deadlocks (two resource consumers contending for
resources the other has) and livelocks
(two or more processes responding to each other's activities, but
doing no useful work), but the end result is the same — a
complete lack of productivity.
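Although hangs here are described at the kernel level, the
deadlock pattern itself is easy to reproduce in miniature. The
following Python sketch is an illustration only (thread names and
timeouts are arbitrary choices): two threads acquire the same pair
of locks in opposite order, and the timeout stands in for the
infinite wait of a true deadlock.

    import threading, time

    lock_a, lock_b = threading.Lock(), threading.Lock()

    def worker(first, second, name):
        with first:
            time.sleep(0.1)           # ensure both threads hold one lock
            # Each thread now wants the lock the other is holding.
            if second.acquire(timeout=2):
                second.release()
                print(name, "finished normally")
            else:
                print(name, "gave up waiting -- a deadlock in miniature")

    # Opposite acquisition orders are what invite the deadlock.
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "thread-1"))
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "thread-2"))
    t1.start(); t2.start(); t1.join(); t2.join()

Imposing a single global lock-acquisition order (always lock_a
before lock_b) removes this deadlock entirely; a livelock is the
analogous case in which both threads keep retrying and backing off
in lockstep, busy but never progressing.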
Unlike operating system failures, application failures can be
more limited in the scope of their damage. Depending on the
specific application, a single application failing might impact
only one person. On the other hand, if it is a server application
servicing a large population of client applications, the
consequences of a failure would be much more widespread.
Application failures, like operating system failures, can be due
to hangs and crashes; the only difference is that here it is the
application that is hanging or crashing.
Just as hardware vendors provide support for their products,
many software vendors make support packages available to their
customers. Except for the obvious differences (no spare hardware is
required, and most of the work can be done by support personnel
over the phone), software support contracts can be quite similar to
hardware support contracts.
The level of support provided by a software vendor can vary.
Here are some of the more common support strategies employed
today:
Documentation
Self support
Web or email support
Telephone support
On-site support
Each type of support is described in more detail in the following
sections.
Although often overlooked, software documentation can serve as a
first-level support tool. Whether online or printed, documentation
often contains the information necessary to resolve many
issues.
Self support relies on the customer using online resources to
resolve their own software-related issues. Quite often these
resources take the form of Web-based FAQs (Frequently Asked
Questions) or knowledge bases.
FAQs often have little or no selection capabilities, leaving the
customer to scroll through question after question in the hopes of
finding one that addresses the issue at hand. Knowledge bases tend
to be somewhat more sophisticated, allowing the entry of search
terms. Knowledge bases can also be quite extensive in scope, making
them good tools for resolving problems.
Many times what looks like a self support website also includes
Web-based forms or email addresses that make it possible to send
questions to support staff. While this might at first glance appear
to be an improvement over a good self support website, it really
depends on the people answering the email.
If the support staff is overworked, it is difficult to get the
necessary information from them, as their main concern is to
quickly respond to each email and move on to the next one. The
reason is that nearly all support personnel are evaluated by the
number of issues they resolve. Escalation of
issues is also difficult because there is little that can be done
within an email to encourage more timely and helpful responses
— particularly when the person reading your email is in a
hurry to move on to the next one.
The way to get the best service is to make sure that your email
addresses all the questions that a support technician might ask,
such as:
Clearly describe the nature of the problem
Include all pertinent version numbers
Describe what you have already done in an attempt to address the
problem (applied the latest patches, rebooted with a minimal
configuration, and so on)
By giving the support technician more information, you stand a
better chance of getting the support you need.
As the name implies, telephone support entails speaking to a
support technician via telephone. This style of support is most
similar to hardware support, in that there can be various levels of
support available (with different hours of coverage, response time,
and so on).
Also known as on-site consulting, on-site software support is
normally reserved for resolving specific issues or making critical
changes, such as initial software installation and configuration,
major upgrades, and so on. As expected, this is the most expensive
type of software support available.
Still, there are instances where on-site support makes sense. As
an example, consider a small organization with a single system
administrator. The organization is going to be deploying its first
database server, but the deployment (and the organization) is not
large enough to justify hiring a dedicated database administrator.
In this situation, it can often be cheaper to bring in a specialist
from the database vendor to handle the initial deployment (and
occasionally later on, as the need arises) than it would be to
train the system administrator in a skill that will be seldom
used.
Even though the hardware may be running perfectly, and even
though the software may be configured properly and is working as it
should, problems can still occur. The most common problems that
occur outside of the system itself have to do with the physical
environment in which the system resides.
Environmental issues can be broken into four major categories:
Building integrity
Electricity
Air conditioning
Weather and the outside world
For such a seemingly simple structure, a building performs a
great many functions. It provides shelter from the elements. It
provides the proper micro-climate for the building's contents. It
has mechanisms to provide power and to protect against fire, theft,
and vandalism. Performing all these functions, it is not surprising
that there is a great deal that can go wrong with a building. Here
are some possibilities to consider:
Roofs can leak, allowing water into data centers.
Various building systems (such as water, sewer, or air handling)
can fail, rendering the building uninhabitable.
Floors may have insufficient load-bearing capacity to hold the
equipment you want to put in the data center.
It is important to have a creative mind when it comes to
thinking about the different ways buildings can fail. The list
above is only meant to start you thinking along the proper
lines.
Because electricity is the lifeblood of any computer system,
power-related issues are paramount in the mind of system
administrators everywhere. There are several different aspects to
power; they are covered in more detail in the following
sections.
First, it is necessary to determine how secure your normal power
supply may be. Just like nearly every other data center, you
probably obtain your power from a local power company via power
transmission lines. Because of this, there are limits to what you
can do to make sure that your primary power supply is as secure as
possible.
Organizations located near the boundaries of a power company
might be able to negotiate connections to two different power
grids.
The costs involved in running power lines from the neighboring
grid are sizable, making this an option only for larger
organizations. However, such organizations often find that the
redundancy gained outweighs the costs.
The main things to check are the methods by which the power is
brought onto your organization's property and into the building.
Are the transmission lines above ground or below? Above-ground
lines are susceptible to:
Damage from extreme weather conditions (ice, wind, and lightning)
Traffic accidents that damage the poles and/or transformers
Animals straying into the wrong place and shorting out the
lines
However, below-ground lines have their own unique problems, such
as damage from construction crews digging in the wrong place,
flooding, and (though far less often than above-ground lines)
lightning.
Continue to trace the power lines into your building. Do they
first go to an outside transformer? Is that transformer protected
from vehicles backing into it or trees falling on it? Are all
exposed shutoff switches protected against unauthorized use?
Once inside your building, could the power lines (or the panels
to which they attach) be subject to other problems? For instance,
could a plumbing problem flood the electrical room?
Continue tracing the power into the data center; is there
anything else that could unexpectedly interrupt your power supply?
For example, is the data center sharing one or more circuits with
non-data center loads? If so, the external load might one day trip
the circuit's overload protection, taking down the data center as
well.
It is not enough to ensure that the data center's power source
is as secure as possible. You must also be concerned with the
quality of the power being distributed throughout the data center.
There are several factors that must be considered:
The voltage of the incoming power must be stable, with no
voltage reductions (often called sags,
droops, or brownouts) or voltage increases (often known as
spikes and surges).
The waveform must be a clean sine wave, with minimal THD (Total Harmonic Distortion).
The frequency must be stable (most countries use a power
frequency of either 50Hz or 60Hz).
The power must not include any RFI
(Radio Frequency Interference) or EMI
(Electro-Magnetic Interference) noise.
The power must be supplied at a current rating sufficient to run
the data center.
Power supplied directly from the power company does not normally
meet the standards necessary for a data center. Therefore, some
level of power conditioning is usually required. There are several
different approaches possible:
- Surge Protectors
Surge protectors do just what their name implies — they
filter surges from the power supply. Most do nothing else, leaving
equipment vulnerable to damage from other power-related
problems.
- Power Conditioners
Power conditioners attempt a more comprehensive approach;
depending on the sophistication of the unit, power conditioners
often can take care of most of the types of problems outlined
above.
- Motor-Generator Sets
A motor-generator set is essentially a large electric motor
powered by your normal power supply. The motor is attached to a
large flywheel, which is, in turn, attached to a generator. The
motor turns the flywheel and generator, which generates electricity
in sufficient quantities to run the data center. In this way, the
data center power is electrically isolated from outside power,
meaning that most power-related problems are eliminated. The
flywheel also provides the ability to maintain power through short
outages, as it takes several seconds for the flywheel to slow to
the point at which it can no longer generate power (a rough
estimate of this ride-through time is sketched after this list).
- Uninterruptible Power Supplies
Some types of Uninterruptible Power Supplies (more commonly
known as UPSs) include most (if not all)
of the protection features of a power conditioner.
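To get a feel for the flywheel ride-through described for
motor-generator sets above, here is a back-of-the-envelope
calculation in Python. Every figure (inertia, speed, load, and
usable energy fraction) is an invented example rather than data
from any real product; the physics is simply the kinetic energy of
a rotating mass.

    import math

    inertia = 150.0                       # flywheel moment of inertia, kg*m^2
    rpm = 1800.0                          # normal rotational speed
    omega = rpm * 2.0 * math.pi / 60.0    # convert to radians per second

    stored_joules = 0.5 * inertia * omega ** 2    # E = 1/2 * I * omega^2

    load_watts = 250_000.0                # hypothetical data center load
    usable_fraction = 0.25                # before speed (and output) sags
    ride_through = stored_joules * usable_fraction / load_watts
    print(f"Ride-through: about {ride_through:.1f} seconds")    # ~2.7 s

The result, a few seconds, is exactly the window described above:
long enough to ride out the most common momentary outages, but
nothing more.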
With the last two technologies listed above, we have started in
on the topic most people think of when they think about power
— backup power. In the next section, different approaches to
providing backup power are explored.
One power-related term that nearly everyone has heard is the
term blackout. A blackout is a complete
loss of electrical power and may last from a fraction of a second
to weeks.
Because the length of blackouts can vary so greatly, it is
necessary to approach the task of providing backup power using
different technologies for power outages of different lengths.
The most frequent blackouts last, on average, no more than a few
seconds; longer outages are much less frequent. Therefore,
concentrate first on protecting against blackouts of only a few
minutes in duration, then work out methods of reducing your
exposure to longer outages.
Since the majority of outages last only a few seconds, your
backup power solution must have two primary characteristics:
A very short time to switch over to backup power (known as the
transfer time)
A runtime (the length of time backup power is available) measured
in seconds to minutes
The backup power solutions that match these characteristics are
motor-generator sets and UPSs. The flywheel in the motor-generator
set allows the generator to continue producing electricity for
enough time to ride out outages of a second or so. Motor-generator
sets tend to be quite large and expensive, making them a practical
solution only for mid-sized and larger data centers.
However, another technology — called a UPS — can
fill in for those situations where a motor-generator set is too
expensive. It can also handle longer outages.
UPSs can be purchased in a variety of sizes — small enough
to run a single low-end PC for five minutes or large enough to
power an entire data center for an hour or more.
UPSs are made up of the following parts:
A transfer switch for switching from
the primary power supply to the backup power supply
A battery, for providing backup power
An inverter, which converts the DC
current from the battery into the AC current required by the data
center
Apart from the size and battery capacity of the unit, UPSs come
in two basic types:
The offline UPS uses its inverter to
generate power only when the primary power supply fails.
The online UPS uses its inverter to
generate power all the time, powering the inverter via its battery
only when the primary power supply fails.
Each type has its advantages and disadvantages. The offline
UPS is usually less expensive, because the inverter does not have
to be constructed for full-time operation. However, a problem in
the inverter of an offline UPS will go unnoticed (until the next
power outage, that is).
Online UPSs tend to be better at providing clean power to your
data center; after all, an online UPS is essentially generating
power for you full time.
But no matter what type of UPS you choose, you must properly
size the UPS to your anticipated load (thereby ensuring that the
UPS has sufficient capacity to produce electricity at the required
voltage and current), and you must
determine how long you would like to be able to run your data
center on battery power.
To determine this information, you must first identify those
loads that are to be serviced by the UPS. Go to each piece of
equipment and determine how much power it draws (this is normally
listed on a label near the unit's power cord). Write down the
voltage, watts, and/or amps. Once you have these figures for all of
the hardware, you must convert them to VA
(Volt-Amps). If you have a wattage number, you can use the listed
wattage as the VA; if you have amps, multiply it by volts to get
VA. By adding the VA figures you can arrive at the approximate VA
rating required for the UPS.
Strictly speaking, this approach to calculating VA is not
entirely correct; however, to get the true VA you would need to
know the power factor for each unit, and this information is
rarely, if ever, provided. In any case, the VA numbers obtained
from this approach reflect worst-case values, leaving a large
margin of error for safety.
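As a concrete illustration of the tally just described, here is a
short Python sketch. The equipment list and its figures are
hypothetical; in practice you would copy the volts, amps, or watts
from each unit's power label.

    # (description, volts, amps, watts) -- use None for whichever
    # figure a given power label does not provide.
    loads = [
        ("server-1",   230, 1.5,  None),
        ("server-2",   230, None, 300),
        ("disk array", 230, 2.0,  None),
    ]

    total_va = 0.0
    for name, volts, amps, watts in loads:
        va = volts * amps if amps is not None else watts   # amps x volts = VA
        total_va += va
        print(f"{name}: {va:.0f} VA")

    # Treating listed wattage as VA ignores the power factor, so the
    # total errs on the high (safe) side, as noted above.
    print(f"Minimum UPS rating: {total_va:.0f} VA")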
Determining runtime is more of a business question than a
technical question — what sorts of outages are you willing to
protect against, and how much money are you prepared to spend to do
so? Most sites select runtimes that are less than an hour or two at
most, as battery-backed power becomes very expensive beyond this
point.
Once we get into power outages that are measured in days, the
choices get even more expensive. The technologies capable of
handling long-term power outages are limited to generators powered
by some type of engine — diesel and gas turbine, primarily.
Keep in mind that engine-powered generators require regular
refueling while they are running. You should know your generator's
fuel "burn" rate at maximum load and arrange fuel deliveries
At this point, your options are wide open, assuming your
organization has sufficient funds. This is also an area where
experts should help you determine the best solution for your
organization. Very few system administrators have the specialized
knowledge necessary to plan the acquisition and deployment of these
kinds of power generation systems.
Portable generators of all sizes can be rented, making it
possible to have the benefits of generator power without the
initial outlay of money necessary to purchase one. However, keep in
mind that in disasters affecting your general vicinity, rented
generators will be in very short supply and very expensive.
While a blackout of five minutes is little more than an
inconvenience to the personnel in a darkened office, what about an
outage that lasts an hour? Five hours? A day? A week?
The fact is, even if the data center is operating normally, an
extended outage will eventually affect your organization. Consider
the following points:
What if there is no power to maintain environmental control in
the data center?
What if there is no power to maintain environmental control in
the entire building?
What if there is no power to operate personal workstations, the
telephone system, the lights?
The point here is that your organization must determine at what
point an extended outage will just have to be tolerated. Or if that
is not an option, your organization must reconsider its ability to
function completely independently of on-site power for extended
periods, meaning that very large generators will be needed to power
the entire building.
Of course, even this level of planning cannot take place in a
vacuum. It is very likely that whatever caused the extended outage
is also affecting the world outside your organization, and that the
outside world will start having an effect on your organization's
ability to continue operations, even given unlimited power
generation capacity.
The Heating, Ventilation, and Air Conditioning (HVAC) systems used in today's office buildings are
incredibly sophisticated. Often computer controlled, the HVAC
system is vital to providing a comfortable work environment.
Data centers usually have additional air handling equipment,
primarily to remove the heat generated by the many computers and
associated equipment. Failures in an HVAC system can be devastating
to the continued operation of a data center. And given their
complexity and electro-mechanical nature, the possibilities for
failure are many and varied. Here are a few examples:
The air handling units (essentially large fans driven by large
electric motors) can fail due to electrical overload, bearing
failure, belt/pulley failure, etc.
The cooling units (often called chillers) can lose their refrigerant due to leaks,
or they can have their compressors and/or motors seize.
HVAC repair and maintenance is a very specialized field —
a field that the average system administrator should leave to the
experts. If anything, a system administrator should make sure that
the HVAC equipment serving the data center is checked for normal
operation on a daily basis (if not more frequently) and is
maintained according to the manufacturer's guidelines.
There are some types of weather that can cause problems for a
data center:
Heavy snow and ice can prevent personnel from getting to the
data center, and can even clog air conditioning condensers,
resulting in elevated data center temperatures just when no one is
able to get to the data center to take corrective action.
High winds can disrupt power and communications, with extremely
high winds actually doing damage to the building itself.
There are other types of weather that can still cause problems,
even if they are not as well known. For example, exceedingly high
temperatures can result in overburdened cooling systems, and
brownouts or blackouts as the local power grid becomes
overloaded.
Although there is little that can be done about the weather,
knowing the way that it can affect your data center operations can
help you keep things running even when the weather turns
bad.
It has been said that computers really are perfect. The reasoning behind this statement is
that if you dig deeply enough, behind every computer error you will
find the human error that caused it. In this section, the more
common types of human errors and their impacts are explored.
The users of a computer can make mistakes that can have serious
impact. However, due to their normally unprivileged operating
environment, user errors tend to be localized in nature. Because
most users interact with a computer exclusively through one or more
applications, it is within applications that most end-user errors
occur.
When applications are used improperly, various problems can
occur:
Files inadvertently overwritten
Wrong data used as input to an application
Files not clearly named and organized
Files accidentally deleted
The list could go on, but this is enough to illustrate the
point. Due to users not having super-user privileges, the mistakes
they make are usually limited to their own files. As such, the best
approach is two-pronged:
Educate users in the proper use of their applications and in
proper file management techniques
Make sure backups of users' files are made regularly and that
the restoration process is as streamlined and quick as possible
Beyond this, there is little that can be done to keep user
errors to a minimum.
Operators have a more in-depth relationship with an
organization's computers than end-users. Where end-user errors tend
to be application-oriented, operators tend to perform a wider range
of tasks. Although the nature of the tasks has been dictated by
others, some of these tasks can include the use of system-level
utilities, where the potential for widespread damage due to errors
is greater. Therefore, the types of errors that an operator might
make center on the operator's ability to follow the procedures that
have been developed for the operator's use.
Operators should have sets of procedures documented and
available for nearly every action they perform. It might
be that an operator does not follow the procedures as they are laid
out. There can be several reasons for this:
The environment was changed at some time in the past, and the
procedures were never updated. Now the environment changes again,
rendering the operator's memorized procedure invalid. At this
point, even if the procedures were updated (which is unlikely,
given the fact that they were not updated before) the operator will
not be aware of it.
The environment was changed, and no procedures exist. This is
just a more out-of-control version of the previous situation.
The procedures exist and are correct, but the operator will not
(or cannot) follow them.
Depending on the management structure of your organization, you
might not be able to do much more than communicate your concerns to
the appropriate manager. In any case, making yourself available to
do what you can to help resolve the problem is the best
approach.
Even if the operator follows the procedures, and even if the
procedures are correct, it is still possible for mistakes to be
made. If this happens, the possibility exists that the operator is
careless (in which case the operator's management should become
involved).
Another explanation is that it was just a mistake. In these
cases, the best operators realize that something is wrong and seek
assistance. Always encourage the operators you work with to contact
the appropriate people immediately if they suspect something is
wrong. Although many operators are highly-skilled and able to
resolve many problems independently, the fact of the matter is that
this is not their job. And a problem that is made worse by a
well-meaning operator harms both that person's career and your
ability to quickly resolve what might originally have been a small
problem.
Unlike operators, system administrators perform a wide variety
of tasks using an organization's computers. Also unlike operators,
the tasks that system administrators perform are often not based
on documented procedures.
Therefore, system administrators sometimes make unnecessary work
for themselves when they are not careful about what they are doing.
During the course of carrying out day-to-day responsibilities,
system administrators have more than sufficient access to the
computer systems (not to mention their super-user access
privileges) to mistakenly bring systems down.
System administrators either make errors of misconfiguration or
errors during maintenance.
System administrators must often configure various aspects of a
computer system. This configuration might include:
Email
User accounts
Network
Applications
The list could go on quite a bit longer. The actual task of
configuration varies greatly; some tasks require editing a text
file (using any one of a hundred different configuration file
syntaxes), while other tasks require running a configuration
utility.
The fact that these tasks are all handled differently is merely
an additional challenge to the basic fact that each configuration
task itself requires different knowledge. For example, the
knowledge required to configure a mail transport agent is
fundamentally different from the knowledge required to configure a
new network connection.
Given all this, perhaps it should be surprising that so
few mistakes are actually made. In any
case, configuration is, and will continue to be, a challenge for
system administrators. Is there anything that can be done to make
the process less error-prone?
The common thread of every configuration change is that some
sort of a change is being made. The change may be large, or it may
be small. But it is still a change and should be treated in a
particular way.
Many organizations implement some type of change control
process. The intent is to help system administrators (and all
parties affected by the change) to manage the process of change and
to reduce the organization's exposure to any errors that may
occur.
A change control process normally breaks the change into
different steps. Here is an example:
- Preliminary research
Preliminary research attempts to clearly define:
The nature of the change to take place
Its impact, should the change succeed
A fallback position, should the change fail
An assessment of what types of failures are possible
Preliminary research might include testing the proposed change
during a scheduled downtime, or it may go so far as to include
implementing the change first on a special test environment run on
dedicated test hardware.
- Scheduling
The change is examined with an eye toward the actual mechanics
of implementation. The scheduling being done includes outlining the
sequencing and timing of the change (along with the sequencing and
timing of any steps necessary to back the change out should a
problem arise), as well as ensuring that the time allotted for the
change is sufficient and does not conflict with any other
scheduled activity.
The product of this process is often a checklist of steps for
the system administrator to use while making the change. Included
with each step are instructions to perform in order to back out the
change should the step fail. Estimated times are often included,
making it easier for the system administrator to determine whether
the work is on schedule or not (a sketch of such a checklist
appears after these steps).
- Execution
At this point, the actual execution of the steps necessary to
implement the change should be straightforward and anti-climactic.
The change is either implemented, or (if trouble crops up) it is
backed out.
- Monitoring
Whether the change is implemented or not, the environment is
monitored to make sure that everything is operating as it
should.
- Documenting
If the change has been implemented, all existing documentation
is updated to reflect the changed configuration.
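As a sketch of what the checklist produced during scheduling might
look like in machine-readable form, here is a minimal Python
example; the structure and the sample steps are illustrative
assumptions, not a prescribed format.

    from dataclasses import dataclass

    @dataclass
    class Step:
        action: str        # what the administrator does
        backout: str       # how to undo this step should it fail
        est_minutes: int   # helps track whether work is on schedule

    checklist = [
        Step("Stop the mail transport agent",
             "Restart the service", 5),
        Step("Install the new configuration file",
             "Restore the saved copy of the old file", 10),
        Step("Start the service and verify delivery",
             "Back out the new file and restart", 15),
    ]

    for i, step in enumerate(checklist, 1):
        print(f"{i}. {step.action} (~{step.est_minutes} min; "
              f"backout: {step.backout})")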
Obviously, not all configuration changes require this level of
detail. Creating a new user account should not require any
preliminary research, and scheduling would likely consist of
determining whether the system administrator has a spare moment to
create the account. Execution would be similarly quick; monitoring
might consist of ensuring that the account was usable, and
documenting would probably entail sending an email to the new
user.
But as configuration changes become more complex, a more formal
change control process becomes necessary.
This type of error can be insidious because there is usually so
little planning and tracking done during day-to-day
maintenance.
System administrators see the results of this kind of error
every day, especially from the many users that swear they did not
change a thing — the computer just broke. The user that says
this usually does not remember what they did, and when the same
thing happens to you, you may not remember what you did,
either.
The key thing to keep in mind is that you must be able to
remember what changes you made during maintenance if you are to be
able to resolve any problems quickly. A full-blown change control
process is not realistic for the hundreds of small things done over
the course of a day. What can be done to keep track of the 101
small things a system administrator does every day?
The answer is simple — take notes. Whether it is done in
a paper notebook, a PDA, or as comments in the affected files, take
notes. By tracking what you have done, you stand a better chance of
seeing a failure as being related to a change you recently
made.
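Those notes can be as low-tech as a paper notebook, but even a few
lines of Python will do the job. This sketch appends one
timestamped entry per change to a plain-text log; the script name
and log location are arbitrary choices for illustration.

    #!/usr/bin/env python3
    # note.py -- append one timestamped line per change to a log file.
    import sys, time

    LOG = "/var/log/admin-changes.log"    # any writable path will do

    entry = " ".join(sys.argv[1:])
    with open(LOG, "a") as log:
        log.write(time.strftime("%Y-%m-%d %H:%M:%S") + "  " + entry + "\n")

    # Usage:  ./note.py "raised MaxClients from 150 to 256 in httpd.conf"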
Sometimes the very people that are supposed to help you keep
your systems running reliably can actually make things worse. This
is not due to any conspiracy; it is just that anyone working on any
technology for any reason risks rendering that technology
inoperable. The same effect is at work when programmers fix one bug
but end up creating another.
Consider first the case of a bad repair: the technician either
failed to correctly diagnose the problem and made an unnecessary
(and useless) repair, or the diagnosis was correct, but the repair
was not carried out properly.
It may be that the replacement part was itself defective, or that
the proper procedure was not followed when the repair was carried
out.
This is why it is important to be aware of what the technician
is doing at all times. By doing this, you can keep an eye out for
failures that seem to be related to the original problem in some
way. This keeps the technician on track should there be a problem;
otherwise there is a chance that the technician will view this
fault as being new and unrelated to the one that was supposedly
fixed. In this way, time is not wasted chasing the wrong
problem.
Sometimes, even though a problem was diagnosed and repaired
successfully, another problem pops up to take its place. The CPU
module was replaced, but the anti-static bag it came in was left in
the cabinet, blocking the fan and causing an over-temperature
shutdown. Or the failing disk drive in the RAID array was replaced,
but because a connector on another drive was bumped and
accidentally disconnected, the array is still down.
These things might be the result of chronic carelessness or an
honest mistake. It does not matter. What you should always do is to
carefully review the repairs made by the technician and ensure that
the system is working properly before letting the technician
leave.