21.7. Hanging Processes: Detection and Diagnostics
Sometimes an httpd process might hang in the
middle of processing a request. This may happen because of a bug in
the code, such as being stuck in a while loop. Or
it may be blocked in a system call, perhaps waiting indefinitely for
an unavailable resource. To fix the problem, we need to learn in what
circumstances the process hangs, so that we can reproduce the
problem, which will allow us to uncover its cause.
21.7.1. Hanging Because of an Operating System Problem
Sometimes you can find a process hanging
because of some kind of system problem. For example, if the processes
was doing some disk I/O operation, it might get stuck in
uninterruptible sleep ('D' disk wait in
ps report, 'U' in
top), which indicates either that something is
broken in your kernel or that you're using NFS.
Also, usually you find that you cannot kill -9
this process.
Another process that cannot be killed with kill
-9 is a zombie process ('Z' disk wait
in ps report, <defunc>
in top), in which case the process is already
dead and Apache didn't wait on it properly (of
course, it can be some other process not related to Apache).
In the case of disk wait, you can actually get
the wait channel from ps
and look it up in your kernel symbol table to find out what resource
it was waiting on. This might point the way to what component of the
system is misbehaving, if the problem occurs frequently.
21.7.2. When a Process Might Hang
In Chapter 19, we discussed the concept of deadlock. This
can happen when two processes are each trying to acquire locks on the
resources held by the other. Neither process will release the lock it
currently holds, and thus neither can acquire a lock on the second
resource it desires.
This scenario is a very good candidate for code that might lead to a
hanging process. Since usually the deadlock cannot be resolved
without manual intervention, the two processes will hang, doing
nothing and wasting system resources, while trying to acquire locks.
An infinite loop might lead to a hanging process as well. Moreover,
such a loop will usually consume a lot of CPU resources and memory.
You should be very careful when using while and
similar loop constructs that are capable of creating endless loops.
A process relying on some external resource, for example when
accessing a file over NFS, might hang if the mounted partition it
tries to access is not available. Usually it takes a long time before
the request times out, and in the meantime the process may hang.
There are many other reasons that a process might hang, but these are
some of the most common.
21.7.3. Detecting Hanging Processes
It's not so easy to
detect hanging processes. There is no way you can tell how long the
request is taking to process by using plain system utilities such as
ps and top. The reason is
that each Apache process serves many requests without quitting.
System utilities can tell how long the process has been running since
its creation, but this information is useless in our case, since
Apache processes normally run for extended periods.
However, there are a few approaches that can help to detect a hanging
process. If the hanging process happens to demand lots of resources,
it's quite easy to spot it by using the
top utility. You will see the same
process show up in the first few lines of the automatically refreshed
report. (But often the hanging process uses few resources—e.g.,
when waiting for some event to happen.)
Another easy case is when some process thrashes the
error_log file, writing millions of error
messages there. Generally this process uses lots of resources and is
also easily spotted by using top.
Two other tools that report the status of Apache processes are the
mod_status
module, which is usually accessed from the
/server_status location, and the
Apache::VMonitor module, covered in Chapter 5.
Both tools provide counters of requests processed per Apache process.
You can watch the report for a few minutes and try to spot any
process with an unchanging number of processed requests and a status
of W (waiting). This means that it has hung.
But if you have 50 processes, it can be quite hard to spot such a
process.
Apache::Watchdog::RunAway is a hanging-processes monitor and
terminator that implements this feature and could be used to solve
this kind of problem. It's covered in Chapter 5.
If you have a really bad problem, where processes hang one after the
other, the time will come when the number of hanging processes is
equal to the value of MaxClients. This means that
no more processes will be spawned. As far as the users are concerned,
your server will be down. It is easy to detect this situation,
attempt to resolve it, and notify the administrator using a simple
crontabwatchdog that periodically requests
some very light script (see Chapter 5).
In the watchdog, you set a timeout appropriate for your service,
which may be anything from a few seconds to a few minutes. If the
server fails to respond before the timeout expires, the watchdog
spots trouble and attempts to restart the server. After a restart an
email report is sent to the administrator saying that there was a
problem and whether or not the restart was successful.
If you get such reports constantly, something is wrong with your web
service and you should review your code. Note that
it's possible that your server is being overloaded
by more requests than it can handle, so the requests are being queued
and not processed for a while, which triggers the
watchdog's alarm. If this is the case, you may need
to add more servers or more memory, or perhaps split a single machine
across a cluster of machines.
21.7.4. Determination of the Reason
Given the PID, there are three ways to find out where the server is
hanging:
Deploy the Perl calls-tracing mechanism. This will allow you to spot
the location of the Perl code that triggers the problem.
Use a system calls-tracing utility such as
strace. This approach reveals low-level details
about the misbehavior of some part of the system.
Use an interactive debugger such as gdb. When the
process is stuck and you don't know what it was
doing just before it got stuck, using gdb you can
attach to this process and print its call stack, to reveal where the
last call originated. Just like with strace, you
see the C function call trace, not the Perl high-level function
calls.
21.7.4.1. Using the Perl trace
To see where an
httpd process is spinning, the first step is to
add the following to your startup file:
package Book::StartUp;
use Carp ( );
$SIG{'USR2'} = sub {
Carp::confess("caught SIGUSR2!");
};
The above code assigns a signal handler for the
USR2signal. This signal has been chosen because
it's unlikely to be used by the other server
components.
We can check the registered signal handlers with help of
Apache::Status. Using this code, if we fetch the
URL http://localhost/perl-status?sig we will
see:
USR2 = \&Book::StartUp::_ _ANON_ _
where Book::StartUp is the name of the package
declared in startup.pl.
After applying this server configuration, let's use
the simple code in Example 21-10, where
sleep(10000) will emulate a hanging process.
Example 21-10. debug/perl_trace.pl
local $|=1;
my $r = shift;
$r->send_http_header('text/plain');
print "[$$] Going to sleep\n";
hanging_sub( );
sub hanging_sub { sleep 10000; }
We execute the above script as
http://localhost/perl/debug/perl_trace.pl. In
the script we use $|=1; to unbuffer the STDOUT
stream and we get the PID from the $$special
variable.
Now we issue the kill command, using the PID we
have just seen printed to the browser's window:
panic% kill -USR2 PID
and watch this showing up in the error_log file:
caught SIGUSR2!
at /home/httpd/perl/startup/startup.pl line 32
Book::StartUp::_ _ANON_ _('USR2') called
at /home/httpd/perl/debug/perl_trace.pl line 6
Apache::ROOT::perl::debug::perl_trace_2epl::hanging_sub( ) called
at /home/httpd/perl/debug/perl_trace.pl line 5
Apache::ROOT::perl::debug::perl_trace_2epl::handler('Apache=SCALAR(0x8309d08)')
called
at /usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
line 140
eval {...} called
at /usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
line 140
Apache::Registry::handler('Apache=SCALAR(0x8309d08)') called
at PerlHandler subroutine `Apache::Registry::handler' line 0
eval {...} called
at PerlHandler subroutine `Apache::Registry::handler' line 0
We can clearly see that the process
"hangs" in the code executed at
line 6 of the
/home/httpd/perl/debug/perl_trace.pl script, and
it was called by the hanging_sub( ) routine
defined at line 5.
21.7.4.2. Using the system calls trace
Let's write
another similar mod_perl script that hangs, and deploy
strace to find the point at which it hangs (see
Example 21-11).
Example 21-11. hangme.pl
local $|=1;
my $r = shift;
$r->send_http_header('text/plain');
print "PID = $$\n";
my $i = 0;
while (1) {
$i++;
sleep 1;
}
The reason this simple code hangs is obvious. It never breaks from
the while loop. As you can see, it prints the PID
of the current process to the browser. Of course, in a real situation
you cannot use the same trick—in the previous section we
presented several ways to detect the runaway processes and their
PIDs.
We save the above code in a file and make a request. As usual, we use
$|=1; in our demonstration scripts to unbuffer
STDOUT so we will immediately see the process ID. Once the script is
requested, the script prints the PID and obviously hangs. So we press
the Stop button, but the process continues to hang in this code.
Isn't Apache supposed to detect the broken
connection and abort the request? Yes and no—you will
understand soon what's really happening.
First let's attach to the process and see what
it's doing. We use the PID the script printed to the
browser—in this case, it is 10045:
It isn't what we expected to see, is it? These are
some system calls we don't see in our little
example. What we actually see is how Perl translates our code into
system calls. We know that our code hangs in this snippet:
while (1) {
$i++;
sleep 1;
}
so these must be the system calls that represent this loop, since
they are printed repeatedly.
Usually the situation is different from the one we have shown. You
first detect the hanging process, then you attach to it and watch the
trace of calls it does (or observe the last few system calls if the
process is hanging waiting for something, as when blocking on a
file-lock request). From watching the trace you figure out what
it's actually doing, and probably find the
corresponding lines in your Perl code. For example,
let's see how one process hangs while requesting an
exclusive lock on a file that is exclusively locked by another
process (see Example 21-12).
Example 21-12. excl_lock.pl
use Fcntl qw(:flock);
use Symbol;
fork( ); # child and parent do the same operation
my $fh = gensym;
open $fh, ">/tmp/lock" or die "cannot open /tmp/lock: $!";
print "$$: I'm going to obtain the lock\n";
flock $fh, LOCK_EX;
print "$$: I've got the lock\n";
sleep 30;
close $fh;
The code is simple. The process executing the code forks a second
process, and both do the same thing: generate a unique symbol to be
used as a file handle, open the lock file for writing using the
generated symbol, lock the file in exclusive mode, sleep for 30
seconds (pretending to do some lengthy operation), and close the lock
file, which also unlocks the file.
The gensym function is imported from the
Symbol module. The Fcntl module
provides us with a symbolic constant, LOCK_EX.
This is imported via the :flock tag, which imports
this and other flock( ) constants.
The code used by both processes is identical, so we cannot predict
which one will get its hands on the lock file and succeed in locking
it first. Thus, we add print( )statements to find
the PID of the process blocking (waiting to get the lock) on a lock
request.
When the above code is executed from the command line, we see that
one of the processes gets the lock:
panic% perl ./excl_lock.pl
3038: I'm going to obtain the lock
3038: I've got the lock
3037: I'm going to obtain the lock
Here we see that process 3037 is blocking, so we attach to it:
panic% strace -p 3037
about to attach c10
flock(3, LOCK_EX
It's clear from the above trace that the process is
waiting for an exclusive lock. The missing closing parenthesis is not
a typo; it means that strace
didn't yet receive a return status from the call.
After spending time watching the running traces of different scripts,
you will learn to more easily recognize what Perl code is being
executed.
21.7.4.3. Using the interactive debugger
Another way to see a trace of the running
code is to use a debugger such as gdb (the GNU
debugger). It's supposed to work on any platform
that supports the GNU development tools. Its purpose is to allow you
to see what is going on inside a program while it executes, or what
it was doing at the moment it failed.
To trace the execution of a process, gdb needs to
know the PID and the path to the binary that the process is
executing. For Perl code, it's
/usr/bin/perl (or whatever the path to your Perl
is). For httpd processes, it's
the path to your httpd executable—often
the binary is called httpd, but
there's really no standard location for it.
Here are a few examples using gdb. First,
let's go back to our last locking example, execute
it as before, and attach to the process that didn't
get the lock:
panic% gdb /usr/bin/perl 3037
After starting the debugger, we execute the
where command to see the trace:
(gdb) where
#0 0x40209791 in flock ( ) from /lib/libc.so.6
#1 0x400e8dc9 in Perl_pp_flock ( ) at pp_sys.c:2033
#2 0x40099c56 in Perl_runops_debug ( ) at run.c:53
#3 0x4003118c in S_run_body (oldscope=1) at perl.c:1443
#4 0x40030c7e in perl_run (my_perl=0x804bf00) at perl.c:1365
#5 0x804953e in main (argc=3, argv=0xbffffac4, env=0xbffffad4)
at perlmain.c:52
#6 0x4018bcbe in _ _libc_start_main ( ) from /lib/libc.so.6
That's not what we may have expected to see (i.e., a
Perl stack trace). And now it's a different trace
from the one we saw when we were using strace.
Here we see the current state of the call stack, with main(
) at the bottom of the stack and flock(
) at the top.
We have to find out the place the code was called
from—it's possible that the code calls
flock( ) in several places, and we
won't be able to locate the place in the code where
the actual problem occurs without having this information. Therefore,
we again use the curinfo macro after loading it
from the .gdbinit file:
As we can see, the program was stuck at line 9 of
/home/httpd/perl/excl_lock.pl and
that's the place to look at to resolve the problem.
When you attach to a running process with gdb, the
program stops executing and control of the program is passed to the
debugger. You can continue the normal program run with the
continue command or execute it step by step with
the next and step commands,
which you type at the gdb prompt.
(next steps over any function calls in the
source, while step steps into them.)
The use of C/C++ debuggers is a large topic, beyond the scope of this
book. The gdb man and info pages are quite good.
You might also want to check ddd (the Data Display
Debugger), which provides a visual interface to
gdb and other debuggers. It even knows how to
debug Perl programs.
For completeness, let's see the
gdb trace of the httpd
process that's hanging in the
while(1) loop of the first example in this
section:
panic% gdb /home/httpd/httpd_perl/bin/httpd 1005
(gdb) where
#0 0x402251c1 in nanosleep ( ) from /lib/libc.so.6
#1 0x40225158 in sleep ( ) from /lib/libc.so.6
#2 0x4014d3a6 in Perl_pp_sleep ( ) at pp_sys.c:4187
#3 0x400f5c56 in Perl_runops_debug ( ) at run.c:53
#4 0x4008e088 in S_call_body (myop=0xbffff688, is_eval=0) at perl.c:1796
#5 0x4008dc4f in perl_call_sv (sv=0x82fc75c, flags=4) at perl.c:1714
#6 0x807350e in perl_call_handler (sv=0x82fc75c, r=0x8309eec, args=0x0)
at mod_perl.c:1677
#7 0x80729cd in perl_run_stacked_handlers (hook=0x80d0db9 "PerlHandler",
r=0x8309eec, handlers=0x82e9b64) at mod_perl.c:1396
#8 0x80701b4 in perl_handler (r=0x8309eec) at mod_perl.c:922
#9 0x809f409 in ap_invoke_handler (r=0x8309eec) at http_config.c:517
#10 0x80b3e8f in process_request_internal (r=0x8309eec) at http_request.c:1286
#11 0x80b3efa in ap_process_request (r=0x8309eec) at http_request.c:1302
#12 0x80aae60 in child_main (child_num_arg=0) at http_main.c:4205
#13 0x80ab0e8 in make_child (s=0x80eea54, slot=0, now=981621024)
at http_main.c:4364
#14 0x80ab19c in startup_children (number_to_start=3) at http_main.c:4391
#15 0x80ab80c in standalone_main (argc=1, argv=0xbffff9e4) at http_main.c:4679
#16 0x80ac03c in main (argc=1, argv=0xbffff9e4) at http_main.c:5006
#17 0x401bbcbe in _ _libc_start_main ( ) from /lib/libc.so.6
As before, we can see a complete trace of the last executed call. To
see the line the program hangs, we use curinfo
again:
Indeed, the program spends most of its time at line 9:
7 : while (1) {
8 : $i++;
9 : sleep 1;
10: }
Since while( ) and $i++ are
executed very fast, it's almost impossible to catch
Perl running either of these instructions.
21.7.5. mod_perl gdb Debug Macros
So far we have seen only the use of the
curinfogdb macro.
Let's explore a few more gdb
macros that come with the mod_perl source and might be handy during a
problem debug.
Remember that we are still stuck in the while(1)
loop, and that's when we are going to run the macros
(assuming of course that they were loaded as per our last example).
The longmess macro shows us the full Perl
backtrace of the current state:
(gdb) longmess
at /home/httpd/perl/hangme.pl line 9
Apache::ROOT::perl::hangme_2epl::handler
('Apache=SCALAR(0x82ec0ec)') called at
/usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
line 143
eval {...} called at
/usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
line 143
Apache::Registry::handler('Apache=SCALAR(0x82ec0ec)')
called at (eval 29) line 0
eval {...} called at (eval 29) line 0
So we can see that we are inside the
Apache::Registry handler, which was executed via
eval( ), and the program is currently executing
the code on line 9 in the script
/home/httpd/perl/hangme.pl. Internally the macro
uses Carp::longmess( ) to generate the trace. The
shortmess macro is similar to
longmess, but it prints only the top-level
caller's package, via Carp::shortmess(
):
(gdb) shortmess
at /usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
line 143
Don't search for shortmess( ) or
longmess( ) functions in the
Carp manpage—you won't find
them, as they aren't a part of the public API. The
caller macro prints the package that called the
last command:
In our example this is the
Apache::ROOT::perl::hangme_2epl package, which was
created on the fly by Apache::Registry.
Other macros allow you to look at the values of variables and will
probably require some level of Perl API knowledge. You may want to
refer to the perlxs,
perlguts and other related manpages before you
proceed with these.