Some of the earliest databases implemented on Unix were Database
Management (DBM) files, and many are
still in use today. As of this writing, the
Berkeley DB is the most powerful DBM implementation. Berkeley DB is
available at http://www.sleepycat.com/. If you need a
light database with an easy API, using simple key-value pairs to
store and manipulate a relatively small number of records, DBM is the
solution that you should consider first.
With DBM, it is rare to read the whole database into memory. Combine
this feature with the use of smart storage techniques, and DBM files
can be manipulated much faster than flat files. Flat-file databases
can be very slow when the number
of records starts to grow into the thousands, especially for insert,
update, and delete operations. Sort algorithms on flat files can also
be very time-consuming.
The maximum practical size of a DBM database depends on many factors,
such as your data, your hardware, and the desired response times. But
as a rough guide, consider 5,000 to 10,000 records to be reasonable.
We will talk mostly about Berkeley DB Version 1.x, as it provides the
best functionality while having good speed and almost no limitations.
Other implementations might be faster in some cases, but they are
limited either in the length of the maximum value or the total number
of records.
There are a number of Perl interfaces
to
the major DBM implementations, such as DB_File,
NDBM_File, ODBM_File,
GDBM_File, and SDBM_File. The
original Perl module for Berkeley DB was DB_File,
which was written to interface with Berkeley DB Version 1.85. The
newer Perl module for Berkeley DB is BerkeleyDB,
which was written to interface with Version 2.0 and subsequent
releases. Because Berkeley DB Version 2.x has a compatibility API for
Version 1.85, you can (and should) build DB_File
using Version 2.x of Berkeley DB, although DB_File
will still support only the 1.85 functionality.
Several different indexing algorithms (known also as
access methods) can be used with DBM
implementations:
The HASH access method gives
an
O(1) complexity (see sidebar) of search and update, fast insert, and
delete, but a slow sort (which you have to implement yourself).
HASH is used by almost all DBM implementations.
The BTREE access method allows arbitrary key/value
pairs to be stored in a sorted, balanced binary tree. This allows you
to get a sorted sequence of data pairs in O(1) (see sidebar), at the
expense of much slower insert, update, and delete operations than is
the case with HASH. BTREE is
available mostly in Berkeley DB.
The RECNO access method is more complicated, and
enables both fixed-length and variable-length flat text files to be
manipulated using the same key/value pair interface as in
HASH and BTREE. In this case
the key will consist of a record (line) number.
RECNO is available mostly in Berkeley DB.
The QUEUE access method stores fixed-length
records with logical record numbers as keys. It is designed for fast
inserts at the tail and has a special cursor-consume operation that
deletes and returns a record from the head of the queue. The
QUEUE access method uses record-level locking.
QUEUE is available only in Berkeley DB Version 3.0
and higher.
Big-O Notation
In math, complexity is expressed using big-O
notation. For a problem of size N:
A constant-time method is "order
1": O(1)
A linear-time method is "order N":
O(N)
A quadratic-time method is "order N
squared": O(N2)
For example, a lookup action in a properly implemented hash of size N
with random data has a complexity of O(1), because the item is
located almost immediately after its hash value is calculated.
However, the same action in the list of N items has a complexity of
O(N), since on average you have to go through almost all the items in
the list before you find what you need.
Most often you will want to use the HASH method,
but there are many considerations and your choice may be dictated by
your application.
In recent years, DBM databases have been extended to allow you to
store more complex values, including data structures. The
MLDBM module can store and restore the whole
symbol table of your script, including arrays and hashes.
It is important to note that you cannot simply switch a DBM file from
one storage algorithm to another. The only way to change the
algorithm is to copy all the records one by one into a new DBM file,
initialized according to a desired access method. You can use a
script like the one shown in Example 19-1.
Example 19-1. btree2hash.pl
#!/usr/bin/perl -w
#
# This script takes as its parameters a list of Berkeley DB
# file(s) which are stored with the DB_BTREE algorithm. It
# will back them up using the .bak extension and create
# instead DBMs with the same records but stored using the
# DB_HASH algorithm.
#
# Usage: btree2hash.pl filename(s)
use strict;
use DB_File;
use Fcntl;
# @ARGV checks
die "Usage: btree2hash.pl filename(s))\n" unless @ARGV;
for my $filename (@ARGV) {
die "Can't find $filename: $!"
unless -e $filename and -r _;
# First back up the file
rename "$filename", "$filename.btree"
or die "can't rename $filename with $filename.btree: $!";
# tie both DBs (db_hash is a fresh one!)
tie my %btree , 'DB_File',"$filename.btree", O_RDWR|O_CREAT,
0660, $DB_BTREE or die "Can't tie $filename.btree: $!";
tie my %hash , 'DB_File',"$filename" , O_RDWR|O_CREAT,
0660, $DB_HASH or die "Can't tie $filename: $!";
# copy DB
%hash = %btree;
# untie
untie %btree;
untie %hash;
}
Note that some DBM implementations come with other conversion
utilities as well.
19.1. mod_perl and DBM
Where does mod_perl fit into the picture? If you need read-only
access to a DBM file in your mod_perl code, the operation is much
faster if you keep the DBM file open (tied) all the time and
therefore ready to be used. We will see an example of this in a
moment. This will work with dynamic (read/write) database accesses as
well, but you need to use locking and data flushing to avoid data
corruption.
It's possible that a process will die, for various
reasons. There are a few consequences of this event.
If the program has been using
external file locking and the lock is based
on the existence of the lock file, the code might be aborted before
it has a chance to remove the file. Therefore, the next process that
tries to get a lock will wait indefinitely, since the lock file is
dead and no one can remove it without manual intervention. Until this
lock file is removed, services relying on this lock will stay
deactivated. The requests will queue up, and at some point the whole
service will become useless as all the processes wait for the lock
file. Therefore, this locking technique is not recommended. Instead,
an advisory flock( ) method
should be used. With this method, when a
process dies, the lock file will be unlocked by the operating system,
no matter what.
Another issue lies in the fact that
if the DBM files are
modified, they have to be properly closed to ensure the integrity of
the data in the database. This requires a flushing of the DBM
buffers, or just untying of the database. In case the
code flow is aborted before the database is flushed to disk, use
Perl's END block to handle the
unexpected situations, like so:
END { my_dbm_flush( ) }
Remember that under mod_perl, this will work on each request only for
END blocks declared in scripts running under
Apache::Registry and similar handlers. Other Perl
handlers need to use the $r->register_cleanup(
) method: