This chapter provides the foundations on which the rest of the book
builds. In this chapter, we give you:
1.1. A Brief History of CGI
When the World Wide Web was born, there was only one web server and one web
client. The httpd web server was developed by the European Organization
for Nuclear Research (CERN) in Geneva, Switzerland.
httpd has since become the generic name of the
binary executable of many web servers. When CERN stopped funding the
development of httpd, it was taken over by the
Software Development Group of the National Center for Supercomputing
Applications (NCSA). The NCSA also produced Mosaic, the first widely used
graphical web browser, whose developers later went on to write the
Netscape client. Mosaic could fetch and view static documents[2]
and images served by the httpd server. This
provided a far better means of disseminating information to large
numbers of people than sending each person an email. However, the
glut of online resources soon made search engines necessary, which
meant that users needed to be able to submit data (such as a search
string) and servers needed to process that data and return
appropriate content.
[2] A static document is one that exists in a constant state, such as a
text file that doesn't change.
Search engines were first
implemented by extending the web server, modifying its source code
directly. Rewriting the source was not very practical, however, so
the NCSA developed the Common Gateway Interface
(CGI) specification. CGI became a standard for interfacing external
applications with web servers and other information servers and
generating dynamic information.
A CGI program can be written in virtually any language that can read
from STDIN and write to STDOUT,
regardless of whether it is interpreted (e.g., the Unix shell),
compiled (e.g., C or C++), or a combination of both (e.g., Perl). The
first CGI programs were written in C and needed to be compiled into
binary executables. For this reason, the directory from which the
compiled CGI programs were executed was named
cgi-bin, and the source files directory was
named cgi-src. Nowadays most servers come with a
preconfigured directory for CGI programs called, as you have probably
guessed, cgi-bin.
1.1.1. The HTTP Protocol
Interaction between the browser and the server is governed by the
HyperText Transfer Protocol (HTTP), now an official Internet standard
developed by the Internet Engineering Task Force (IETF) in cooperation
with the World Wide Web Consortium (W3C). HTTP uses a simple
request/response model: the client establishes a TCP[3] connection to the
server and sends a request, the server sends a response, and the
connection is closed. Requests and responses take the form of messages.
A message is a simple sequence of text lines.
[3] TCP/IP is the low-level family of Internet protocols used to transmit
data, regardless of what the data is used for.
HTTP messages have two parts. First come the headers, which hold
descriptive information about the request or response. The
various types of headers and their possible content are fully
specified by the HTTP protocol. Headers are followed by a blank line,
then by the message body. The body is
the actual content of the message, such as an HTML page or a GIF
image. The HTTP protocol does not define the content of the body;
rather, specific headers are used to describe the content type and
its encoding. This enables new content types to be incorporated into
the Web without any fanfare.
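To make the message layout concrete, here is a minimal sketch that splits a raw response into its headers and body at the first blank line. The parse_http_message( ) function and the sample message are our own illustrations, not part of any library. The sketch assumes that the first line of the message is a status (or request) line and ignores details such as headers continued over several lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_http_message() is a hypothetical helper written for this sketch.
# It returns the first line, a hash of headers, and the body.
sub parse_http_message {
    my ($raw) = @_;

    # The header section ends at the first blank line. We accept both
    # CRLF and bare LF line endings.
    my ($head, $body) = split /\015?\012\015?\012/, $raw, 2;
    $body = '' unless defined $body;

    my @lines      = split /\015?\012/, $head;
    my $start_line = shift @lines;    # request line or status line

    my %headers;
    for my $line (@lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{ lc $name } = $value;    # header names are case-insensitive
    }
    return ($start_line, \%headers, $body);
}

# A made-up response, used only to exercise the function.
my $raw = "HTTP/1.0 200 OK\015\012"
        . "Content-Type: text/plain\015\012"
        . "Content-Length: 13\015\012"
        . "\015\012"
        . "Hello world!\n";

my ($status, $headers, $body) = parse_http_message($raw);
print "Status line:  $status\n";
print "Content type: $headers->{'content-type'}\n";
print "Body:         $body";
```

Because the body is described only by the headers, the same function works whether the body is plain text, HTML, or image data.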
HTTP is a stateless protocol. This means that the protocol itself does
not relate one request to another; each is handled independently. This
makes life simple for CGI programs: they need worry about only the
current request.
1.1.2. The Common Gateway Interface Specification
If you are new to the CGI world, there's no need to
worry—basic CGI programming is very easy. Ninety percent of
CGI-specific code is concerned with reading data submitted by a user
through an HTML form, processing it, and returning some response,
usually as an HTML document.
In this section, we will show you how easy basic CGI programming is,
rather than trying to teach you the entire CGI specification. There
are many books and online tutorials that cover CGI in
great detail (see http://hoohoo.ncsa.uiuc.edu/). Our aim is to
demonstrate that if you know Perl, you can start writing CGI scripts
almost immediately. You need to learn only two things: how to accept
data and how to generate output.
The HTTP protocol lets clients and servers understand each other by
describing the information they exchange in headers, where each header
is a key-value pair. When you submit a form, the
CGI program looks for the headers that contain the input information,
processes the received data (e.g., queries a database for the
keywords supplied through the form), and—when it is ready to
return a response to the client—sends a special header that
tells the client what kind of information it should expect, followed
by the information itself. The server can send additional headers,
but these are optional. Figure 1-1 depicts a
typical request-response cycle.
Figure 1-1. Request-response cycle
Sometimes CGI programs can generate a response without needing any
input data from the client. For example, a news service may respond
with the latest stories without asking for any input from the client.
But if you want stories for a specific day, you have to tell the
script which day's stories you want. Hence, the
script will need to retrieve some input from you.
To get your feet wet with CGI scripts, let's look at the classic
"Hello world" script for CGI, shown in Example 1-1.
Example 1-1. "Hello world" script
#!/usr/bin/perl -Tw
print "Content-type: text/plain\n\n";
print "Hello world!\n";
We start by sending a Content-type header, which tells the client that
the data that follows is of plain-text type. text/plain is a Multipurpose
Internet Mail Extensions (MIME) type. You can find a list of widely
used MIME types in the
mime.types file, which is usually located in the
directory where your web server's configuration
files are stored.[4] Other examples of MIME types are
text/html (text in HTML format) and
video/mpeg (an MPEG stream).
[4] For more information about Internet
media types, refer to RFCs 2045, 2046, 2047, 2048, and 2077,
accessible from http://www.rfc-editor.org/.
According to the HTTP protocol, an empty line must be sent after all
headers have been sent. This empty line indicates that the actual
response data will start at the next line.[5]
[5] The protocol specifies the end of a line as the character sequence
Ctrl-M and Ctrl-J (carriage return and newline). On Unix and Windows
systems, this sequence is expressed in a Perl string as \015\012, but
Apache also honors \n, which we will use throughout this book. On EBCDIC
machines, an explicit \r\n should be used instead.
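If you prefer to emit the exact line endings the protocol specifies rather than rely on the server's tolerance of \n, a small helper can do it. The following is only a sketch: cgi_header( ) is a hypothetical function of our own, not part of Perl or Apache.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# cgi_header() is a hypothetical helper written for this sketch. It ends
# every header line with CRLF (\015\012) and appends the empty line that
# separates the headers from the body.
sub cgi_header {
    my (%fields) = @_;
    my $crlf = "\015\012";
    my $out  = '';
    for my $name (sort keys %fields) {
        $out .= "$name: $fields{$name}$crlf";
    }
    return $out . $crlf;    # the empty line ends the header section
}

binmode STDOUT;    # avoid newline translation on platforms that perform it
print cgi_header('Content-type' => 'text/plain');
print "Hello world!\n";
```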
Now save the code in hello.pl, put it into a
cgi-bin directory on your server, make the
script executable, and test the script by pointing your favorite
browser to:
http://localhost/cgi-bin/hello.pl
It should display the same output as Figure 1-2.
Figure 1-2. Hello world
A more complicated script involves parsing input data. There are a
few ways to pass data to the scripts, but the most commonly used are
the GET and POST methods.
Let's write a script that expects as input the
user's name and prints this name in its response.
We'll use the GET method, which passes data in the request URI
(Uniform Resource Identifier):
http://localhost/cgi-bin/hello.pl?username=Doug
When the server accepts this request, it knows to split the URI into
two parts: a path to the script
(http://localhost/cgi-bin/hello.pl) and the "data" part
(username=Doug, called the QUERY_STRING). All we have to do is parse
the data portion of the URI and extract the key username and value
Doug. The GET method
is used mostly for hardcoded queries, where no interactive input is
needed. Assuming that portions of your site are dynamically
generated, your site's menu might include the
following HTML code:
<a href="/cgi-bin/display.pl?section=news">News</a><br>
<a href="/cgi-bin/display.pl?section=stories">Stories</a><br>
<a href="/cgi-bin/display.pl?section=links">Links</a><br>
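To show what parsing the data portion of the URI involves, here is a minimal hand-rolled sketch. parse_query( ) is a hypothetical helper of our own. It assumes the usual URL encoding, in which + stands for a space and %XX for an arbitrary byte, and it keeps only the last value when a key is repeated. In real scripts a library should do this work for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_query() is a hypothetical helper written for this sketch.
# It turns a query string into a reference to a hash of key/value pairs.
sub parse_query {
    my ($query) = @_;
    my %param;
    return \%param unless defined $query && length $query;

    # Pairs are separated by & (some software uses ; instead).
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($key, $value) {
            tr/+/ /;                                # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # decode %XX escapes
        }
        $param{$key} = $value;    # the last value wins for repeated keys
    }
    return \%param;
}

# Under a web server the string would come from $ENV{QUERY_STRING};
# a literal is used here so that the sketch runs from the command line.
my $params = parse_query('username=Doug');
print "Hello $params->{username}!\n";
```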
Another approach is to use an HTML form, where the user fills in
some parameters. The HTML form for the "Hello
user" script that we will look at in this section
can be either:
<form action="/cgi-bin/hello_user.pl" method="POST">
<input type="text" name="username">
<input type="submit">
</form>
or:
<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="username">
<input type="submit">
</form>
Note that you can use either the GET or POST method in an HTML form.
However, POST should be used when the query has side effects, such as
changing a record in a database, while GET should be used in simple
queries like this one (simple URL links are GET requests).[6]
[6] See Axioms of Web Architecture at
http://www.w3.org/DesignIssues/Axioms.html#state.
Formerly, reading input data required different code, depending on
the method used to submit the data. We can now use Perl modules that
do all the work for us. The most widely used CGI library is the CGI.pm
module, written by Lincoln Stein, which was bundled with the Perl
distribution through Perl 5.20 and is available from CPAN for later
versions. Along with parsing input data, it provides an easy API
to generate the HTML response.
Our sample "Hello user" script is shown in Example 1-2.
Example 1-2. "Hello user" script
#!/usr/bin/perl
use CGI qw(:standard);
my $username = param('username') || "unknown";
print "Content-type: text/plain\n\n";
print "Hello $username!\n";
Notice that this script is only slightly different from the previous
one. We've pulled in the CGI.pm
module, importing a group of functions called
:standard. We then used its param( ) function to retrieve the value of
the username key. This call will return the name
submitted by any of the three ways described above (a form using
either POST, GET, or a
hardcoded name with GET; the last two are
essentially the same). If no value was supplied in the request,
param( ) returns undef.
my $username = param('username') || "unknown";
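$username will contain either the submitted username or the string
"unknown" if no value was submitted. Because || tests for truth rather
than definedness, an empty value or the value 0 is also replaced by
"unknown". The rest of the script is unchanged: we send the MIME
header and print the "Hello $username!" string.[7]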
[7] All scripts shown here generate plain text,
not HTML. If you generate HTML output, you have to protect the
incoming data from cross-site scripting. For more information, refer
to the CERT advisory at http://www.cert.org/advisories/CA-2000-02.html.
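As a rough illustration of the protection the footnote refers to, the following sketch escapes the characters that are special in HTML before echoing user input. escape_html( ) is a hypothetical helper of our own, and the hostile value is made up. CGI.pm offers an escapeHTML( ) function for the same purpose.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# escape_html() is a hypothetical helper written for this sketch.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&#39;/g;
    return $text;
}

# Made-up hostile input; a real script would get it from the request.
my $username = '<script>alert("x")</script>';

print "Content-type: text/html\n\n";
print "<p>Hello ", escape_html($username), "!</p>\n";
```

With the escaping in place, the browser displays the submitted text literally instead of running it as a script.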
As we've just mentioned, CGI.pm
can help us with output generation as well. We can
use it to generate MIME headers by rewriting the
original script as shown in Example 1-3.
Example 1-3. "Hello user" script using CGI.pm
#!/usr/bin/perl
use CGI qw(:standard);
my $username = param('username') || "unknown";
print header("text/plain");
print "Hello $username!\n";
To help you learn how CGI.pm copes with more than
one parameter, consider the code in Example 1-4.
Example 1-4. CGI.pm and param( ) method
#!/usr/bin/perl
use CGI qw(:standard);
print header("text/plain");
print "The passed parameters were:\n";
for my $key ( param( ) ) {
    print "$key => ", param($key), "\n";
}
Now issue the following request:
http://localhost/cgi-bin/hello_user.pl?a=foo&b=bar&c=foobar
Separating key=value Pairs
Note that & or ; usually is used to separate the key=value pairs. The
former is less preferable, because if you end up with a QUERY_STRING of
this format:
id=foo&reg=bar
some browsers will interpret &reg as an SGML entity and encode it as
&reg;. This will result in a corrupted QUERY_STRING:
id=foo®=bar
You have to encode & as &amp; if it is included in HTML. You
don't have this problem if you use ; as a separator:
id=foo;reg=bar
Both separators are supported by CGI.pm,
Apache::Request, and mod_perl's
args( ) method, which we will use in the examples
to retrieve the request parameters.
Of course, the code that builds QUERY_STRING has
to ensure that the values don't include the chosen
separator and encode it if it is used. (See RFC2854 for more
details.)
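The following sketch shows one way the code that builds a QUERY_STRING might encode the separator. build_query( ) is a hypothetical helper of our own. It assumes byte-oriented keys and values, and escapes every character other than letters, digits, and the characters _ . ~ and -.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_query() is a hypothetical helper written for this sketch.
# The first argument is the separator; the rest are key/value pairs.
sub build_query {
    my ($sep, @pairs) = @_;
    my @parts;
    while (my ($key, $value) = splice @pairs, 0, 2) {
        for ($key, $value) {
            # Escape everything except unreserved characters, so that
            # '&', ';', and '=' inside a key or value become %XX.
            s/([^A-Za-z0-9_.~-])/sprintf '%%%02X', ord $1/ge;
        }
        push @parts, "$key=$value";
    }
    return join $sep, @parts;
}

print build_query(';', id => 'foo', reg => 'bar'), "\n";   # id=foo;reg=bar
print build_query(';', q  => 'a;b&c'), "\n";               # q=a%3Bb%26c
```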
The browser will display:
The passed parameters were:
a => foo
b => bar
c => foobar
Now generate this form:
<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="firstname">
<input type="text" name="lastname">
<input type="submit">
</form>
If we fill in only the firstname field with the
value Doug, the browser will display:
The passed parameters were:
firstname => Doug
lastname =>
If in addition the lastname field is
MacEachern, you will see:
The passed parameters were:
firstname => Doug
lastname => MacEachern
These are just a few of the many functions CGI.pm
offers. Read its manpage for detailed information by typing
perldoc CGI at your command prompt.
We used this long CGI.pm example to demonstrate
how simple basic CGI is. You shouldn't reinvent the
wheel; use standard tools when writing your own scripts, and you will
save a lot of time. Just as with Perl, you can start creating really
cool and powerful code from the very beginning, gaining more advanced
knowledge over time. There is much more to know about the CGI
specification, and you will learn about some of its advanced features
in the course of your web development practice. We will cover the
most commonly used features in this book.
For now, let CGI.pm or an equivalent library
handle the intricacies of the CGI specification, and concentrate your
efforts on the core functionality of your code.
1.1.3. Apache CGI Handling with mod_cgi
The Apache server processes CGI scripts via an Apache module called
mod_cgi. (See later
in this chapter for more information on request-processing phases and
Apache modules.) mod_cgi is built by default with the Apache core,
and the installation procedure also preconfigures a
cgi-bin directory and populates it with a few
sample CGI scripts. Write your script, move it into the
cgi-bin directory, make it readable and
executable by the web server, and you can start using it right away.
Should you wish to alter the default configuration, there are only a
few configuration directives that you might want to modify. First, the
ScriptAlias directive:
ScriptAlias /cgi-bin/ /home/httpd/cgi-bin/
ScriptAlias controls which directories contain
server scripts. Scripts are run by the server when requested, rather
than sent as documents.
When a request is received with a path that starts with
/cgi-bin, the server searches for the file in
the /home/httpd/cgi-bin directory. It then runs
the file as an executable program, returning to the client the
generated output, not the source listing of the file.
The other important part of httpd.conf specifies
how the files in cgi-bin should be treated:
<Directory /home/httpd/cgi-bin>
Options FollowSymLinks
Order allow,deny
Allow from all
</Directory>
The above setting allows the use of symbolic links in the
/home/httpd/cgi-bin directory. It also allows
anyone to access the scripts from anywhere.
mod_cgi provides access to various server parameters through
environment variables. The script in Example 1-5
will print these environment variables.
Example 1-5. Checking environment variables
#!/usr/bin/perl
print "Content-type: text/plain\n\n";
for (keys %ENV) {
    print "$_ => $ENV{$_}\n";
}
Save this script as env.pl in the directory
cgi-bin and make it executable and readable by
the server (that is, by the username under which the server runs).
Point your browser to
http://localhost/cgi-bin/env.pl and you will see
a list of parameters similar to this one:
SERVER_SOFTWARE => Server: Apache/1.3.24 (Unix) mod_perl/1.26
mod_ssl/2.8.8 OpenSSL/0.9.6
GATEWAY_INTERFACE => CGI/1.1
DOCUMENT_ROOT => /home/httpd/docs
REMOTE_ADDR => 127.0.0.1
SERVER_PROTOCOL => HTTP/1.0
REQUEST_METHOD => GET
QUERY_STRING =>
HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0
SERVER_ADDR => 127.0.0.1
SCRIPT_NAME => /cgi-bin/env.pl
SCRIPT_FILENAME => /home/httpd/cgi-bin/env.pl
Your code can access any of these variables with
$ENV{"somekey"}. However, some variables can be
spoofed by the client side, so you should be careful if you rely on
them for handling sensitive information. Let's look
at some of these environment variables.
SERVER_SOFTWARE => Server: Apache/1.3.24 (Unix) mod_perl/1.26
mod_ssl/2.8.8 OpenSSL/0.9.6
The SERVER_SOFTWARE variable tells us what components are compiled into
the server, and their version numbers. In this example, we used Apache
1.3.24, mod_perl 1.26, mod_ssl 2.8.8, and OpenSSL 0.9.6.
GATEWAY_INTERFACE => CGI/1.1
The GATEWAY_INTERFACE variable is very important; in this example, it
tells us that the script is running under mod_cgi. When running under
mod_perl, this value changes to CGI-Perl/1.1.
REMOTE_ADDR => 127.0.0.1
The REMOTE_ADDR
variable tells us the remote address of the client. In this example,
both client and server were running on the same machine, so the
client is localhost (whose IP is
127.0.0.1).
SERVER_PROTOCOL => HTTP/1.0
The SERVER_PROTOCOL variable reports the HTTP protocol version upon
which the client and
the server have agreed. Part of the communication between the client
and the server is a negotiation of which version of the HTTP protocol
to use. The highest version the two can understand will be chosen as
a result of this negotiation.
REQUEST_METHOD => GET
The now-familiar REQUEST_METHOD variable tells us which request method
was used (GET, in this case).
QUERY_STRING =>
The QUERY_STRING variable is also very important. It is used to pass
the query
parameters when using the GET method.
QUERY_STRING is empty in this example, because we
didn't pass any parameters.
HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0
The HTTP_USER_AGENT variable contains the user agent specifications. In
this example, we are using Galeon on Linux. Note that this variable is
very easily spoofed.
Spoofing HTTP_USER_AGENT
If the client is a custom program rather than a widely
used browser, it can mimic its bigger brother's
signature. Here is an example of a very simple client using the
LWP library:
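#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
# Create a user agent that announces itself as Galeon on Linux
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0");
# Build the request and hand it to the user agent for processing
my $req = HTTP::Request->new(GET => 'http://localhost/cgi-bin/env.pl');
my $res = $ua->request($req);
print $res->content if $res->is_success;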
This script first creates an instance of a user agent, with a
signature identical to Galeon's on Linux. It then
creates a request object, which is passed to the user agent for
processing. The response content is received and printed.
When run from the command line, the output of this script is
strikingly similar to what we obtained with the browser. It notably
prints:
HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0
So you can see how easy it is to fool a naïve CGI
programmer into thinking we've used Galeon as our
client program.
SERVER_ADDR => 127.0.0.1
SCRIPT_NAME => /cgi-bin/env.pl
SCRIPT_FILENAME => /home/httpd/cgi-bin/env.pl
The SERVER_ADDR, SCRIPT_NAME, and SCRIPT_FILENAME variables tell us
(respectively) the server address, the name of the script as provided in
the request URI, and the real path to the script on the filesystem.
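One practical way to remember which variables are easiest to spoof is to separate those that the server copies from request headers. The sketch below assumes the CGI convention that such variables carry an HTTP_ prefix. split_env( ) is a hypothetical helper of our own, and the sample values are taken from the listing above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_env() is a hypothetical helper written for this sketch.
# Assumption: variables whose names start with HTTP_ are copied by the
# server from request headers, so the client fully controls their values.
sub split_env {
    my ($env) = @_;
    my (%from_headers, %rest);
    for my $name (keys %$env) {
        if ($name =~ /^HTTP_/) {
            $from_headers{$name} = $env->{$name};
        }
        else {
            $rest{$name} = $env->{$name};
        }
    }
    return (\%from_headers, \%rest);
}

# A fixed sample is used so that the sketch runs outside a web server;
# in a CGI script you would pass \%ENV instead.
my %sample = (
    HTTP_USER_AGENT => 'Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0',
    REQUEST_METHOD  => 'GET',
    SCRIPT_NAME     => '/cgi-bin/env.pl',
);

my ($from_headers, $rest) = split_env(\%sample);
print "Copied from request headers (treat as untrusted):\n";
print "  $_ => $from_headers->{$_}\n" for sort keys %$from_headers;
print "Other variables:\n";
print "  $_ => $rest->{$_}\n" for sort keys %$rest;
```

The second group is not automatically trustworthy: QUERY_STRING, for example, also carries data chosen by the client.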
Now let's get back to the
QUERY_STRING parameter. If we submit a new request for
http://localhost/cgi-bin/env.pl?foo=ok&bar=not_ok,
the new value of the query string is displayed:
QUERY_STRING => foo=ok&bar=not_ok
This is the variable used by CGI.pm and other
modules to extract the input data.
Keep in mind that the query string has a limited size. Although the
HTTP protocol itself does not place a limit on the length of a
URI, most server and client software does. By default, Apache limits
the request line, which contains the entire URI, to about 8K (8190
bytes, adjustable with the LimitRequestLine directive). Some older
client or proxy implementations do not
properly support URIs larger than 255 characters. This is true for
some new clients as well—for example, some WAP phones have
similar limitations.
Larger chunks of information, such as complex forms, are passed to
the script using the POST method. Your CGI script should check
the REQUEST_METHOD environment variable, which is
set to POST when a request is submitted with the
POST method. The script can retrieve all submitted
data from the STDIN stream. But again, let
CGI.pm or similar modules handle this process for
you; whatever the request method, you won't have to
worry about it because the key/value parameter pairs will always be
handled in the right way.
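For the curious, here is a rough sketch of what such a module does for a POST request. read_post_body( ) is a hypothetical helper of our own. It assumes that the server sets the CONTENT_LENGTH environment variable to the number of bytes in the request body, as the CGI specification describes. To keep the sketch runnable from the command line, it reads from an in-memory filehandle rather than the real STDIN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_post_body() is a hypothetical helper written for this sketch.
# It takes a reference to the environment hash and a filehandle, and
# returns the raw request body ('' if the request is not a POST).
sub read_post_body {
    my ($env, $fh) = @_;
    return '' unless ($env->{REQUEST_METHOD} || '') eq 'POST';

    # Assumption: CONTENT_LENGTH holds the size of the body in bytes.
    my $length = $env->{CONTENT_LENGTH} || 0;
    my $body   = '';
    while (length($body) < $length) {
        my $got = read($fh, $body, $length - length($body), length($body));
        last unless $got;    # end of input or error: keep what we have
    }
    return $body;
}

# Simulate a submitted form; a CGI script would call
# read_post_body(\%ENV, \*STDIN) instead.
my $posted = 'username=Doug';
open my $fake_stdin, '<', \$posted
    or die "Cannot open in-memory filehandle: $!";
my %env = (REQUEST_METHOD => 'POST', CONTENT_LENGTH => length $posted);

print "Raw POST data: ", read_post_body(\%env, $fake_stdin), "\n";
```

The returned string has the same key=value format as QUERY_STRING, so the same parsing code applies to both request methods.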