12.7. Apache's mod_proxy Module

Apache's mod_proxy module implements a proxy and cache for Apache. It implements proxying capabilities for the following protocols: FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.

mod_proxy is part of Apache, so there is no need to install a separate server—you just have to enable this module during the Apache build process or, if you have Apache compiled as a DSO, you can compile and add this module after you have completed the build of Apache.

A setup with a mod_proxy-enabled server and a mod_perl-enabled server is depicted in Figure 12-6.

Figure 12-6. mod_proxy-enabled Apache and mod_perl-enabled Apache

We do not think the difference in speed between Apache's mod_proxy and Squid is relevant for most sites, since the real value of what they do is buffering for slow client connections. However, Squid runs as a single process and probably consumes fewer system resources.

The trade-off is that mod_rewrite is easy to use if you want to spread parts of the site across different backend servers, while mod_proxy knows how to fix up redirects containing the backend server's idea of the location. With Squid you can run a redirector process to proxy to more than one backend, but there is a problem in fixing redirects in a way that keeps the client's view of both server names and port numbers in all cases.

The difficult case is where you have DNS aliases that map to the same IP address, you want them redirected to port 80 (although the server is on a different port), and you want to keep the specific name the browser has already sent so that it does not change in the client's browser's location window.

The advantages of mod_proxy are:

No additional server is needed. We keep one plain Apache server plus one mod_perl-enabled Apache server. All you need to do is enable mod_proxy in the httpd_docs server and add a few lines to its httpd.conf file:

ProxyPass        /perl/ https://localhost:81/perl/
ProxyPassReverse /perl/ https://localhost:81/perl/

The ProxyPass directive triggers the proxying process. A request for https://example.com/perl/ is proxied by issuing a request for https://localhost:81/perl/ to the mod_perl server. mod_proxy then sends the response to the client. The URL rewriting is transparent to the client, except in one case: if the mod_perl server issues a redirect, the URL to redirect to will be specified in a Location header in the response. This is where ProxyPassReverse kicks in: it scans Location headers from the responses it gets from proxied requests and rewrites the URL before forwarding the response to the client.

It buffers mod_perl output like Squid does.

It does caching, although you have to produce correct Content-Length, Last-Modified, and Expires HTTP headers for it to work. If some of your dynamic content does not change frequently, you can dramatically increase performance by caching it with mod_proxy.

ProxyPass happens before the authentication phase, so you do not have to worry about authenticating twice.

Apache is able to accelerate secure HTTP requests completely, while also doing accelerated HTTP. With Squid you have to use an external redirection program for that.

The latest mod_proxy module (for Apache 1.3.6 and later) is reported to be very stable.

12.7.1. Concepts and Configuration Directives

In the following explanation, we will use www.example.com as the main server users access when they want to get some kind of service and backend.example.com as the machine that does the heavy work. The main and backend servers are different; they may or may not coexist on the same machine.

We'll use the mod_proxy module built into the main server to handle requests to www.example.com. For the sake of this discussion it doesn't matter what functionality is built into the backend.example.com server—obviously it'll be mod_perl for most of us, but this technique can be successfully applied to other web programming languages (PHP, Java, etc.).

12.7.1.1. ProxyPass

You can use the ProxyPass configuration directive to map remote hosts into the URL space of the local server; the local server does not act as a proxy in the conventional sense, but appears to be a mirror of the remote server.

Let's explore what this rule does:

ProxyPass   /perl/ https://backend.example.com/perl/

When a user initiates a request to https://www.example.com/perl/foo.pl, the request is picked up by mod_proxy. It issues a request for https://backend.example.com/perl/foo.pl and forwards the response to the client. This reverse proxy process is mostly transparent to the client, as long as the response data does not contain absolute URLs.
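The mapping performed by this rule can be sketched in a few lines of Python. This is only an illustration of the prefix substitution described above, not mod_proxy's actual implementation; the function name proxy_pass and its parameters are hypothetical.

```python
from typing import Optional


def proxy_pass(request_path: str, local_prefix: str, remote_base: str) -> Optional[str]:
    """Return the backend URL for request_path, or None if the rule does not match."""
    if not request_path.startswith(local_prefix):
        # The rule does not cover this path, so the request is not proxied.
        return None
    # Swap the local prefix for the remote base and keep the rest of the path.
    return remote_base + request_path[len(local_prefix):]


backend_url = proxy_pass("/perl/foo.pl", "/perl/", "https://backend.example.com/perl/")
print(backend_url)
```

A path outside /perl/ (for example, /images/logo.gif) makes the sketch return None, which corresponds to the frontend server handling the request itself.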

One such situation occurs when the backend server issues a redirect. The URL to redirect to is provided in a Location header in the response. The backend server will use its own ServerName and Port to build the URL to redirect to. For example, when a request for https://www.example.com/somedir (without the trailing slash) is proxied, mod_dir on the backend server will redirect it to https://backend.example.com/somedir/ by issuing a redirect with the following header:

Location: https://backend.example.com/somedir/

Since ProxyPass forwards the response unchanged to the client, the user will see https://backend.example.com/somedir/ in her browser's location window, instead of https://www.example.com/somedir/.

You have probably noticed many examples of this from real-life web sites you've visited. Free email service providers and other similar heavy online services display the login or the main page from their main server, and then when you log in you see something like x11.example.com, then w59.example.com, etc. These are the backend servers that do the actual work.

Obviously this is not an ideal solution, but since users don't usually care about what they see in the location window, you can sometimes get away with this approach. In the following section we show a better solution that solves this issue and provides additional useful functionality.

12.7.1.2. ProxyPassReverse

This directive lets Apache adjust the URL in the Location header on HTTP redirect responses. This is essential when Apache is used as a reverse proxy to avoid bypassing the reverse proxy because of HTTP redirects on the backend servers. It is generally used in conjunction with the ProxyPass directive to build a complete frontend proxy server.

ProxyPass          /perl/  https://backend.example.com/perl/
ProxyPassReverse   /perl/  https://backend.example.com/perl/

When a user initiates a request to https://www.example.com/perl/foo, the request is proxied to https://backend.example.com/perl/foo. Let's say the backend server responds by issuing a redirect for https://backend.example.com/perl/foo/ (adding a trailing slash). The response will include a Location header:

Location: https://backend.example.com/perl/foo/

ProxyPassReverse on the frontend server will rewrite this header to:

Location: https://www.example.com/perl/foo/

This happens completely transparently. The end user is never aware of the URL rewrites happening behind the scenes.
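The header rewrite can be pictured with the following Python sketch. It assumes a simple prefix comparison on the Location value, which is our reading of the behavior described above rather than a copy of mod_proxy's code; the function name proxy_pass_reverse is hypothetical.

```python
def proxy_pass_reverse(location: str, local_base: str, remote_base: str) -> str:
    """Rewrite a backend Location header value so that it points at the frontend."""
    if location.startswith(remote_base):
        # Replace the backend's base URL with the frontend's one.
        return local_base + location[len(remote_base):]
    # Redirects to unrelated sites are passed through untouched.
    return location


rewritten = proxy_pass_reverse(
    "https://backend.example.com/perl/foo/",
    "https://www.example.com/perl/",
    "https://backend.example.com/perl/",
)
print(rewritten)
```

In this sketch a redirect to some other host does not start with the backend's base URL, so it reaches the client unchanged.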

Note that this ProxyPassReverse directive can also be used in conjunction with the proxy pass-through feature of mod_rewrite, described later in this chapter.

12.7.1.3. Security issues

Whenever you use mod_proxy you need to make sure that your server will not become an open proxy for free riders. The ProxyRequests directive controls whether clients are allowed to issue proxy requests. Its default setting is Off, which means proxy requests are handled only if generated internally (by ProxyPass or RewriteRule...[P] directives). Leave it at that default on your reverse proxy servers; do not set ProxyRequests On.
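One way to double-check a configuration is to scan it for a line that switches ProxyRequests on. The Python sketch below is a rough helper under stated assumptions: it looks at a single configuration text only, ignores Include files and line continuations, and the function name is hypothetical.

```python
from typing import List


def proxy_requests_on_lines(config_text: str) -> List[int]:
    """Return the 1-based numbers of lines that set ProxyRequests to On."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        words = line.split()
        # Directive names are case-insensitive; commented-out lines start
        # with '#', so their first word never equals the directive name.
        if len(words) >= 2 and words[0].lower() == "proxyrequests" and words[1].lower() == "on":
            hits.append(number)
    return hits


sample = """\
ProxyPass        /perl/ https://localhost:81/perl/
ProxyPassReverse /perl/ https://localhost:81/perl/
# ProxyRequests On
ProxyRequests On
"""
print(proxy_requests_on_lines(sample))
```

An empty result for your real httpd.conf does not prove the server is safe, but a non-empty one on a reverse proxy deserves a second look.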

12.7.2. Knowing the Proxypassed Connection Type

Let's say that you have a frontend server running mod_ssl, mod_rewrite, and mod_proxy. You want to make sure that your user is using a secure connection for some specific actions, such as login information submission. You don't want to let the user log in unless the request was submitted through a secure port.

Since you have to proxypass the request between the frontend and backend servers, you cannot know where the connection originated. The HTTP headers cannot reliably provide this information.

A possible solution for this problem is to have the mod_perl server listen on two different ports (e.g., 8000 and 8001) and have the mod_rewrite proxy rule in the regular server redirect to port 8000 and the mod_rewrite proxy rule in the SSL virtual host redirect to port 8001. Under the mod_perl server, use $r->connection->port or the environment variable PORT to tell if the connection is secure.
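The decision the backend has to make can be reduced to a port lookup. The Python sketch below only illustrates that logic (under mod_perl the port would come from $r->connection->port, as mentioned above); the port numbers follow the example in the text and the function name is hypothetical.

```python
# Port assignment taken from the example above.
PLAIN_PORT = 8000   # target of the proxy rule in the regular server
SECURE_PORT = 8001  # target of the proxy rule in the SSL virtual host


def arrived_over_ssl(local_port: int) -> bool:
    """Tell whether a proxied request entered the frontend through SSL."""
    if local_port == SECURE_PORT:
        return True
    if local_port == PLAIN_PORT:
        return False
    # Any other port means the setup is not the one this sketch assumes.
    raise ValueError(f"unexpected backend port: {local_port}")


print(arrived_over_ssl(8001))
print(arrived_over_ssl(8000))
```

The check is only as trustworthy as the network setup: it assumes that nothing other than the SSL virtual host can reach the secure port.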

12.7.3. Buffering Feature

In addition to correcting the URI on its way back from the backend server, mod_proxy, like Squid, also provides buffering services that benefit mod_perl and similar heavy modules. The buffering feature allows mod_perl to pass the generated data to mod_proxy and move on to serve new requests, instead of waiting for a possibly slow client to receive all the data.

Figure 12-7 depicts this feature.

Figure 12-7. mod_proxy buffering

mod_perl streams the generated response into the kernel send buffer, which in turn goes into the kernel receive buffer of mod_proxy via the TCP/IP connection. mod_proxy then streams the data into its own kernel send buffer, and the data goes to the client over the TCP/IP connection. There are four buffers between mod_perl and the client: two kernel send buffers, one receive buffer, and finally the mod_proxy user space buffer. Each of those buffers will take the data from the previous stage, as long as the buffer is not full. Now it's clear that in order to immediately release the mod_perl process, the generated response should fit into these four buffers.

If the data doesn't fit immediately into all buffers, mod_perl will wait until the first kernel buffer is emptied partially or completely (depending on the OS implementation) and then place more data into it. mod_perl will repeat this process until the last byte has been placed into the buffer.

The kernel's receive buffers (recvbuf) and send buffers (sendbuf) are used for different things: the receive buffers are for TCP data that hasn't been read by the application yet, and the send buffers are for application data that hasn't been sent over the network yet. The kernel buffers actually seem smaller than their declared size, because not everything goes to actual TCP/IP data. For example, if the size of the buffer is 64 KB, only about 55 KB or so can actually be used for data. Of course, the overhead varies from OS to OS.

It might not be a very good idea to increase the kernel's receive buffer too much, because you could just as easily increase mod_proxy's user space buffer size and get the same effect in terms of buffering capacity. Kernel memory is pinned (not swappable), so it's harder on the system to use a lot of it.

The user space buffer size for mod_proxy seems to be fixed at 8 KB, but changing it is just a matter of replacing HUGE_STRING_LEN with something else in src/modules/proxy/proxy_http.c under the Apache source distribution.
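To get a feeling for how much data the whole chain can absorb, here is a back-of-the-envelope calculation in Python. It is a sketch under loud assumptions: every kernel buffer is declared as 64 KB, the usable share is the "about 55 KB out of 64 KB" figure quoted above (it varies from OS to OS), and the user space buffer is the 8 KB mentioned above. All function names are hypothetical.

```python
KB = 1024

# Assumption: usable share of a kernel socket buffer, taken from the
# "64 KB declared, about 55 KB usable" example. Real overhead differs per OS.
USABLE_FRACTION = 55 / 64


def usable_bytes(declared_bytes: int) -> int:
    """Estimate how much payload a kernel buffer of the declared size can hold."""
    return int(declared_bytes * USABLE_FRACTION)


def chain_capacity(backend_sendbuf: int, proxy_recvbuf: int,
                   proxy_userbuf: int, proxy_sendbuf: int) -> int:
    """Sum the estimated capacity of the four buffers between mod_perl and the client."""
    return (usable_bytes(backend_sendbuf) + usable_bytes(proxy_recvbuf)
            + proxy_userbuf + usable_bytes(proxy_sendbuf))


def released_immediately(response_bytes: int, capacity: int) -> bool:
    """The backend is freed at once only if the whole response fits in the chain."""
    return response_bytes <= capacity


capacity = chain_capacity(64 * KB, 64 * KB, 8 * KB, 64 * KB)
print(capacity // KB, "KB of buffering")
print(released_immediately(100 * KB, capacity))
print(released_immediately(200 * KB, capacity))
```

With these assumed sizes, a 100 KB response fits and a 200 KB response does not, so the second one would keep the mod_perl process waiting on the client.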

mod_proxy's receive buffer is configurable by the ProxyReceiveBufferSize parameter. For example:

ProxyReceiveBufferSize 16384

will create a buffer 16 KB in size. ProxyReceiveBufferSize must be bigger than or equal to 512 bytes. If it's not set or is set to 0, the system default will be used. The number it's set to should be an integral multiple of 512. ProxyReceiveBufferSize cannot be bigger than the kernel receive buffer size; if you set the value of ProxyReceiveBufferSize larger than this size, the default value will be used (a warning will be printed in this case by mod_proxy).
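The rules in the previous paragraph can be summarized as a small decision function. The Python sketch below models only what the text states; it is not mod_proxy's code, and the function names and the sample kernel limit and default are hypothetical.

```python
def effective_receive_buffer(requested: int, kernel_max: int, system_default: int) -> int:
    """Return the buffer size that ends up in effect for a requested value."""
    if requested == 0:
        # Unset or 0: the system default is used.
        return system_default
    if requested < 512:
        raise ValueError("ProxyReceiveBufferSize must be 0 or at least 512 bytes")
    if requested > kernel_max:
        # Too large for the kernel: fall back to the default (mod_proxy warns here).
        return system_default
    return requested


def is_recommended_size(requested: int) -> bool:
    """The value should be an integral multiple of 512."""
    return requested % 512 == 0


# Hypothetical kernel limit (65,535 bytes) and system default (32 KB).
print(effective_receive_buffer(16384, kernel_max=65535, system_default=32768))
print(effective_receive_buffer(131072, kernel_max=65535, system_default=32768))
print(is_recommended_size(16384))
```

The second call shows why raising the kernel limit has to come first: a value above the limit silently falls back to the default.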

You can modify the source code to adjust the size of the server's internal read-write buffers by changing the definition of IOBUFSIZE in include/httpd.h.

Unfortunately, you cannot set the kernel buffers' sizes as large as you might want because there is a limit to the available physical memory and OSes have their own upper limits on the possible buffer size. To increase the physical memory limits, you have to add more RAM. You can change the OS limits as well, but these procedures are very specific to OSes. Here are some of the OSes and the procedures to increase their socket buffer sizes:

Linux

For 2.2 kernels, the maximum limit for receive buffer size is set in /proc/sys/net/core/rmem_max and the default value is in /proc/sys/net/core/rmem_default. If you want to increase the rcvbufsize above 65,535 bytes, the default maximum value, you have to first raise the absolute limit in /proc/sys/net/core/rmem_max. At runtime, execute this command to raise it to 128 KB:

panic# echo 131072 > /proc/sys/net/core/rmem_max

You probably want to put this command into /etc/rc.d/rc.local (or elsewhere, depending on the operating system and the distribution) or a similar script that is executed at server startup, so the change will take effect at system reboot.

For the 2.2.5 kernel, the maximum and default values are either 32 KB or 64 KB. You can also change the default and maximum values during kernel compilation; for that, you should alter the SK_RMEM_DEFAULT and SK_RMEM_MAX definitions, respectively. (Since kernel source files tend to change, use the grep(1) utility to find the files.)

The same applies for the write buffers. You need to adjust /proc/sys/net/core/wmem_max and possibly the default value in /proc/sys/net/core/wmem_default. If you want to adjust the kernel configuration, you have to adjust the SK_WMEM_DEFAULT and SK_WMEM_MAX definitions, respectively.

FreeBSD

Under FreeBSD it's possible to configure the kernel to have bigger socket buffers:

panic# sysctl -w kern.ipc.maxsockbuf=2621440

Solaris

Under Solaris this upper limit is specified by the tcp_max_buf parameter; its default value is 256 KB.
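Whatever the OS, it helps to verify what the kernel actually grants after you change a limit. The Python sketch below asks for a 128 KB receive buffer on a fresh TCP socket and prints the size the kernel reports; the function name is hypothetical, and the exact behavior on a request above the limit is OS-specific (some systems cap the value silently, others may raise an error).

```python
import socket


def granted_receive_buffer(requested_bytes: int) -> int:
    """Request a receive buffer size and return the size the kernel reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 128 * 1024
granted = granted_receive_buffer(requested)
print(f"requested {requested} bytes, kernel reports {granted} bytes")
```

Do not expect the two numbers to match exactly. Linux, for instance, reports twice the value it accepted, to account for bookkeeping overhead, so it is more useful to compare the output before and after raising the limit.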

This buffering technique applies only to downstream data (data coming from the origin server to the proxy), not to upstream data. When the server gets an incoming stream, because a request has been issued, the first bits of data hit the mod_perl server immediately. Afterward, if the request includes a lot of data (e.g., a big POST request, usually a file upload) and the client has a slow connection, the mod_perl process will stay tied up, waiting for all the data to come in (unless it decides to abort the request for some reason). Falling back on mod_cgi seems to be the best solution for specific scripts whose major function is receiving large amounts of upstream data. Another alternative is to use yet another mod_perl server, which will be dedicated to file uploads only, and have it serve those specific URIs through correct proxy configuration.

12.7.4. Closing Lingering Connections with lingerd

Because of some technical complications in TCP/IP, at the end of each client connection, it is not enough for Apache to close the socket and forget about it; instead, it needs to spend about one second lingering (waiting) on the client.[43]

[43] More details can be found at https://httpd.apache.org/docs/misc/fin_wait_2.html.

lingerd is a daemon (service) designed to take over the job of properly closing network connections from an HTTP server such as Apache and immediately freeing it to handle new connections.

lingerd can do an effective job only if HTTP KeepAlives are turned off. Since KeepAlives are useful for images, the recommended setup is to serve dynamic content with mod_perl-enabled Apache and lingerd, and static content with plain Apache.

With a lingerd setup, we don't have the proxy (we don't want to use lingerd on our httpd_docs server, which is also our proxy), so the buffering chain we presented earlier for the proxy setup is much shorter here (see Figure 12-8).

Figure 12-8. Shorter buffering chain

Hence, in this setup it becomes more important to have a big enough kernel send buffer.

With lingerd, a big enough kernel send buffer, and KeepAlives off, the job of spoonfeeding the data to a slow client is done by the OS kernel in the background. As a result, lingerd makes it possible to serve the same load using considerably fewer Apache processes. This translates into a reduced load on the server. It can be used as an alternative to the proxy setups we have seen so far.

For more information about lingerd, see https://www.iagora.com/about/software/lingerd/.

12.7.5. Caching Feature

Apache does caching as well. It's relevant to mod_perl only if you produce proper headers, so your scripts' output can be cached. See the Apache documentation for more details on the configuration of this capability.

To enable caching, use the CacheRoot directive, specifying the directory where cache files are to be saved:

CacheRoot /usr/local/apache/cache

Make sure that directory is writable by the user under which httpd is running.

The CacheSize directive sets the desired space usage in kilobytes:

# 50 MB
CacheSize 50000

Garbage collection, which enforces the cache size, runs at an interval set in hours by the CacheGcInterval directive. If it is unspecified, garbage collection never runs and the cache will grow until disk space runs out. This setting tells mod_proxy to check that your cache doesn't exceed the maximum size every hour:

CacheGcInterval 1

CacheMaxExpire specifies the maximum number of hours for which cached documents will be retained without checking the origin server:

CacheMaxExpire 72

If the origin server for a document did not send an expiry date in the form of an Expires header, then the CacheLastModifiedFactor will be used to estimate one by multiplying the factor by the time elapsed since the document was last modified, as supplied in the Last-Modified header.

CacheLastModifiedFactor 0.1

If the content was modified 10 hours ago, mod_proxy will assume an expiration time of 10 × 0.1 = 1 hour. You should set this according to how often your content is updated.

If neither Last-Modified nor Expires is present, the CacheDefaultExpire directive specifies the number of hours until the document is expired from the cache:

CacheDefaultExpire 24
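The three expiry rules can be combined into one small function. The Python sketch below uses the values from the examples above as defaults; the function name and parameters are hypothetical, and applying CacheMaxExpire as a cap on the first two cases (but not on the default) is our reading of the directives, not a copy of mod_proxy's logic.

```python
from typing import Optional


def cache_lifetime_hours(
    expires_in: Optional[float],         # hours until the Expires time, if the header was sent
    modified_ago: Optional[float],       # hours since Last-Modified, if the header was sent
    max_expire: float = 72,              # CacheMaxExpire
    last_modified_factor: float = 0.1,   # CacheLastModifiedFactor
    default_expire: float = 24,          # CacheDefaultExpire
) -> float:
    """Estimate for how many hours a document is served from the cache unchecked."""
    if expires_in is not None:
        # The origin server supplied an explicit expiry date.
        return min(expires_in, max_expire)
    if modified_ago is not None:
        # No Expires header: estimate from the document's age.
        return min(modified_ago * last_modified_factor, max_expire)
    # Neither header is present.
    return default_expire


print(cache_lifetime_hours(None, 10))     # modified 10 hours ago
print(cache_lifetime_hours(None, None))   # no usable headers
print(cache_lifetime_hours(200, None))    # Expires far in the future
```

The first call reproduces the 10 × 0.1 = 1 hour example, and the last one shows the cap: an Expires date 200 hours away is cut down to the 72 hours allowed by CacheMaxExpire.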

12.7.6. Build Process

To build mod_proxy into Apache, just add --enable-module=proxy during the Apache ./configure stage. Since you will probably need mod_rewrite's capability as well, enable it with --enable-module=rewrite.