Postfix Bottleneck Analysis - Example 4: High volume destination backlog

Postfix Documentation

Example 4: High volume destination backlog

When a site you send a lot of email to is down or slow, mail messages will rapidly build up in the deferred queue, or worse, in the active queue. The qshape output will show large numbers for the destination domain in all age buckets that overlap the starting time of the problem:

$ qshape deferred | head

                    T   5  10  20  40   80  160 320 640 1280 1280+
           TOTAL 5000 200 200 400 800 1600 1000 200 200  200   200
  highvolume.com 4000 160 160 320 640 1280 1440   0   0    0     0
             ...

Here the "highvolume.com" destination is continuing to accumulate deferred mail. The incoming and active queues are fine, but the deferred queue started growing some time between 80 and 160 minutes ago (the oldest age bucket that holds mail for this destination) and continues to grow.

If the high volume destination is not down, but is instead slow, one might see similar congestion in the active queue. Active queue congestion is a greater cause for alarm; one might need to take measures to ensure that the mail is deferred instead, or even add an access(5) rule asking the sender to try again later.
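As a rough sketch of the access(5) approach, the fragment below answers mail for the congested destination with a temporary (4NN) reply so that the sending client keeps the message in its own queue. The table name "slow_destinations" and the reply text are hypothetical, and the restrictions listed after the lookup are only placeholders for whatever the site already uses.

```
/etc/postfix/main.cf:
    # Hypothetical table name. The lookup must come before any
    # restriction that would otherwise permit the mail.
    smtpd_recipient_restrictions =
        check_recipient_access hash:/etc/postfix/slow_destinations,
        permit_mynetworks,
        reject_unauth_destination

/etc/postfix/slow_destinations:
    # A 4NN reply code asks the client to try again later.
    highvolume.com    450 Destination congested, please try again later
```

After editing the table, rebuild it with "postmap /etc/postfix/slow_destinations". Remove the entry once the destination has recovered, since it only affects mail received via SMTP and does nothing for mail already in the queue.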

If a high volume destination exhibits frequent bursts of consecutive connections refused by all MX hosts or "421 Server busy" errors, it is possible for the queue manager to mark the destination as "dead" despite the transient nature of the errors. The destination will be retried after the $minimal_backoff_time timer expires. If the error bursts are frequent enough, it may be that only a small quantity of email is delivered before the destination is again marked "dead".

The MTA that has been observed most frequently to exhibit such bursts of errors is Microsoft Exchange, which refuses connections under load. Some proxy virus scanners in front of the Exchange server propagate the refused connection to the client as a "421" error.

Note that it is now possible to configure Postfix to exhibit similarly erratic behavior by misconfiguring the anvil(8) server (not included in Postfix 2.1). Do not use anvil(8) for steady-state rate limiting; its purpose is DoS prevention, and the rate limits set should be very generous!
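To illustrate what "very generous" might look like, the following sketch sets anvil(8)-backed limits in the SMTP server. The numbers are illustrative assumptions, not recommendations; the point is only that they sit far above what a legitimate client reaches in normal operation.

```
/etc/postfix/main.cf:
    # Illustrative values only. Limits are counted per client
    # over each anvil_rate_time_unit interval.
    anvil_rate_time_unit = 60s
    smtpd_client_connection_count_limit = 50
    smtpd_client_connection_rate_limit = 300
```

Limits tight enough to be hit during ordinary traffic make Postfix refuse connections in bursts, which is exactly the behavior described above for the problem destinations.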

In the long run it is hoped that the Postfix dead host detection and concurrency control mechanism will be tuned to be more "noise" tolerant. If one finds oneself needing to deliver a high volume of mail to a destination that exhibits frequent brief bursts of errors, there is a subtle workaround.

In master.cf set up a dedicated clone of the "smtp" transport for the destination in question.
In master.cf configure a reasonable process limit for the transport (a number in the 10-20 range is typical).
IMPORTANT!!! In main.cf configure a very large initial and destination concurrency limit for this transport (say 200).
```
/etc/postfix/main.cf:
    initial_destination_concurrency = 200
    transportname_destination_concurrency_limit = 200
```
Where transportname is the name of the master.cf entry in question.

The effect of this surprising configuration is that up to 200 consecutive errors are tolerated without marking the destination dead, while the total concurrency remains reasonable (10-20 processes). This trick is only for a very specialized situation: high volume delivery into a channel with multi-error bursts that is capable of high throughput, but is repeatedly throttled by the bursts of errors.
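The steps above might be wired together as in the following sketch. The transport name "bursty", the process limit of 15, and the use of "highvolume.com" as the destination are assumptions chosen for illustration; substitute the real destination and a limit suited to the site.

```
/etc/postfix/master.cf:
    # service type  private unpriv  chroot  wakeup  maxproc command
    bursty    unix     -       -       n       -       15   smtp

/etc/postfix/transport:
    highvolume.com    bursty:

/etc/postfix/main.cf:
    transport_maps = hash:/etc/postfix/transport
    bursty_destination_concurrency_limit = 200
```

The maxproc column caps the number of simultaneous deliveries, while the concurrency limit in main.cf only governs how many consecutive errors the queue manager tolerates. Run "postmap /etc/postfix/transport" after changing the transport table.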

When a destination is unable to handle the load even after the Postfix process limit is reduced to 1, a desperate measure is to insert brief delays between delivery attempts.

In the transport map entry for the problem destination, specify a dead host as the primary nexthop.

In the master.cf entry for the transport specify the problem destination as the fallback_relay and specify a small smtp_connect_timeout value.

/etc/postfix/transport:
    problem.example.com  slow:[dead.host]

/etc/postfix/master.cf:
    # service type  private unpriv  chroot  wakeup  maxproc command
    slow      unix     -       -       n       -       1    smtp
        -o fallback_relay=problem.example.com
        -o smtp_connect_timeout=1

This solution forces the Postfix smtp(8) client to wait for $smtp_connect_timeout seconds between deliveries. The solution depends on Postfix connection management details, and needs to be updated when SMTP connection caching is introduced.

Hopefully a more elegant solution to these problems will be found in the future.
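Postfix 2.5 and later provide a per-transport rate delay that is enforced by the queue manager, which may make the dead-host trick unnecessary on those versions. The sketch below assumes the "slow" transport from the master.cf example above (without the fallback_relay and smtp_connect_timeout overrides) and an illustrative delay of one second.

```
/etc/postfix/main.cf:
    # Postfix 2.5 and later only. "slow" is the master.cf
    # transport name; the delay value is an illustrative assumption.
    slow_destination_rate_delay = 1s
```

The transport map entry then points directly at the destination, for example "problem.example.com  slow:", instead of at a dead host.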
