When a site you send a lot of email to is down or slow, mail
messages will rapidly build up in the deferred queue or, worse, in
the active queue. The qshape output will show large numbers for
the destination domain in all age buckets that overlap the starting
time of the problem:
    $ qshape deferred | head
                      T    5   10   20   40   80  160  320  640 1280 1280+
             TOTAL 5000  200  200  400  800 1600 1000  200  200  200   200
    highvolume.com 4000  160  160  320  640 1280 1440    0    0    0     0
Here the "highvolume.com" destination is continuing to accumulate
deferred mail. The
active queues are fine, but the
deferred queue started growing some time between 1 and 2 hours ago
and continues to grow.
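To make the reading of this output concrete, the following is a minimal
sketch that computes each destination's share of the queue total from the
T column. The sample text is copied from the example above; the function
name share_report is hypothetical, and the sketch assumes the header row
starts with "T" and the summary row with "TOTAL", as shown.

```shell
# Sample qshape output copied from the example above (whitespace-separated).
sample='T 5 10 20 40 80 160 320 640 1280 1280+
TOTAL 5000 200 200 400 800 1600 1000 200 200 200 200
highvolume.com 4000 160 160 320 640 1280 1440 0 0 0 0'

# share_report (hypothetical helper): read qshape-style rows on stdin and
# print "destination percent" based on column 2 (the T column).
share_report() {
    awk '$1 == "T"     { next }                 # skip the header row
         $1 == "TOTAL" { total = $2; next }     # remember the queue total
         total > 0     { printf "%s %d%%\n", $1, 100 * $2 / total }'
}

printf '%s\n' "$sample" | share_report
```

On a live system the same function could be fed from "qshape deferred"
instead of the sample text. A single destination holding most of the
deferred queue, as highvolume.com does here, is the pattern described above.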
If the high volume destination is not down, but is instead
slow, one might see similar congestion in the active queue. Active
queue congestion is a greater cause for alarm; one might need to
take measures to ensure that the mail is deferred instead, or even
add an access(5) rule asking the sender to try again later.
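As a rough sketch of what such an access(5) rule might look like, assuming
the rule is applied to recipients in the SMTP server's restriction list:
the table name slow_destinations, the reply text, and the surrounding
restrictions are illustrative assumptions and not part of this document.

```
# /etc/postfix/main.cf (fragment; merge into the existing restriction list)
smtpd_recipient_restrictions =
    check_recipient_access hash:/etc/postfix/slow_destinations,
    permit_mynetworks,
    reject_unauth_destination

# /etc/postfix/slow_destinations (hypothetical table; run postmap after editing)
highvolume.com    450 4.3.2 Destination is congested, please try again later
```

A 4xx reply keeps the message in the sender's queue, so the rule should be
removed once the destination recovers.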
If a high volume destination exhibits frequent bursts of
consecutive connections refused by all MX hosts or "421 Server busy"
errors, it is possible for the queue manager to mark the destination
as "dead" despite the transient nature of the errors. The destination
will be retried again after the expiration of a $
timer. If the error bursts are frequent enough it may be that only
a small quantity of email is delivered before the destination is
again marked "dead".
The MTA that has been observed most frequently to exhibit such
bursts of errors is Microsoft Exchange, which refuses connections
under load. Some proxy virus scanners in front of the Exchange
server propagate the refused connection to the client as a "421"
error.
Note that it is now possible to configure Postfix to exhibit
similarly erratic behavior by misconfiguring the anvil(8) service
(not included in Postfix 2.1). Do not use anvil(8) for steady-state
rate limiting; its purpose is DoS prevention, and the rate limits
set should be very generous!
In the long run it is hoped that the Postfix dead host detection
and concurrency control mechanism will be tuned to be more "noise"
tolerant. If one finds oneself needing to deliver a high volume
of mail to a destination that exhibits frequent brief bursts of
errors, there is a subtle workaround.
In master.cf set up a dedicated clone of the "smtp"
transport for the destination in question.
In master.cf configure a reasonable process limit for the
transport (a number in the 10-20 range is typical).
IMPORTANT!!! In main.cf configure a very large initial
and destination concurrency limit for this transport (say 200).
    initial_destination_concurrency = 200
    transportname_destination_concurrency_limit = 200
where transportname is the name of the master.cf entry in question.
The effect of this surprising configuration is that up to 200
consecutive errors are tolerated without marking the destination
dead, while the total concurrency remains reasonable (10-20
processes). This trick is only for a very specialized situation:
high volume delivery into a channel with multi-error bursts
that is capable of high throughput, but is repeatedly throttled by
the bursts of errors.
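Put together, the three steps above might look like the following sketch.
The transport name "fragile", the destination example.com, and the process
limit of 20 are placeholders chosen for illustration; substitute values
that fit the site in question.

```
# /etc/postfix/master.cf: dedicated clone of "smtp" with a modest process limit
# service type  private unpriv  chroot  wakeup  maxproc command
fragile   unix  -       -       n       -       20      smtp

# /etc/postfix/transport: route the bursty destination through the clone
example.com    fragile:

# /etc/postfix/main.cf
transport_maps = hash:/etc/postfix/transport
initial_destination_concurrency = 200
fragile_destination_concurrency_limit = 200
```

The maxproc column caps the number of simultaneous deliveries, while the
two concurrency settings only raise the number of consecutive errors that
are tolerated, which is why the two numbers can differ so widely.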
When a destination is unable to handle the load even after the
Postfix process limit is reduced to 1, a desperate measure is to
insert brief delays between delivery attempts.
In the transport map entry for the problem destination,
specify a dead host as the primary nexthop.
In the master.cf entry for the transport specify the
problem destination as the fallback_relay and specify a small
smtp_connect_timeout value.
    # service type  private unpriv  chroot  wakeup  maxproc command
    slow      unix  -       -       n       -       1       smtp
This solution forces the Postfix
smtp(8) client to wait for
smtp_connect_timeout seconds between deliveries. The solution
depends on Postfix connection management details, and needs to be
updated when SMTP connection caching is introduced.
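A sketch of how the two pieces could fit together is shown below. The
names problem.example.com and dead.host are placeholders, and the timeout
of 1 second is only an example of a "small" value. Note also that
Postfix 2.3 and later call the fallback_relay parameter
smtp_fallback_relay.

```
# /etc/postfix/transport: send the problem destination to a host that is
# known not to answer, so every delivery first waits out the connect timeout
problem.example.com    slow:[dead.host]

# /etc/postfix/master.cf: one delivery process, real destination as fallback
# service type  private unpriv  chroot  wakeup  maxproc command
slow      unix  -       -       n       -       1       smtp
    -o fallback_relay=problem.example.com
    -o smtp_connect_timeout=1
```

With a single process and a one-second connect timeout, deliveries to the
problem destination are spaced roughly one second apart.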
Hopefully a more elegant solution to these problems will be
found in the future.