Here are some technical remarks about email transport issues, intended mainly for system administrators using the Exim mail transfer agent (MTA).  We continue to see occasional problems with email providers that use Exim to relay to us, and presumably to any other mail server that uses greylisting.  As we note elsewhere, we've found greylisting to be a highly effective anti-spam measure in practice, and it is now widely used (although, as botnet-based spam declines in proportion to freemail-based spam, it is not quite as important as it was in 2004).  There's no problem with Exim in general, but exactly why some Exim configurations seem unable to cope with temporary 4xx error codes is still something of a mystery, and we'd appreciate it if any Exim experts could contact us with further information.

First I'll explain in some detail the impact of a possible misconfiguration, and then I'll look at a specific problem in cPanel installations of Exim with a high load.

Exim's retry configuration

It could just be that Exim makes it easy to override or remove Exim's standard configuration options.  The short Exim documentation says:

If the retry section is removed from the configuration, or is empty (that is, if no retry rules are defined), Exim will not retry deliveries. This turns temporary errors into permanent errors.

Turning temporary errors into permanent errors would break greylisting, and would also generate unnecessary bounces whenever there are interruptions to DNS or network service.  Similarly, a rule inserted by the sysadmin before the default rule might prevent the default rule from matching, although that's very unlikely to be the issue.  Here for example is a typical unresolved question from a cPanel and Exim user: "How / where do I verify that Exim is setup to resend the email?" (The answer is not given on that page, but is mentioned below.)

The issue doesn't appear to be limited to a single Linux distribution.  This blog entry is from a Debian user apparently without a pre-existing retry configuration (the rule suggested there retries less often than the default).  Contrary to what is said there, it would seem from Section 3 of chapter 32 of the Exim documentation that retry status is per recipient per sender address, and this behaviour logically gives Exim no cause to bounce the message even if there are many 4xx responses from the same destination host.  RFC 2821 section 4.5.4.1 states:

Retries continue until the message is transmitted or the sender gives up; the give-up time generally needs to be at least 4-5 days.

And indeed the maximum queue time is 5 days for Postfix and 4 days for Exim (apart from in some of the documentation examples).  So one thing to check is that the default rule exists somewhere in the Exim files, so that Exim does not give up immediately.  In Debian and Ubuntu, the default rule is in the "exim4-config" package as /etc/exim4/conf.d/retry/00_exim4-config_header and /etc/exim4/conf.d/retry/30_exim4-config, but if this package is not installed the rules may need to be added manually.  Here are the default rules taken from the documentation and files linked to above:

begin retry 

* * F,2h,15m; G,16h,1h,1.5; F,4d,6h
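To make the rule easier to read, here is a rough Python sketch of the retry times it implies.  It is only an idealised illustration, not Exim's actual algorithm: it assumes every attempt fails, ignores the queue runner interval, and picks the rule step according to the time elapsed since the first failure.  The names RULE and retry_schedule are made up for this example.

```python
from datetime import timedelta

# Hypothetical encoding of "F,2h,15m; G,16h,1h,1.5; F,4d,6h" as
# (kind, cutoff since first failure, interval, growth factor).
RULE = [
    ("F", timedelta(hours=2), timedelta(minutes=15), None),
    ("G", timedelta(hours=16), timedelta(hours=1), 1.5),
    ("F", timedelta(days=4), timedelta(hours=6), None),
]


def retry_schedule(rule):
    """Return idealised retry times as offsets from the first failure.

    Assumption: a step applies while the time since the first failure is
    below its cutoff, so the last retry of a step may land after the cutoff.
    """
    times = []
    now = timedelta(0)
    for kind, cutoff, interval, factor in rule:
        step = interval
        while now < cutoff:
            now += step
            times.append(now)
            if kind == "G":
                # Geometric step: each interval is `factor` times the last.
                step = step * factor
    return times


if __name__ == "__main__":
    for offset in retry_schedule(RULE):
        print(offset)
```

Under these assumptions the first eight retries fall at 15 minute intervals, the next ones at growing intervals starting from one hour, and the remainder every six hours until the four day cutoff has passed.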

The following command can be used to test the configuration:

exim -brt smtp-gate.gn.apc.org

It should list a matching rule, very likely the one above.  If there has already been a complaint, exigrep can be used on the Exim logs to check how the message was actually retried.

Another factor affecting Exim's retry behaviour is how often the queue runner runs, since a message that is due for a retry is only attempted at the next queue run.  The interval is set by a command-line option (ps aux | grep exim4 should show the current value after the "-q" switch).  An hour would seem to be unnecessarily long from the point of view of greylisting, and on Debian and Ubuntu it can be shortened by putting

QUEUEINTERVAL='15m'

into /etc/default/exim4.

In cPanel WHM, the queue interval should be configurable under "Main" > "Server Configuration" > "Tweak Settings" > "Mail" > "Email delivery retry time".
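The effect of the queue interval on retry timing can be sketched in a few lines of Python.  This is a simplification that assumes queue runs happen at exact multiples of the interval, counted from the first failure; the function name is hypothetical.

```python
import math


def first_attempt_after(due_min, queue_interval_min):
    """Minutes until the first queue run at or after the retry is due.

    Assumption: queue runs occur at 0, interval, 2 * interval, ... minutes.
    """
    return math.ceil(due_min / queue_interval_min) * queue_interval_min


if __name__ == "__main__":
    # A retry due after 15 minutes, with hourly and 15 minute queue runs.
    print(first_attempt_after(15, 60))
    print(first_attempt_after(15, 15))
```

With an hourly queue run, a retry that the rule makes due after 15 minutes is not attempted until the one hour mark, which is why a shorter interval suits greylisting better.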

Exim, cPanel and deliver_queue_load_max

The other thing that can stop Exim from retrying as normal is a limit, deliver_queue_load_max, normally not set in the default Exim configuration:

When this option is set, a queue run is abandoned if the system load average becomes greater than the value of the option.

Now, if the system load average is higher than this value whenever the queue runner runs, retrying will only occur sporadically or not at all.  (A high load average does not necessarily mean the machine is overloaded: it may simply reflect several running processes on a multi-CPU system, or processes in uninterruptible sleep or swapping.)  This may be a long-standing reason behind complaints of failure to deliver from cPanel to Yahoo, or that the Exim configuration above is apparently ignored (and these logs are an example of sporadic retries too infrequent for a typical greylisting window of 6 hours).  The cPanel default setting for this option is 3, but it can be raised to a value appropriate for the number of CPUs and usual machine load, in cPanel under "WHM" > "Exim Configuration Editor" > "Advanced Mode".  In the top area, you can add:

deliver_queue_load_max = 20

and then "Save". A suggested value is 5 times the number of cores in the machine. You might want to increase queue_run_max at the same time to speed delivery if you have many CPUs. Thanks to Steve at Krystal Hosting for help in identifying the cPanel issue. It looks like cPanel will be addressing it in a future release, 11.32, under case numbers 42802 and 52773.
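As a quick sanity check, the following Python sketch works out the suggested value for the local machine and compares it with the current load.  It assumes the 1-minute load average is the relevant figure, and the helper names are invented for illustration; it does not read or change any Exim setting.

```python
import os


def suggested_load_max(cores, per_core=5):
    """Rule of thumb from the text: 5 times the number of cores."""
    return cores * per_core


def queue_run_would_be_abandoned(load_avg, load_max):
    """True if the load average is greater than the configured limit."""
    return load_avg > load_max


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    limit = suggested_load_max(cores)
    print("suggested deliver_queue_load_max =", limit)
    try:
        load1 = os.getloadavg()[0]  # 1-minute load average (Unix only)
    except (AttributeError, OSError):
        load1 = None
    if load1 is not None:
        print("current load:", load1)
        print("above cPanel default of 3:", queue_run_would_be_abandoned(load1, 3))
        print("above suggested value:", queue_run_would_be_abandoned(load1, limit))
```

Running it at a busy time of day gives a rough idea of whether the default of 3 is likely to be stopping queue runs on that machine.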

Ensuring the correct rules are present, and that queue runner interval and deliver_queue_load_max are optimised, will result in much more reliable mail delivery from the Exim relay.