Unix Programming - Data-Driven Programming - Case Study: Statistical Spam Filtering
One interesting case of data-driven programming is statistical
learning algorithms for detecting spam (unsolicited bulk email).
A whole class of mail filter programs (those easily findable
by Web search include popfile,
bogofilter) use a database of word
correlations to replace the elaborate pattern-matching conditional
logic of pattern-matching spam filters.
Programs like these became common on the Internet very rapidly
following Paul Graham's landmark paper A Plan for
Spam [Graham] in 2002. While the
explosion was triggered by the increasing cost of the pattern-matching
arms race, the statistical-filtering idea was adopted first and
fastest by Unix shops. In part, this was certainly because almost all
the Internet service providers (who are most burdened by spam, and
thus had most incentive to adopt effective new techniques) are Unix
shops — but undoubtedly the harmony with some traditional themes
in Unix software design helped as well.
Conventional spam filters require that a system administrator,
or some other responsible party, maintain information on patterns of
text found in spam — names of sites that emit nothing but spam,
come-on phrases often used by pornography sites or Internet scam
artists, and the like. In his paper, Graham noted accurately that
computer programmers like the idea of pattern-matching filters, and
sometimes have difficulty seeing past that approach, because it offers
them so many opportunities to be clever.
Statistical spam filters, on the other hand, work by collecting
feedback about what the user judges to be spam versus nonspam. That
feedback is processed into databases of statistical correlation
coefficients or weights connecting words or phrases
to the user's spam/nonspam classification. The most popular
algorithms use minor variants of Bayes's Theorem on conditional
probabilities, but other techniques (including various sorts of
polynomial hashing) are also employed.
In all these programs, the correlation check is a relatively
trivial mathematical formula. The weights fed into the formula along
with the message being checked serve as implicit control
structure for the filtering algorithm.
The problem with conventional pattern-matching spam filters is
that they are brittle. Spammers are constantly gaming against the
filter-rule databases, forcing the filter maintainers to constantly
reprogram their filters to stay ahead in the arms race. Statistical
spam filters generate their own filter rules from the user feedback.
In fact, experience with statistical filters seems to show that the
particular learning algorithm used is far less important than the
quality of the spam and nonspam data sets from which the learning algorithm
computes its weights. So the results of statistical filters really
are driven more by the shape of the data than by the algorithm.
A Plan for Spam was something of a
bombshell because its author argued convincingly that a simple, even
crude, statistical approach gave a lower rate of nonspam being
erroneously classified as spam than either elaborate pattern-matching
techniques or the human eyeball could manage. For Unix programmers,
seeing past the lure of clever pattern-matching was far easier than in
other programming cultures without as strong an attachment to
“Keep It Simple, Stupid!”
[an error occurred while processing this directive]