Address munging
Spammers harvest email addresses using bots that surf the Net in search of
email addresses. If an email address is hidden somehow when it's published on
the Web, a bot may miss it. Address munging is the process of hiding or
disguising an address. For instance, you can write an address like this: name
[AT] domain [DOT] com, or create an image that displays the address, or write
the address as HTML character references built from its ASCII codes. For
example, when you put &#64; in the HTML code, the browser translates it to @.
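Two of the disguises described above can be sketched in a few lines of Python.
This is only an illustration: the function names are made up for this example,
and html.unescape stands in for what a browser does when it renders the page.

```python
import html


def munge_entities(address: str) -> str:
    """Encode every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)


def munge_readable(address: str) -> str:
    """Rewrite name@domain.com as 'name [AT] domain [DOT] com'."""
    return address.replace("@", " [AT] ").replace(".", " [DOT] ")


encoded = munge_entities("name@domain.com")
print(encoded)                        # the text you would paste into the HTML
print(html.unescape(encoded))         # roughly what a visitor would see
print(munge_readable("name@domain.com"))
```

Keep in mind that a bot able to decode character references will still find
the address, so this only raises the bar a little.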
Content-based filters
Once the spammers have your email address, the fight moves to your mail
server and inbox. A simple approach to reducing spam is to filter each
message's content. With content filters, the body of the message is scanned in
search of trigger words, such as Viagra or free money. If one or more of these
words are found, the message is marked as spam. Some implementations don't
give a plain "spam/not spam" verdict but instead assign a score (the higher the
score, the higher the chance the message is spam), so you can tune the
threshold at which a message is treated as spam.
The main disadvantage of this method is that spammers often misspell words
or hide them to avoid recognition. Moreover, using a large list of trigger
words can increase the number of false positives.
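A scoring filter of this kind might look like the following sketch. The trigger
phrases, their weights, and the threshold are invented for illustration; real
filters ship with much larger, carefully tuned rule sets.

```python
import re

# Hypothetical trigger phrases and weights, chosen only for this example.
TRIGGER_WEIGHTS = {"viagra": 3.0, "free money": 2.5, "winner": 1.0}


def spam_score(body: str, weights=TRIGGER_WEIGHTS) -> float:
    """Add up the weights of all trigger phrases found in the body."""
    # Lowercase and collapse whitespace so "FREE   money" still matches.
    text = re.sub(r"\s+", " ", body.lower())
    return sum(weight for phrase, weight in weights.items() if phrase in text)


def is_spam(body: str, threshold: float = 3.0) -> bool:
    """Treat the message as spam once its score reaches the threshold."""
    return spam_score(body) >= threshold
```

The weakness mentioned above is easy to reproduce: is_spam("Cheap Viagra") is
true, while the misspelled "Cheap V1agra" scores zero and passes.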
The natural evolution of these methods applies statistical analysis to a
message's contents (typically with a Bayes classifier) to recognize spam in a
more adaptive way. In a mail client that employs Bayesian filtering, the user
marks messages as spam or not spam, and over time the filter learns which
messages are good and which are bad. This method can be used on both the client
side, with software such as Mozilla Thunderbird, and the server side, with
packages like SpamAssassin.
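To make the idea of a filter that learns from the user's marks concrete, here
is a minimal naive Bayes sketch. It is not how Thunderbird or SpamAssassin
implement their filters; the class name, the tokenizer, and the add-one
smoothing are assumptions made for this example.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        """Record one message the user marked as 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so an untrained filter does not divide by zero.
            log_p = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra +1 covers words never seen before.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability without underflow.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)
```

After a few calls to train() with marked messages, spam_probability() returns a
value between 0 and 1 that can be compared against a threshold of your choice.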
Although Bayesian techniques try to resolve some of the limitations of
content-based filters, you can still get false positives. Moreover, spammers
can encapsulate the message into an image or craft the text to try to bypass
the filter.
Sender Policy Framework
Spammers often send mail from forged addresses. A system called Sender Policy
Framework (SPF) uses the Domain Name System (DNS) to decide when to reject
or accept a message.
To implement this technique, you have to add a TXT record to the DNS zone of
your domain, using a special syntax. You can use the wizard on the SPF homepage
to generate one. The record specifies which hosts and IP addresses are allowed
to send mail from your domain. Then, when your mail server receives a message
from name@domain.com, it makes a DNS query to domain.com searching for an SPF
record. If one is found, the mail server checks whether the host that sent the
message is in the list of allowed ones. If it is not, the message is rejected.
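The check against the list of allowed hosts can be sketched as follows. To stay
self-contained, this example takes the text of the SPF record as a string
instead of fetching it from DNS, and it understands only the ip4, ip6, and all
mechanisms; a, mx, include, and the rest need further DNS lookups and are
skipped. The sample record and the function name are assumptions for
illustration.

```python
import ipaddress


def spf_allows(spf_record: str, sender_ip: str) -> bool:
    """Return True only if the record explicitly passes the sender's IP."""
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"                      # "+" (pass) is the default qualifier
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        mechanism, _, value = term.partition(":")
        mechanism = mechanism.lower()
        if mechanism in ("ip4", "ip6"):
            network = ipaddress.ip_network(value, strict=False)
            matched = ip.version == network.version and ip in network
        elif mechanism == "all":
            matched = True
        else:
            continue                         # needs DNS; ignored in this sketch
        if matched:
            return qualifier == "+"          # first matching term decides
    return False


record = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.25 -all"
print(spf_allows(record, "192.0.2.17"))
print(spf_allows(record, "203.0.113.9"))
```

A real implementation distinguishes more outcomes than allowed or not allowed
(soft failures and neutral results, for example), so treat the boolean here as
a simplification.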
Again, SpamAssassin is one popular open source application that can implement
SPF. Many popular mail servers implement it directly or through patches or
plugins.
SPF is a good technique but it has two drawbacks. First, SPF records are not
widely used, and on domains without an SPF record, the SMTP server will accept
any message. Second, often it's difficult to decide which hosts are allowed to
send mail using a given domain as sender.
Real time blacklist
Another server-side technique called real time blacklist uses a central
database with a list of untrusted IP addresses that have been known to deliver
spam. When the SMTP server receives a message, it must query one or more of
these lists, looking for the sender's IP address. If it's found, the message is
rejected.
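Most of these lists are queried through DNS: the octets of the sender's IPv4
address are reversed and prepended to the list's zone name, and an answer in
the 127.0.0.0/8 range means the address is listed. The sketch below follows
that convention. The zone dnsbl.example.org is a placeholder rather than a real
list, and the resolve parameter exists only so the lookup can be replaced in
tests.

```python
import ipaddress
import socket


def dnsbl_query_name(sender_ip: str, zone: str) -> str:
    """Build the DNS name to look up, e.g. 99.2.0.192.<zone> for 192.0.2.99."""
    octets = ipaddress.IPv4Address(sender_ip).packed
    return ".".join(str(octet) for octet in reversed(octets)) + "." + zone


def is_listed(sender_ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the list answers with an address in 127.0.0.0/8."""
    try:
        answer = resolve(dnsbl_query_name(sender_ip, zone))
    except socket.gaierror:
        return False          # no such name: the address is not on this list
    return answer.startswith("127.")


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Note that this sketch treats every lookup failure as "not listed", which is the
forgiving choice when a list is unreachable.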
There are a lot of lists available. Sorbs and SBL are two widely used ones.
This method works well, but some lists are too restrictive and others are
too permissive. It can also lead to false positives: if someone runs a home
server (with a dynamic IP address) or was added to a list by mistake, you
won't receive messages from them.
Greylisting
Another server-side approach relies on the way SMTP servers are used by
spammers. When a destination mail server is not available, the sending server
tries to send the message again later. Many servers used by spammers are
simpler, and care more about the number of messages sent than whether every
message arrives, so when a spammer's SMTP server gets an error during delivery,
it gives up sending the message instead of trying again.
With greylisting, a destination
mail server will reject every message from an unknown IP address with a
temporary error. A traditional mail server will retry later, and at that time
the message will be accepted. This approach requires the receiving server to
save the IP address of the sender so it can recognize it later and then accept
the message.
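The bookkeeping described above fits in a small class. This is a sketch under
stated assumptions: the five-minute minimum delay is an arbitrary choice, the
table lives in memory only, and the reply strings are just examples of a
temporary (4xx) and a successful (2xx) SMTP response.

```python
import time


class Greylist:
    def __init__(self, min_delay: float = 300.0, clock=time.time):
        self.min_delay = min_delay      # seconds a new sender must wait
        self.clock = clock              # injectable so tests can control time
        self.first_seen = {}            # sender IP -> time of first attempt

    def check(self, sender_ip: str) -> str:
        """Return the SMTP reply to give for a delivery attempt from this IP."""
        now = self.clock()
        first = self.first_seen.setdefault(sender_ip, now)
        if now - first >= self.min_delay:
            return "250 OK"
        return "451 Temporary failure, please try again later"
```

A sender that never retries stays at the temporary error forever, while a
regular mail server gets through on its first retry after the delay has
passed.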
The main drawback of this method is increased latency when receiving
messages, though you can ameliorate that problem with techniques such as
whitelisting trusted servers.
An interesting variant of greylisting uses the method described above only
if the sender is found on a real time blacklist (typically one that's very
restrictive). That way the majority of messages arrive instantly, and the rest
arrive with a little delay.
Vipul's Razor
Vipul's Razor fights spam by
promoting collaboration between users. Cloudmark
maintains centralized databases that collect a sort of hash (in effect, a small
fingerprint) of spam messages. When a user receives a message, the software
automatically queries these servers looking for the hash of the message. If
there is a match, the message is rejected. If there is no match, but the
message is junk anyway, its hash can be sent (manually or automatically) to
these centralized servers. To resist hashbusters (data added to a message to
make its hash different), the system uses ephemeral signatures, which
calculate the hash of only a randomly chosen part of the message.
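The idea of hashing only a random part of the message can be illustrated with
the sketch below. It is not Razor's actual algorithm: the seed (assumed here to
be a value that all clients share for some period), the offset range, the
window size, and the choice of SHA-1 are all made up for this example.

```python
import hashlib
import random


def ephemeral_signature(body: bytes, seed: int, window: int = 64) -> str:
    """Hash one window of the body whose position is derived from the seed."""
    rng = random.Random(seed)            # same seed, same window position
    start = rng.randrange(0, 128)
    return hashlib.sha1(body[start:start + window]).hexdigest()


spam = b"Buy cheap pills today! " * 20
copy_a = spam + b"xk39f"                 # each copy carries different junk
copy_b = spam + b"qq71z"

# A hash of the whole message differs between the copies...
print(hashlib.sha1(copy_a).hexdigest() == hashlib.sha1(copy_b).hexdigest())
# ...while the signature of the selected window is the same.
print(ephemeral_signature(copy_a, seed=7) == ephemeral_signature(copy_b, seed=7))
```

Because the spammer cannot predict which part will be hashed, junk added in one
place is unlikely to change every signature.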
There are three main drawbacks of this approach:
- Hash values of some junk messages may not be in the database yet when we
  query the system.
- Two completely different messages could have the same hash. Although this is
  very uncommon, it must be taken into consideration when evaluating false
  positives.
- Because the system is powered by users, someone could report a message as
  spam even though it isn't. In recent versions a Truth Evaluation System (a
  sort of user reputation system) has improved this, but again the problem
  should be considered when evaluating false positives.
Distributed Checksum Clearinghouse
The Distributed Checksum Clearinghouse (DCC) works in a way similar to Razor.
It also uses a kind of hash of the message (a checksum) and queries a
centralized server that holds all the checksums. However, in this technique,
there isn't direct cooperation
between users; instead, the system is totally automatic. The mail server/client
sends to the central server the checksums of all messages (spam or not) with no
user interaction. The central system counts the occurrences of every checksum
and, when a certain threshold value is exceeded, the message is marked as spam.
In this approach, if the same message is received by a lot of users, it is
probably junk. Fuzzy checksums, which ignore the parts of a message that
spammers commonly vary, are used to resist hashbusters.
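The counting scheme can be sketched like this. The class below plays the role
of the central server; it is not the DCC protocol or its checksum algorithm.
The threshold is an arbitrary example value, and lowercasing plus whitespace
normalization is only a crude stand-in for the fuzzy matching mentioned above.

```python
import hashlib
from collections import Counter


class ChecksumClearinghouse:
    def __init__(self, threshold: int = 100):
        self.threshold = threshold
        self.counts = Counter()          # checksum -> number of reports

    @staticmethod
    def checksum(body: str) -> str:
        # Ignore case and whitespace differences before hashing.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, body: str) -> bool:
        """Count one sighting; return True once the threshold is exceeded."""
        digest = self.checksum(body)
        self.counts[digest] += 1
        return self.counts[digest] > self.threshold
```

Every mail server reports every message it sees, so legitimate bulk mail such
as mailing lists also exceeds the threshold and has to be whitelisted.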
Although this technique does not require huge bandwidth, it can slow down an
already overloaded mail server. Large organizations should run local DCC
servers instead of relying on one master server.
DomainKeys Identified Mail
This server-side method uses asymmetric cryptography, and guarantees the
integrity of the message too. The mail server that is sending the message adds
a header to the message containing a digital signature of the message content.
The sending domain also needs to publish a special DNS record that holds its
public key (similar to SPF). On the receiving end, the mail server reads the
domain of the sender and retrieves its public key with a DNS query. At this
point, with the public key and the signature verification algorithm, the
receiving server can verify that the message was sent from the claimed domain,
and that the message wasn't modified during the transfer.
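The sign-then-verify exchange can be demonstrated with a toy example. What
follows is textbook RSA over a tiny key, written only to show which side uses
which key. It is not secure, it is not the DKIM header format, and the key is
built from two well-known primes instead of being generated randomly. It needs
Python 3.8 or later for the modular inverse.

```python
import hashlib

# Toy key pair built from two known Mersenne primes. Far too weak for real use.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                                # public modulus
E = 65537                                # public exponent (published in DNS)
D = pow(E, -1, (P - 1) * (Q - 1))        # private exponent (kept by the sender)


def digest_as_int(message: bytes) -> int:
    """Reduce the SHA-256 digest of the message to a number below N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sending server: sign the digest with the private key."""
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Receiving server: check the signature using only the public key."""
    return pow(signature, E, N) == digest_as_int(message)


mail = b"Subject: hello\r\n\r\nSee you on Monday."
signature = sign(mail)
print(verify(mail, signature))                  # untouched message
print(verify(mail + b" Buy pills!", signature)) # message altered in transit
```

The receiver never sees the private key, yet any change to the message makes
the check fail, which is where the integrity guarantee comes from.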
The main drawback of this system is its limited adoption. Although big
companies like Yahoo! implement it, it isn't used by a lot of small servers.
White list / black list
Whitelisting and blacklisting aren't really antispam techniques but rather
additional controls that one can use with almost every method. In a whitelist
one can specify a series of trusted addresses or domains. If a sender is in
this list, all controls are skipped and the message is received without delays
or the risk of a false positive.
A blacklist collects addresses that users don't want to receive mail from.
Depending on the implementation, messages from those addresses can be rejected
or marked in some way.
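As a sketch of how these lists sit in front of the other checks, the function
below looks up both the full address and its domain. The function name, the
three verdict strings, and the decision to let the whitelist win over the
blacklist are choices made for this example.

```python
def list_verdict(sender: str, whitelist: set, blacklist: set) -> str:
    """Return 'accept', 'reject', or 'filter' (run the normal antispam checks)."""
    address = sender.lower()
    domain = address.rpartition("@")[2]
    if address in whitelist or domain in whitelist:
        return "accept"                  # trusted: skip every other control
    if address in blacklist or domain in blacklist:
        return "reject"                  # or mark it, depending on the setup
    return "filter"


whitelist = {"partner.example", "boss@corp.example"}
blacklist = {"spam.example"}
print(list_verdict("Boss@corp.example", whitelist, blacklist))
print(list_verdict("offers@spam.example", whitelist, blacklist))
print(list_verdict("someone@else.example", whitelist, blacklist))
```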
Conclusion
What's the best antispam technique? The answer depends on the kind and volume
of spam you receive. For example, if you don't receive much email, you would
probably prefer a system with no false positives at all. Mail administrators who
don't want to maintain a complex infrastructure should avoid using Vipul's
Razor or content filters that must be trained.
You can even mix techniques, or customize them in any way you like.
Lorenzo Simionato studies computer science at the University of Venice.
For the last year he has been system administrator for his school's Linux user
group.