Identifying Spam
Sometimes it's quite easy to determine if a message
is spam, based on the obvious "spam-like" content of a given
message or the name of the sender. Many spam filters work simply by
searching for the most common words and names used by spammers. However,
things are rarely that easy.
A definition we've been moving towards is
that spam is "unsolicited bulk, commercial, or objectionable email,
often sent using stolen resources." Once we unpack this definition,
it becomes clear that spam identification is problematic and requires a
systematic approach, one that cannot be completely automated. For example,
it takes a certain amount of research and analysis to determine whether
headers have been forged and at what step in the delivery process. In
addition, determining whether a message is truly "unsolicited"
opens up another level of complexity, where certain qualitative decisions
need to be made. For us, a crucial part of our mission is spam
identification. This section briefly summarizes how we evaluate a
questionable message. In short, the following are the general questions
that we ask when judging whether a message is spam.
Is there a prior relationship between the sender and the
recipient?
From our perspective, determining if a message is
unsolicited is the key goal. To this end, it helps to verify the existence
of a prior relationship between the sender and the recipient. If you
receive bulk email from a person or company that you never heard of, it is
unlikely that you requested to receive the email. An analysis of the
message content and of the business that's advertising can often rule out
or confirm a prior relationship. However, companies and individuals who
have had a relationship with the recipient can still send unsolicited
messages, and those messages are still spam.
Is there a legitimate removal option?
Another important clue is a removal option.
Removal, or "opt-out," options typically come in the form of an
email address or Web site link within the content of the email. The
recipient can then theoretically follow the removal instructions to cease
delivery of further mailings. Research on the particulars of the removal
option is necessary to distinguish spam from legitimate bulk email that
recipients may have subscribed to in the past. The presence of an
effective removal option in the email message does not by itself mean the
message is not spam. Some less legitimate senders of email actually sell
removal response messages to spammers, who then send further unsolicited
emails to these "confirmed live" email accounts. Thus, a removal
option becomes yet another tool in the spammer toolbox. Further
investigation is required to determine whether the removal option works
and does not lead to more unsolicited email messages.
Was there an attempt to conceal header information?
The next bit of detective work involves examining
the headers. The headers provide information such as the sender of the
message, the recipient, the mailer that was used to send the mail, the
names of the different servers that processed the message along the way,
and so on. Header information provides a good summary of the path that a
piece of mail took. Unfortunately, spammers can forge header information
easily; it is a trivial matter to insert arbitrary information to cover
their tracks. The last thing that many spammers want is to reveal their
identities and whereabouts. Various tools allow you to trace the paths of
messages. Detailed and systematic
analysis of the headers is often necessary to sort out what
doesn't make sense and spot the inconsistencies or impossibilities.
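As a rough illustration of this kind of header analysis, the sketch below walks the Received headers of a message from the oldest hop to the newest and flags places where the server that accepted the mail is not the server that the next hop says it received it from. The sample message, the host names, and the function names are all hypothetical, and the check is deliberately simplistic: real relays often appear under more than one name, so a break in the chain is a reason to look closer, not proof of forgery.

```python
import re
from email import message_from_string

# Hypothetical sample message. The lower (older) Received line does not
# connect to the upper (newer) one, as often happens with forged headers.
SAMPLE = (
    "Received: from mail.bank.example (unknown [203.0.113.7])\n"
    "\tby mail.recipient.example with ESMTP; Mon, 1 Jan 2024 10:00:05 +0000\n"
    "Received: from dialup.example.org ([198.51.100.23])\n"
    "\tby relay.example.net with SMTP; Mon, 1 Jan 2024 10:00:01 +0000\n"
    "From: support@bank.example\n"
    "To: user@recipient.example\n"
    "Subject: Account notice\n"
    "\n"
    "Body text.\n"
)

# Captures the host after "from" and the host after "by" in one Received value.
HOP_RE = re.compile(r"from\s+(\S+).*?\bby\s+(\S+)", re.S)


def received_hops(raw_message):
    """Return (from_host, by_host) pairs ordered from oldest hop to newest."""
    msg = message_from_string(raw_message)
    hops = []
    # Each server prepends its own Received line, so the headers read
    # newest-first; reverse them to follow the delivery path in order.
    for header in reversed(msg.get_all("Received", [])):
        match = HOP_RE.search(header)
        if match:
            hops.append((match.group(1), match.group(2)))
    return hops


def chain_breaks(hops):
    """Return (by_host, next_from_host) pairs where consecutive hops disagree."""
    breaks = []
    for (_, by_host), (next_from, _) in zip(hops, hops[1:]):
        if by_host != next_from:
            breaks.append((by_host, next_from))
    return breaks
```

Calling `chain_breaks(received_hops(SAMPLE))` reports the one place in the sample where the path does not line up, which is the kind of inconsistency an analyst would then investigate by hand.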
Was the message sent by bulk methods?
Another important clue is whether the email message
is bulk email. Were multiple copies of this email message sent? If you
received multiple copies of the same message, this often indicates that
the message was delivered in bulk by an automated tool. In most cases,
however, it is difficult to tell, as you receive only one copy, and cannot
know if one or one million such messages were sent. Our spam system's
unique architecture, however, enables staff to quickly verify if certain
messages were delivered in bulk. Bulk mail delivery alone does not
identify a message as spam, as many legitimate, solicited email messages
are sent in bulk. However, it is a worthwhile clue to consider.
What is the content of the email?
Finally, the content of an email may provide clues
to whether it was unsolicited. Regarding content, the first giveaway is
usually the curious grammar and word choice that spammers seem to employ.
Certain patterns, such as liberal use of ALL CAPS and multiple
exclamation points (!!!), are often favored by spammers. As we discussed
earlier, spam as advertising attracts certain businesses more than others.
Many of these common types of email, such as multilevel marketing
messages, are often less than fully legal attempts to involve consumers in
schemes that may be completely fraudulent. An analysis of the content
involves verifying whether the email message:
* Advertises something for sale
* Offers money-making opportunities
* Advertises pornographic web sites or products
* Contains offensive material
* Otherwise follows patterns typical of many other already-identified spam messages
* Contains or attaches suspicious software code
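A few of the surface patterns mentioned above can be measured mechanically. The following is a minimal sketch, not a description of any production filter: the function name, the phrase list, and the choice of signals are assumptions made for illustration. It only reports the signals; as with every clue in this section, none of them decides on its own whether a message is spam.

```python
import re

# Hypothetical phrase list, chosen only for illustration.
MONEY_PHRASES = ("make money", "work from home", "get rich")


def content_signals(subject, body):
    """Measure a few surface patterns that often appear in spam."""
    text = f"{subject}\n{body}"
    # Words of two or more letters, so stray single capitals do not count.
    words = re.findall(r"[A-Za-z]{2,}", text)
    caps_words = [w for w in words if w.isupper()]
    return {
        # Share of words written entirely in capitals.
        "caps_ratio": len(caps_words) / len(words) if words else 0.0,
        # Number of runs of two or more exclamation points.
        "exclamation_runs": len(re.findall(r"!{2,}", text)),
        # How many of the listed phrases occur anywhere in the text.
        "money_phrases": sum(p in text.lower() for p in MONEY_PHRASES),
    }
```

An analyst could use numbers like these to sort a queue of questionable messages, and then apply the qualitative checks described in this section to the ones that stand out.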
Another thing to check is whether the apparent
"legitimate business" referred to in the content of the message
has a Web-based email account while advertising a business domain within
the message body. The clues discussed in this section are just some of
those used by the operation center staff to determine whether a message is spam.
No single clue alone causes operation center staff to treat a piece of email as spam.
If we could feasibly ask each user whether they had requested the
email, we would not need to use these indirect clues. A certain amount of
judgment is required to decide whether a specific email message is
unsolicited and whether it's spam. The figure below shows
the basic parts of a spam message, illustrating the discussion in this
section.

Traditional Anti-Spam Methods
The most logical and practical places to filter
email for spam are in the mail user agent (MUA) or mail transfer agent (MTA),
but the two are by no means equally effective. MUAs are the client
applications that allow users to retrieve and send mail from their
computers. Common MUAs include Netscape Messenger, Microsoft Outlook, and
Eudora. MTAs are like post offices; they are programs that reside on mail
servers and are responsible for routing and sometimes delivering mail. MTA-
and MUA-based filtering is usually based on the header information, the
mailer type, or the IP address or domain name of the sender.
Filtering at the MUA level requires that email users explicitly create anti-spam filters on their machines. This approach has a number of shortcomings. First, the onus of the anti-spam work is placed on the recipient. This is not only time-consuming, but also largely ineffective. Email users typically do not have the expertise to create effective filters, nor do they have access to the most current spam. Filters based on past spam will generally be ineffective in blocking current spam, as spammers constantly change their messages to avoid such filters. Any attempt to combat the flood of spam must itself leverage the power of the same networks that the spammers exploit, and must operate in real time, around the clock. The filters users create quickly become outdated because they are fighting yesterday's spam. In addition, by the time spam hits the user's machine, much of the damage is already done, in terms of storage costs on the mail server.
Filtering in the MTA, on the other hand, is often
accomplished by adding rules to the configuration for the specific mail
system running on the server. MTA level filtering is more effective than
MUA filtering because it enables filtering for a larger number of mail
accounts from a central point for administration. The drawback in this
case is that users need to provide spam messages and other information to
the email administrators so that current information can be incorporated
into an organization-wide filtering list. This method requires continuous
maintenance to keep the filter list current and effective, because it is
built in reaction to spamming activity. The filters are drawn from only
one ISP, and lack input on the types of spam circulating in the rest of
the Internet. Another problem is the tendency to identify "false
positives," cases in which legitimate mail is incorrectly identified
and filtered as spam. If the filter list is not made with care, or if
domains are incorrectly blocked, valid email messages are discarded along
with the spam.
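The false-positive risk is easy to see in a toy version of a domain blocklist. The sketch below is an assumption-laden illustration in Python, not the configuration syntax of any particular MTA; the function names and the example domains are hypothetical.

```python
def sender_domain(address):
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def is_blocked(sender, blocked_domains):
    """Return True if the sender's domain, or a parent of it, is on the list."""
    domain = sender_domain(sender)
    # Match the listed domain itself or any subdomain of it.
    return any(
        domain == blocked or domain.endswith("." + blocked)
        for blocked in blocked_domains
    )
```

If an administrator reacts to one spam run by adding a large shared domain to `blocked_domains`, every legitimate sender at that domain is filtered along with the spammer, while the spammer can simply move to a domain that is not yet on the list.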
Whether it's MUA filtering or MTA filtering, the
same essential problems exist. For individual ISPs and email users, the
available information about current spam attacks is limited, and the
Internet represents a huge playing field. Traditional measures both block
legitimate email and reduce productivity because service providers' staff
and the user community need to continually devote time to fighting this
problem. It's a never-ending battle because spammers' techniques and tools
are always changing.
Additionally, once a particular domain is blocked,
it is trivial for a spammer to obtain another one and resume spam attacks.
Because persistent spammers can easily obtain new IP addresses and new
domain names on a daily basis, reactive blocking and filtering is futile,
like trying to hit a moving target.
MC's Employed Solution
Our anti-spam solution is a server-side, Internet-wide
solution that actively seeks out, identifies, analyzes, and ultimately
defuses spam attacks before they can overwhelm networks and irritate
email users. Furthermore, it is part of a comprehensive solution that
blocks viruses and other threats that arrive via email.
This solution uses filters that are based on human
and/or machine analysis to determine if email messages should be routed
normally, sidelined, or modified. It combines service and software
components, and automated and human-directed functions, to forge the
best defense against spam. The main service components are the "Probe
Network" and the Operations Center (OC).
Together these components add up to a dynamic and effective
solution to the spam problem, one that takes the guesswork out of spam
identification.
Probe Network
The Probe Network is a large collection of email
accounts with a statistical reach of over 100 million email addresses. The
email accounts in this pool are created worldwide and include addresses
hosted by some of the largest ISPs in the world. The email accounts that
are used for detection are called probe accounts. Probe accounts are the
first step in the real-time detection and analysis of spam. They attract
spam. As mentioned earlier, spammers are quite resourceful in their
harvesting of email addresses.
Many of the probe accounts, therefore, are
strategically seeded to attract and catch large quantities of spam.
Knowing where spammers go to collect email addresses helps to strengthen
the Probe Network. As a result, spammers never know if they are sending
mail to an unsuspecting recipient or to a probe account.
The structure of the Probe Network also provides
powerful evidence that helps to judge if a message is spam. This virtual
"net" of numerous accounts spread all over the Internet makes it
easy for us to quickly verify that a given message was sent using bulk
methods. When the same questionable message is caught by different probes,
alarms go off and we can take action.
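The idea of the same message turning up at several probes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual Probe Network implementation: the class name, the threshold of three probes, and the exact-match fingerprint are all hypothetical, and real spam is often varied from copy to copy, which an exact hash would miss.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body):
    """Hash the body after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ProbeTracker:
    """Track which probe accounts have received each message fingerprint."""

    def __init__(self, threshold=3):
        # Hypothetical threshold: distinct probes needed to raise an alarm.
        self.threshold = threshold
        self._probes_by_fingerprint = defaultdict(set)

    def record(self, probe_account, body):
        """Record a sighting; return True once enough distinct probes saw it."""
        seen = self._probes_by_fingerprint[fingerprint(body)]
        seen.add(probe_account)
        return len(seen) >= self.threshold
```

Counting distinct probe accounts rather than raw copies matters here: several copies delivered to one account say little, while one copy each at widely separated accounts is strong evidence of bulk delivery.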
The Operations Center
When a probe account detects a possible spam attack
on the Internet, the probe immediately routes the message to the
Operations Center (OC), a spam-analysis center staffed round-the-clock,
365 days a year. The OC consists of a dedicated team of email experts
whose mission is to provide swift, accurate responses to spam threats and to
proactively research and develop technologies that eliminate future
threats. Their duties include:
* Analyzing incoming email from the Probe Network
* Developing, validating, and transmitting anti-spam rules to our mail servers
* Managing and seeding the accounts in the Probe Network
* Researching spam attacks
* Collecting statistics and information to evaluate the effectiveness of filtering servers
The experts at the OC are another example of what sets our filtering service apart from other filtering systems. As we saw earlier, certain qualitative skills are essential to accurately distinguish spam from legitimate email. Most email users won't tolerate losing legitimate mail to the fight against spam. Our extremely low false positive rate is a direct result of the incorporation of the OC into the anti-spam process. The OC serves as an intelligent buffer between the spammer and the unwilling recipient of spam.
This added intelligence, however, doesn't come at
the expense of privacy. The OC only has access to mail addressed to the
probe accounts. The specialists at the OC have no access to email users'
personal email. In the end, the email user has final say. Our customers
can access a list of all blocked emails via our online control panel. We
refer to these spam messages as grey mail.
Email scanning
Using updated anti-spam rules transmitted from the
OC, the servers check the headers, contents, and other information in each
message and identify grey mail (suspected spam). The grey mail is routed
to a special storage area.
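The scanning step can be pictured as applying a list of rules to each message and routing it accordingly. The sketch below is only an illustration: the `Rule` structure, the `route` function, and the rule format are assumptions, since the actual format of the rules sent by the OC is not described here. For brevity it handles single-part text messages only.

```python
import re
from dataclasses import dataclass
from email import message_from_string


@dataclass(frozen=True)
class Rule:
    """A hypothetical anti-spam rule, as it might arrive from the OC."""
    name: str
    field: str    # a header name such as "Subject", or "body"
    pattern: str  # regular expression, matched case-insensitively


def route(raw_message, rules):
    """Return ("grey", rule_name) for suspected spam, else ("deliver", None)."""
    msg = message_from_string(raw_message)
    for rule in rules:
        if rule.field == "body":
            # Multipart messages are skipped in this simplified sketch.
            target = "" if msg.is_multipart() else msg.get_payload()
        else:
            target = str(msg.get(rule.field, ""))
        if re.search(rule.pattern, target, re.IGNORECASE):
            return ("grey", rule.name)
    return ("deliver", None)
```

Returning the name of the matching rule alongside the routing decision makes it possible to show users why a message was held when they review their grey mail.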
Mail Flow with the anti-spam filtering
The diagram below shows how the mail flow process works:

Summary
This unique anti-spam solution can be summarized in three
steps:
1. Find Spam
First, spam is actively sought using a probe
network, an extensive array of dedicated email accounts with a statistical
reach of over 100 million Internet addresses.
2. Identify Spam
When the probe network finds possible spam, it
forwards that email to the OC. There, spam experts verify that the email
is spam and write rules to block it. They send those rules to the scanning
servers.
3. Stop Spam
Using updated rules from the OC, scanning servers
identify and filter spam messages from incoming email. Grey mail is
diverted to a special storage area, where users can review it via our online
control panel.