Kristian Hermansen wrote:
> How does Gmail do it? Do they utilize the fact that millions of their
> users (agents) help in the learning process of what is 'spam' by
> clicking that 'Report Spam' button?
...

And in a later posting:
> Sender Reputation in a Large Webmail Service, Bradley Taylor, Third
> Conference on Email and Anti-Spam (CEAS 2006), 2006
>
> Short read too:
> http://www.ceas.cc/2006/19.pdf

That was an interesting read. Their technique is pretty simple, and essentially it does work as you originally speculated before you found the paper. The technique can be summed up as:

First, they determine who the sending party is. Unlike most spam filtering systems, they avoid relying on IP addresses. Instead they depend heavily on SPF[1] and DomainKeys[2]. Because these mechanisms aren't widely used yet, they expand the scope of SPF by using a "best guess" rule[3] to figure out whether the sending machine's IP address is a likely match for the domain. According to their stats, only 26% of the non-spam messages they receive can't be authenticated using one or more of these techniques, while only about 40% of the spam can be authenticated.

1. http://www.openspf.org/
2. http://www.ietf.org/html.charters/dkim-charter.html
3. http://www.openspf.org/FAQ/Best_guess_record

Next, they calculate a "reputation" for the sender, which is a percentage showing how non-spammy they are. (0% is all spam, 100% is all non-spam.) Feeding into that calculation are the counts of users marking messages from that sender as spam, or not-spam, as well as stats showing how past messages from that sender were classified.

Their charts show that senders tend to cluster towards the top or bottom of the spectrum. Most are either below 5% or above 80%.

If the reputation is below a threshold, say 5%, it's spam. If it's above another threshold, say 80%, it's non-spam. All the stuff that falls in the middle gets sent to a statistical filter. (The paper didn't mention which filter.
Similarly, the paper doesn't address which other anti-spam techniques, like greylisting, Gmail may or may not be using.)

So largely they depend on their users to determine whether a sender is spammy. (The paper seems to suggest that while the votes from all users are used in aggregate to calculate a sender's reputation, if an individual marks a sender a certain way, mail from that sender will be sorted accordingly for that specific user. In other words, individual users have their own white lists and black lists that override the normal formula.)

Their system seems to be heavily dependent on their ability to authenticate the sender. Oddly absent from the paper is a discussion of what they do about the senders that can't be authenticated (26% of non-spam and 60% of spam). I wonder how they are even counting those senders (in their stats), if they can't determine who they are, and they aren't falling back on using IP addresses. They could be counting thousands of fictitious domains as unique senders if they're only looking at the domain.

The paper says one of the challenges to their system is that some users don't log in to the web UI, and thus never classify messages. Perhaps some day they'll switch from POP to IMAP (so users can remotely browse their spam folder), and provide something like a Thunderbird extension so users can classify messages.

The paper concludes by comparing their system to several existing systems like SpamCop, Return Path's Sender Score, and Habeas' SenderIndex, some of which return a binary spam/not-spam indicator, and a few that return a score. But again they pointed out that these systems rely on the sender's IP address and say, "Using the authenticated domain, rather than the IP address though, would be a welcome improvement to these systems."
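
To make the reputation-and-threshold scheme summarized above a bit more concrete, here is a rough Python sketch. It is not the paper's formula, which isn't reproduced in this message. It assumes reputation is simply the share of "non-spam" signals among all signals for an authenticated domain, uses the 5% and 80% example thresholds from above, and lets a per-user list override the aggregate score. The names `SenderStats`, `classify`, and `user_overrides`, and the choice to send senders with no history to the statistical filter, are all made up for illustration.

```python
from dataclasses import dataclass

SPAM_THRESHOLD = 5.0       # "say 5%": below this, treat as spam
NOT_SPAM_THRESHOLD = 80.0  # "say 80%": above this, treat as non-spam


@dataclass
class SenderStats:
    """Aggregate counts for one authenticated sending domain (hypothetical layout)."""
    reported_spam: int = 0      # users marked a message as spam
    reported_not_spam: int = 0  # users marked a message as not-spam
    auto_spam: int = 0          # past messages classified as spam
    auto_not_spam: int = 0      # past messages classified as non-spam

    def reputation(self):
        """Return the percentage of non-spam signals, or None with no history.

        Assumption: a plain ratio of good signals to all signals.
        """
        good = self.reported_not_spam + self.auto_not_spam
        total = good + self.reported_spam + self.auto_spam
        if total == 0:
            return None
        return 100.0 * good / total


def classify(domain, stats_by_domain, user_overrides=None):
    """Return "spam", "not_spam", or "statistical_filter" for a sender domain."""
    # A user's own white list / black list wins over the aggregate score.
    if user_overrides and domain in user_overrides:
        return user_overrides[domain]

    stats = stats_by_domain.get(domain)
    rep = stats.reputation() if stats is not None else None
    if rep is None:
        # Assumption: senders with no history go to the statistical filter.
        return "statistical_filter"
    if rep < SPAM_THRESHOLD:
        return "spam"
    if rep > NOT_SPAM_THRESHOLD:
        return "not_spam"
    # Everything in the middle is left for the statistical filter.
    return "statistical_filter"


if __name__ == "__main__":
    stats = {
        "good.example": SenderStats(reported_spam=1, reported_not_spam=9, auto_not_spam=90),
        "bad.example": SenderStats(reported_spam=50, auto_spam=48, auto_not_spam=2),
    }
    for name in ("good.example", "bad.example", "unknown.example"):
        print(name, classify(name, stats))
```

Keying the table on the authenticated domain rather than the IP address is the part that matches the paper's approach as described above. How the counts are weighted or aged is a guess here and would need tuning against real mail.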
The author of the paper seems to be almost disappointed that Google has amassed this database of information on senders, but doesn't want to share it with the public, and he encourages the development of an open system that applies the same techniques: "It would be nice if a third-party service could provide something similar that everyone could use."

Could be an interesting project... I have an in-house developed anti-spam proxy that I use on our mail server, and I'll probably try incorporating some of these techniques.

> However, Gmail catches them every time :-)

I wouldn't say every time. But it does a darn good job. I primarily use Gmail via POP, so it is inconvenient to reclassify messages, but I have done so on a few occasions - both for false positives and false negatives. Perhaps a few times a quarter I'll get some spam, and more frequently when the spam is redistributed by mailing lists.

 -Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.