
BLU Discuss list archive



logic question



On Tue, 22 Feb 2005, Bill Holt wrote:

> Hello, I have postfix/spam assassin/redhat es4.0 I'm stumped on how to
> seed the bayesian database. The corpus @ wiki is old (don't want to seed
> it with email from 2004), and I am using this machine as a gateway to an
> exchange server. So by the time the email gets to the exchange server,
> It's useless to me. My question is how to get the spam back on the
> gateway for processing. Do I just take spam from users and write rules
> accordingly? I'm a little lost at the best way to approach this. Any
> pointers in the right direction would be greatly appreciated. Thank you,
> Bill

I was just talking to a coworker (and now BLU member) about that this 
morning.  Steve, consider this your answer, too.

Keep in mind that SpamAssassin doesn't just say whether an email is spam
or not; it gives each message a numerical score, and you can do different
things with messages at different scores.  I have mailboxes _SpamMaybe and
_SpamSAYes, where possible and very likely spam, respectively, get dumped.
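
A rough sketch of that score-to-folder split, in shell.  The thresholds
(3 and 8), the INBOX fallback, and the function name are illustrative
assumptions, not values taken from this setup; in practice the delivery
agent does the actual filing.

```shell
#!/bin/bash
# Hypothetical thresholds; tune them to your own mail.
MAYBE_SCORE=3
YES_SCORE=8

# Print the folder a message with the given SpamAssassin score would go to.
# awk does the comparison because scores can be fractional or negative.
folder_for_score() {
    awk -v s="$1" -v maybe="$MAYBE_SCORE" -v yes="$YES_SCORE" 'BEGIN {
        if (s + 0 >= yes + 0)        print "_SpamSAYes"
        else if (s + 0 >= maybe + 0) print "_SpamMaybe"
        else                         print "INBOX"
    }'
}
```

For example, `folder_for_score 4.2` falls between the two thresholds, so
it lands in the "maybe" folder.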

I also have folders SpamSASpam and SpamSAHam.  Whenever I find a message
that wasn't scored highly enough as spam, whether in _SpamMaybe or any
other folder, I move it to SpamSASpam.  Likewise, I copy any non-spam
message that got caught as spam to SpamSAHam.  Then a script on my mail
server trains the database from those folders and moves their contents to
an offline archive file.  This is a cut-down version of that script:

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
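#!/bin/bash
# Train the Bayes database from hand-sorted mbox folders, then archive them.
SRCDIR="$HOME/IMAP"
DSTDIR="$HOME/IMAPARCHIVE"

# Missed spam: learn as spam, append to the archive, then empty the folder.
if [ -s "$SRCDIR/SpamSASpam" ] ; then
    echo "Found spam"
    sa-learn --spam --mbox "$SRCDIR/SpamSASpam"
    cat "$SRCDIR/SpamSASpam" >> "$DSTDIR/SpamSASpam" && : > "$SRCDIR/SpamSASpam"
fi
# Spam already caught: archive only; no training here.
if [ -s "$SRCDIR/_SpamSAYes" ] ; then
    echo "Found spam already caught"
    cat "$SRCDIR/_SpamSAYes" >> "$DSTDIR/SpamSASpam" && : > "$SRCDIR/_SpamSAYes"
fi
# False positives: learn as ham, archive, then empty the folder.
if [ -s "$SRCDIR/SpamSAHam" ] ; then
    echo "Found ham"
    sa-learn --ham --mbox "$SRCDIR/SpamSAHam"
    cat "$SRCDIR/SpamSAHam" >> "$DSTDIR/SpamSAHam" && : > "$SRCDIR/SpamSAHam"
fi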
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NB: This script should really check for procmail's lock files before
copying and truncating the mailboxes, since a message delivered between
those two steps would be lost, but it's just not that big a deal.
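
A minimal sketch of what that locking could look like.  It assumes the
delivering recipes use a local lock file with procmail's default name of
"<mailbox>.lock"; lockfile(1) is the helper that ships with procmail.  The
function names and the noclobber fallback (used only when lockfile is not
installed) are hypothetical additions for illustration.

```shell
#!/bin/bash
# Take a dot-lock next to the mailbox, e.g. SpamSASpam.lock.
acquire_lock() {
    local lock="$1.lock"
    if command -v lockfile >/dev/null 2>&1; then
        # Retry up to 5 times before giving up.
        lockfile -r 5 "$lock"
    else
        # Fallback (assumption): with noclobber set, the redirection
        # fails if the lock file already exists.
        ( set -o noclobber; : > "$lock" ) 2>/dev/null
    fi
}

release_lock() {
    rm -f "$1.lock"
}

# Append mailbox $1 to archive $2 and empty it, all while holding the lock.
archive_locked() {
    local src="$1" dst="$2"
    acquire_lock "$src" || { echo "could not lock $src" >&2; return 1; }
    cat "$src" >> "$dst" && : > "$src"
    local status=$?
    release_lock "$src"
    return "$status"
}
```

Each cat/truncate pair in the script could then become a single call such
as `archive_locked "$SRCDIR/SpamSASpam" "$DSTDIR/SpamSASpam"`.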

You will note that $DSTDIR/SpamSASpam grows indefinitely.  This is a good
thing.  I just had a problem on my system where an update of Perl broke
DB_File (Thank you, SuSE), and all hell broke loose on my bayes files.
Upgrading spamassassin didn't fix them (though the new version is MUCH
better).  I eventually ended up deleting the bayes files, but I had my big,
fat corpus of spam from the past year or so to retrain with.
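
Retraining from the archive might look roughly like this.  It is a sketch,
not the procedure used above: it wipes the database with sa-learn --clear
rather than deleting the files by hand, and the SA_LEARN override and the
function name are hypothetical, added so the steps can be dry-run.

```shell
#!/bin/bash
DSTDIR="${DSTDIR:-$HOME/IMAPARCHIVE}"
# Set SA_LEARN=echo to print the commands instead of running them.
SA_LEARN="${SA_LEARN:-sa-learn}"

retrain_from_archive() {
    # Start from an empty Bayes database.
    "$SA_LEARN" --clear || return 1
    # Feed both archives back in; skip any that is missing or empty.
    if [ -s "$DSTDIR/SpamSASpam" ] ; then
        "$SA_LEARN" --spam --mbox "$DSTDIR/SpamSASpam" || return 1
    fi
    if [ -s "$DSTDIR/SpamSAHam" ] ; then
        "$SA_LEARN" --ham --mbox "$DSTDIR/SpamSAHam" || return 1
    fi
}
```

A dry run with SA_LEARN=echo shows the three sa-learn invocations without
touching the real database.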

WARNING: Bayes won't work well unless you feed it ham, too.  Don't forget
to train both ham and spam.
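
One way to check this: by default SpamAssassin does not use Bayes at all
until it has learned at least 200 spam and 200 ham messages.  The sketch
below reads the output of "sa-learn --dump magic" on stdin and reports the
two counts.  The function name is hypothetical, and the parsing assumes the
usual layout where the count is the third column and the line ends in
"nspam" or "nham".

```shell
#!/bin/bash
# Usage: sa-learn --dump magic | check_bayes_counts [minimum]
check_bayes_counts() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            if (nspam + 0 < min + 0 || nham + 0 < min + 0) {
                print "not enough ham or spam learned yet"
                exit 1
            }
        }'
}
```

The function returns non-zero while either count is below the minimum, so
it can be used as a guard in a cron job.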

You're welcome to my corpus, if it doesn't matter that the emails are
addressed to me rather than to you.  It's about 24MB.

-- 
DDDD   David Kramer         david at thekramers.net       http://thekramers.net
DK KD  
DKK D  It is the business of the future to be dangerous
DK KD  
DDDD                                                              -DJ SPooky




BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.

Boston Linux & Unix / webmaster@blu.org