
BLU Discuss list archive



procmail and duplicate mail



> Anyone have a simple procmail recipe for eliminating duplicate mail?

First, I'd like to point out the following example included in the
procmail documentation (try 'man procmailex'):

              :0 Wh: msgid.lock
              | formail -D 8192 msgid.cache

This is the canonical duplicate message filter.  It simply tosses any
message whose Message-ID matches one you've already received.  You may
also want to check the procmail mailing list archive at:

  http://www.xray.mpe.mpg.de/mailing-lists/procmail/

which gets this question probably once or twice a day :).
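
For reference, here's the same recipe again with each piece annotated
(the comments are mine, not from the procmailex man page):

##   W            wait for formail and use its exit code: success means
##                a duplicate was found, so the recipe counts as a
##                delivery and the message goes no further
##   h            feed only the header to formail (the body isn't needed)
##   msgid.lock   local lockfile, so parallel deliveries don't race on
##                the cache
## formail -D 8192 msgid.cache keeps roughly 8192 bytes of Message-IDs
## in msgid.cache; it succeeds on a repeat, and otherwise records the
## new ID and fails, letting procmail continue down the rcfile.
:0 Wh: msgid.lock
| formail -D 8192 msgid.cache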

Here are some of my solutions...

[Note: the following examples were cribbed straight from my procmail
configuration, and use several variables that you won't actually see
defined in this message.  If their content is not immediately apparent,
feel free to ask me for clarification.]
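
For illustration, definitions along these lines would do the job; the
variable names match what's used below, but the values are placeholders
rather than my actual settings:

## hypothetical definitions -- adjust names, paths, and sizes to taste
RCDIR=$HOME/.procmail
## folder that collects duplicates
dupedest=dupes
## header field added to messages flagged as duplicates
STATUS_HEADER=X-Dupecheck
## a literal newline, used to terminate LOG entries
NL="
"
## Message-ID cache: approximate size in bytes, and the cache file
msgid_cache_size=8192
msgid_cache_file=$RCDIR/msgid.cache
## file of body checksums for the MD5 check further down
md5_cache_file=$RCDIR/md5.cache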

The following is what I'm actually using.  Rather than just discarding
the message, it adds a note to the log file, marks the message header,
and files the message in my dupes folder (from which it will be
automatically expired at some later date):

##
## MESSAGE-ID CHECK
##

:0
* ^Message-id:
* ? formail -D $msgid_cache_size $msgid_cache_file
{
        LOG="dupecheck: msgid discard$NL"

        :0fwh
        | formail -A "$STATUS_HEADER: msgid duplicate"

        :0
        { FOLDER=$dupedest INCLUDERC=$RCDIR/save.rc }
}
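
The save.rc include isn't shown here either; a minimal stand-in that
just delivers the message to whatever folder $FOLDER names might look
like:

## hypothetical $RCDIR/save.rc -- deliver to $FOLDER, with a local
## lockfile so concurrent deliveries don't collide
:0:
$FOLDER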

The downside to Message-ID checking is that if five people forward you
the exact same thing, each copy arrives with a different Message-ID and
this filter won't catch it.  If you've got spare cycles on your
machine, the following filter may be of interest.

It squeezes redundant whitespace in the message body, converts tabs and
newlines to spaces, and then computes the MD5 checksum of what's left.
It caches the checksum and checks future messages against the cache, so
it will weed out any message whose content duplicates one you've
already seen:

##
## CONTENT MD5 CHECK
##

## get the MD5 checksum for this message
:0b
md5sum=|tr -s '\n\t ' '   '\
       |md5
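
## (Portability note: 'md5' here is the BSD checksum tool.  On a Linux
## box you would probably swap in md5sum and strip the trailing " -" it
## prints for stdin -- an equivalent sketch, not what I actually run:
##
##     md5sum=|tr -s '\n\t ' '   '\
##            |md5sum | awk '{print $1}'
## )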

## if a duplicate checksum exists, dump the message
:0
* ? fgrep -s $md5sum $md5_cache_file
{
        LOG="dupecheck: md5 discard$NL"

        :0fwh
        | formail -A "$STATUS_HEADER: md5 duplicate"

        :0
        { FOLDER=$dupedest INCLUDERC=$RCDIR/save.rc }
}

## Otherwise, add the checksum to the md5 cache and continue to process
## the message.
:0Ehci
| echo "$md5sum" >> $md5_cache_file

## Delete the cache if delivery of this message fails.  This will
## ensure that redelivery attempts won't be rejected.
TRAP="${TRAP:+${TRAP}; } test \$EXITCODE -eq 75 &&
	rm -f $md5_cache_file"

Note that there is an external script, run out of cron, that
periodically truncates the cache file so that it doesn't grow without
bound.
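
That script isn't included here, but a rough sketch of the idea (the
cache path and retention count below are placeholders) would be
something like:

#!/bin/sh
## keep only the most recent checksums so the MD5 cache can't grow
## without bound; run periodically from cron
CACHE=$HOME/.procmail/md5.cache
KEEP=1000

tail -n "$KEEP" "$CACHE" > "$CACHE.tmp" 2>/dev/null &&
        mv "$CACHE.tmp" "$CACHE"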

Isn't this far more information than you wanted? :)

-- Lars


=====
lars at larsshack.org --> http://www.larsshack.org/