Welcome to week 2 of the fantasy football-themed weekly tip. I had difficulty tying football into this week, but lo and behold, as if it were a sign from the big cheese, Peyton Manning, himself (whom I drafted, though I missed out on Harrison/Wayne to complete the trifecta), we’ve got yet another fantasy football-themed tip. At this point you may begin taking bets on how long I can keep it up. Two weeks and counting…

This week we’ll be looking at a lovely marriage between the e-mail client Mozilla Thunderbird and sa-learn. Although I will refer to Thunderbird explicitly, this system works with any e-mail client that can use IMAP and labels. Don’t feel left out if you are using Outlook, Mail.App, Eudora (which is on its ninth life), or any e-mail client. A label is an arbitrary assignment to a particular e-mail. For example, hitting “1” on the keyboard may assign “Important” to the message in your inbox while “2” may mean “Work” and so on. In Thunderbird, “j” is a shortcut to label a message as “Junk”. We’ll use this special label to automate SpamAssassin teaching.

Junk Label

Let’s face it. Thunderbird’s junk controls are terrible. Let’s also face the fact that even though this is evidently spam with a link to malware, that doesn’t stop me from forwarding it to my current opponent this week, “Nasty Audibles”. See, we have accidental death/dismemberment/incapacitation clauses every year in our league and I’m down heading into Monday Night Football. I will take every opportunity to guarantee a victory during week 1… forwarding malicious e-mails included.

Toggling “Junk” status in Thunderbird labels a message as junk, but what happens internally? IMAP is designed for synchronization among e-mail clients. Each e-mail client on each computer has access to the same status information such as whether it has been read, replied to, forwarded, deleted, and any arbitrary labels (Important, Work, Junk, etc). These status changes are stored either in the message or in the file name itself. Since the major software changes in February [note: a change from mbox to Maildir++ introduced the one file-per-message style of storage], these status changes are stored in the file name and thus easily accessible by a simple shell script.

Digging deeper, let’s examine a message in our main mailbox in ~/Mail/cur/:

1189385193.M835488P16631V0000000000000905I00BB08BB_0.assmule.apisnetworks.com,S=1290:2,RSk

The last group of characters (RSk) in the comma-delimited set holds the message info. Other metadata in the file name is inconsequential for the purpose of this tip, but you may always read more about it on Dovecot’s wiki. We have three flags in the info field:

  • R: replied to message
  • S: message has been viewed by the e-mail client
  • k: special arbitrary label set by Thunderbird (“non-junk” label)
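As a rough sketch of how a script can read those flags, the info field can be peeled off the file name with plain shell parameter expansion. The helper names maildir_flags and has_flag are made up for this illustration; they are not part of Dovecot or SpamAssassin.

```shell
#!/bin/sh
# Hypothetical helper: print the flag characters that follow ":2," in a
# Maildir file name, or an empty line if the name has no info field.
maildir_flags() {
    case "$1" in
        *:2,*) printf '%s\n' "${1##*:2,}" ;;
        *)     printf '\n' ;;
    esac
}

# Hypothetical helper: succeed if the file name carries the given flag.
has_flag() {
    case "$(maildir_flags "$1")" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

name='1189385193.M835488P16631V0000000000000905I00BB08BB_0.assmule.apisnetworks.com,S=1290:2,RSk'
maildir_flags "$name"                        # prints RSk
has_flag "$name" k && echo "flag k is set"
```

Note that the flags are case sensitive, so has_flag "$name" K would fail here.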

Watch what happens when the message is labeled as “Junk” in Thunderbird:

1189385193.M835488P16631V0000000000000905I00BB08BB_0.assmule.apisnetworks.com,S=1290:2,RSj

One thing changed in the info field: k became j. I know that light bulb has gone off in your head now, but don’t bust out your mat just yet. As I mentioned earlier, labels are arbitrary. How do you find out which label means what? Check ~/Mail/dovecot-keywords. Here are the labels for my IMAP account:

0 unknown-0
1 unknown-1
2 unknown-2
3 unknown-3
4 unknown-4
5 unknown-5
6 unknown-6
7 unknown-7
8 unknown-8
9 Junk
10 NonJunk
11 $Label1
12 $MDNSent
13 $Label2
14 $Forwarded
15 $Label3
16 $Label4
17 $Label5

There are 26 possible labels, 0-25, which correlate to 0 = a… 25 = z. Index 9 is labeled Junk and the 10th letter of the alphabet is j; index 10 is NonJunk and the 11th letter is k. Now, do the labels in the file name make sense? These vary between user accounts. Even though index 9 may be Junk for my e-mail account, it may be index 21 for you. Always double-check dovecot-keywords before assuming which character goes with which label. Treating forwarded e-mails as spam because $Forwarded sits at index 9, which is a “j” in the file name, would be disastrous! Note that labels in the file name are case sensitive. Lowercase a-z are reserved for the labels defined in dovecot-keywords, while uppercase letters such as R, S, and T are the standard Maildir flags. “a” does not have the same meaning as “A” in the info field of the file name.
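To avoid counting letters by hand, the lookup can be scripted. This is only a sketch: keyword_letter is a hypothetical helper, and it assumes dovecot-keywords has the two-column "index name" layout shown above. The sample file in the demonstration is a made-up subset of my listing.

```shell
#!/bin/sh
# Hypothetical helper: keyword_letter NAME [FILE]
# Prints the file-name flag letter for keyword NAME. Exits non-zero if
# the keyword is not listed. FILE defaults to ~/Mail/dovecot-keywords.
keyword_letter() {
    awk -v want="$1" '
        $2 == want {
            # index 0 maps to "a", 1 to "b", ... 25 to "z"
            print substr("abcdefghijklmnopqrstuvwxyz", $1 + 1, 1)
            found = 1
            exit
        }
        END { exit !found }
    ' "${2:-$HOME/Mail/dovecot-keywords}"
}

# Demonstration against a scratch copy of a few lines from the listing.
sample=$(mktemp)
cat > "$sample" <<'EOF'
8 unknown-8
9 Junk
10 NonJunk
14 $Forwarded
EOF
keyword_letter Junk "$sample"       # prints j
keyword_letter NonJunk "$sample"    # prints k
rm -f "$sample"
```

On a real account, something like JUNK=$(keyword_letter Junk) would set the label for the script below, assuming the file lives at the default path.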

Holding status information in the info field of the file name gives Maildir a big advantage in terms of easy manipulation. Let’s use this behavior to periodically move the messages labeled as junk in Thunderbird to our Spam folder and then feed those messages to sa-learn. Teaching SpamAssassin the messages it missed enhances its Bayesian database, which in turn increases the effectiveness of tagging spam. You can use the turnkey SpamAssassin configuration wizard to tweak delivery rules. For example, I deliver messages that score between 5 and 10 into “Spam”, but generate a delivery failure notice. This allows me to inform the user that (a) there was a delivery problem, but (b) I can still return later to the message to determine whether it was spam or ham. If ham, then I can reply to it with a note about the delivery error. Anything scoring above 10 is automatically deleted. Having a well-trained Bayesian database enhances the scoring capability of SpamAssassin.

#!/bin/sh
# Change this to the correct "Junk" label
JUNK=j
# Target mail folder
HOLD_FOLDER=~/Mail/.Spam/cur/
# Number of days to hold messages in mailbox and purgatory after label change
DAYS=7

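# Learn and then delete held messages, skipping any flagged T (trashed).
# find's -regex must match the whole path, hence the leading ".*".
find "$HOLD_FOLDER" -maxdepth 1 -type f -ctime +"$DAYS" -not -regex '.*,[^,]*T[^,]*' -exec sa-learn --spam {} \; -exec rm -f {} \; > /dev/null
# Move inbox messages carrying the JUNK flag into the hold folder.
find ~/Mail/cur/ -maxdepth 1 -type f -ctime +"$DAYS" -regex ".*,[^,]*$JUNK[^,]*" -exec mv {} "$HOLD_FOLDER" \;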

If you would like a status report of which messages were learned, change the first find command (the one that checks $HOLD_FOLDER) to the following. It prints each message’s subject, learns the message once at its first Subject line, and mails the resulting list to you:

(find "$HOLD_FOLDER" -maxdepth 1 -type f -ctime +"$DAYS" -not -regex '.*,[^,]*T[^,]*' -exec awk '($0 ~ /^Subject:/) { print substr($0,10) ; system("sa-learn --spam '{}' > /dev/null"); exit }' {} \; -exec rm -f {} \;) 2>&1 | mail -s "SA Learn Status" msaladna@apisnetworks.com

Of course you would replace msaladna@apisnetworks.com with your current e-mail address. The following options may be configured to meet your needs:

  • HOLD_FOLDER: messages labeled as Junk will be moved to this IMAP folder after DAYS days. Messages in this folder will be fed to sa-learn as spam
  • DAYS: number of days a message will remain labeled as “Junk” or sit in the HOLD_FOLDER before going to the next step
  • JUNK: custom label set by the e-mail client to denote spam. Check ~/Mail/dovecot-keywords for the correct label position.

There are only two commands, but the syntax may be daunting, so let me walk you through what happens. First, we check HOLD_FOLDER for any messages last changed more than DAYS days ago. These are fed to sa-learn as spam and then deleted. A regular expression is used to ensure messages with a T status in the info field are not fed to sa-learn. “T” is another special indicator: it marks a message as trashed, which is what happens to the original copy when you delete a message or move it to another IMAP folder. These dangling messages linger until you compact the folder. In Thunderbird that option is accessible by right-clicking on the mailbox in the left pane and selecting “Compact Folder”. Imagine if you improperly labeled a message as junk and it was moved to HOLD_FOLDER. Shortly after realizing your mistake, you dragged the message out of HOLD_FOLDER back into your main mailbox. If the “T” flag wasn’t checked, then this message would (a) be fed to sa-learn as spam and (b) be deleted from the mailbox.

After HOLD_FOLDER has been purged it’s time to bring in a new batch of messages labeled as “Junk”. The find command will search for all messages labeled with the JUNK flag which are older than DAYS days. Messages matching these criteria will be moved to HOLD_FOLDER… and the cycle repeats itself the next time the script is run.
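Before trusting the script with real mail, it may help to preview what each step would match. The following is a sketch, not the script itself: would_learn and would_move are hypothetical helpers that only print file names, the file names in the demonstration are made up, and the age test is left out so the demonstration works on freshly created files. It anchors on ":2," (the start of the info field) and assumes a find that supports -regex and -maxdepth, such as GNU find.

```shell
#!/bin/sh
# Dry-run helpers: print matching files instead of learning or moving them.

# would_learn DIR: held messages that are NOT flagged T (trashed).
would_learn() {
    find "$1" -maxdepth 1 -type f -not -regex '.*:2,[^,]*T[^,]*' -print
}

# would_move DIR LETTER: messages whose info field contains LETTER.
would_move() {
    find "$1" -maxdepth 1 -type f -regex ".*:2,[^,]*$2[^,]*" -print
}

# Demonstration on a scratch directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/1.host,S=10:2,RSj" "$demo/2.host,S=10:2,S" "$demo/3.host,S=10:2,STj"
echo "Would be moved (flag j):"
would_move "$demo" j | sort        # lists messages 1 and 3
echo "Would be learned (no T flag):"
would_learn "$demo" | sort         # lists messages 1 and 2
rm -rf "$demo"
```

Adding -ctime +"$DAYS" to either find call restricts the preview to messages old enough for the real script to touch.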

You probably want to automate these commands, so paste the code into a file named relearn_spam.sh and upload it to your home directory. Then set up a cronjob in the control panel under “Cronjob Manager” to run the command “sh ~/relearn_spam.sh” at 0 0 * * 0 (every Sunday at midnight). And that’s how you create value between your e-mail client, Thunderbird in my example, and spam filtering.
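If you have shell access and would rather skip the control panel, the same schedule can be written as an ordinary crontab entry. This is a sketch under that assumption: cron_entry is a hypothetical helper, and the install line is left commented out so nothing changes until you decide to run it.

```shell
#!/bin/sh
# Hypothetical helper: print the crontab entry for the schedule above
# (minute 0, hour 0, any day of month, any month, day of week 0 = Sunday).
cron_entry() {
    printf '0 0 * * 0 sh %s/relearn_spam.sh\n' "$HOME"
}

cron_entry

# To install, append the entry to your existing crontab (uncomment to run):
# (crontab -l 2>/dev/null; cron_entry) | crontab -
```

Running crontab -l afterwards should show the new line alongside any jobs you already had.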

Final thoughts: because you’re the one marking missed messages in Thunderbird, it’s a good idea to disable automatic tagging by Thunderbird. Visit Tools -> Account Settings -> <account name> -> Junk Settings and untick “Enable adaptive junk mail controls for this account”. Remember that once a message lands in the junk folder, you have about a week to remove it before it is fed to sa-learn as spam! Be sure to set aside a regular time, like every Friday before you leave the office, to scour the messages and remove anything improperly tagged as junk.

That wraps up the tip for this week. Who knows what’s in store next week, but I’m keeping my fingers crossed that I can continue the fantasy football theme.

Weekly Tip #2: Streamlining SpamAssassin’s Learning Process

5 thoughts on “Weekly Tip #2: Streamlining SpamAssassin’s Learning Process”

  • September 11, 2007 at 11:56 am GMT-0500

    Matt — These tips are great. The version of Eudora that I’m using (only the “8th life” version as of yet) apparently doesn’t pass the tags back to the server, so this method doesn’t currently work. What I’ve been doing is to just move the spams (manually, in the client) to a Spam folder and then run a nightly job to run sa-learn on the spam folder and delete it when done. Since I’m doing the sorting pretty much completely manually, it’s pretty much zero risk of learning a false positive, but on the other hand, I do have to do it all manually.

    The other difference is that in my case, I don’t leave email on the server after I’m done reading, at least not permanently. I often read at the office, do a little processing, then collect things at home, where I keep permanent archives of certain emails. I wind up using both POP and IMAP to manage the emails, mostly because I can’t quite do what I want completely with either. Hopefully Eudora/Penelope will handle things better.

  • September 12, 2007 at 11:17 am GMT-0500

    After a bit of Googling on the subject, it looks like label 15 will be stored server-side. To access the label key mapping, create an empty e-mail and paste <x-eudora-setting:32629> into the message body. It should be transformed into a hyperlink that you can click. For reasons unbeknown to me I don’t see that behavior. It may be because I am running Eudora 7 in sponsored mode. Here’s the exhaustive list of settings in Eudora that you can tweak.

    I hope that helps you and your Model T-equivalent e-mail client out 😉

  • September 16, 2007 at 6:05 pm GMT-0500

    Thanks for the research, Matt. Unfortunately, I found out why clicking on <x-eudora-setting:32629> doesn’t work for me, and might not have worked for you. It seems the <x-eudora-setting> syntax is specific to the Mac version of Eudora, which I don’t have. (The Windows equivalent is So, I looked for the equivalent setting for the Windows version, and apparently there isn’t one. The only thing that I could find was a 2-year-old comment in a Eudora forum that said that the functionality was planned for a “future version” of Eudora, which apparently never happened. There are some comments in some fairly old release notes that make it look like the Junk function should actually do something useful for an IMAP server, but I can’t find anything in the documentation that explains how to get any of that functionality to actually work. Looks like I might have to look at Thunderbird/Penelope/whatever-it’s-called if I want to get this to work.

  • September 16, 2007 at 9:05 pm GMT-0500

    Testing…

  • September 17, 2007 at 9:54 pm GMT-0500

    One comment and one question:

    I’m using a far more manual method of moving the Spam around to be learned, but it has essentially the same effect. The suggestion that I have would be to at least learn the spam more often. It seems that these things come in waves (e.g., all those football spams), and a week later, it seems like a lot of times they’ve subsided. As a result, I run a nightly spam cleanup, which runs sa-learn and deletes the folder. My thought is that by doing this, I’ll teach SA to ignore those new ones faster.

    Now for my question: I started to experiment with “Eudora 8.0” (which is the one built using Thunderbird as a base, with the magic “Penelope” plug-in to make it more Eudora-like). I still don’t have a dovecot-keywords file. Is that normally created by Thunderbird?
