Bogofilter

Bogofilter Mail Filtering Solution

By Chris Smith

This guide was written so that bogofilter can be added to the setup described in the "Email System for the Home Network" guide. It shows that bogofilter works for both client-side and server-side filtering while still leaving the user in total control.

The script contained in this guide depends on most of this guide being followed word for word. Feel free to edit and modify my guide and script for your own use; just post on this thread and let us know what you're doing with it. We're very interested to see where this goes.

All code contained in this documentation is released under the GNU General Public License. Right... Here we go!

root@server # emerge bogofilter

Bogofilter Instructions

1. Make the spam maildirs:
user@server $ cd ~
user@server $ maildirmake -f Spam .maildir/
user@server $ maildirmake -f Spam.False-Positives .maildir/
user@server $ maildirmake -f Spam.False-Negatives .maildir/

NOTE: If you change these names, I hope you know Python, as you will need to hack the script so it knows which maildirs to treat as spam.

2. Load your mail client and move ALL your spam mail out of your normal directories, and into the Spam directory.
3. OPTIONAL: If you have a LOT of mail overall (i.e. thousands of messages, not just spam), you may choose to create a "Ham" directory and put a selection of a few hundred legitimate messages in it.
user@server $ cd ~
user@server $ maildirmake -f Ham .maildir/

The script will auto-detect the presence of a .Ham directory and, if it finds one, train on that instead of walking all your maildirs.

4. Copy the following script, and name it as ~/bin/bogotrainer
File: ~/bin/bogotrainer
#!/usr/bin/python
import os, os.path

#Configuration entries. Not much ATM. More if needed.
bogodir = "~/.bogofilter/"
maildir = "~/.maildir/"
hams = ['.Spam', '.Trash', 'courierimaphieracl', 'courierimapkeywords']

#Leave everything below here unless you want to do some hacking :)
needdbs = 0
bogodir = os.path.expanduser(bogodir)
maildir = os.path.expanduser(maildir)

def cleanhamdirs(dir):
    for ham in hams:
        if (os.path.split(dir)[-1][:len(ham)] == ham): return 0
    if os.path.split(dir)[-1] in ['cur', 'tmp', 'new']: return 0
    return 1

if os.path.isdir(bogodir):
 	print "Bogofilter directory found"
 	#Newer versions of Bogofilter (0.14 and newer) use, by default, a
 	#combined wordlist.db, which replaces goodlist.db (aka hamlist.db)
 	#and badlist.db (aka spamlist.db). Original code is as follows:
 	#------------
 	#I'm just assuming if the spamlist.db exists, goodlist.db does too
 	#Program will die if goodlist.db doesn't exist anyway.
 	#if os.path.isfile(os.path.join(bogodir, "spamlist.db")):
 	#	print "Databases found"
 	#else:
 	#	print "Databases NOT found. Generating..."
 	#	needdbs = 1
 	#------------
 	#New code is as follows:
 	if os.path.isfile(os.path.join(bogodir, "wordlist.db")):
 		print "Database found"
 	else:
 		print "Database NOT found. Generating..."
 		needdbs = 1	
else:
 	print "Bogofilter directory NOT found. Generating..."
 	needdbs = 1

if needdbs:
 	print "Generating databases:"
 	print "Registering spam messages from", os.path.join(maildir,".Spam/cur")
 	spamlist = os.listdir(os.path.join(maildir,".Spam/cur"))
 	for spam in spamlist:
 		spampath = os.path.join(maildir,".Spam/cur/",spam)
 		print "- ", spampath
 		os.system("bogofilter -s < " + spampath)
 	if os.path.isdir(os.path.join(maildir, ".Ham")):
 		#If a specific .Ham dir exists, use that.
 		print "Registering ham messages from", os.path.join(maildir,".Ham/cur")
 		hamlist = os.listdir(os.path.join(maildir,".Ham/cur"))
 		for ham in hamlist:
 			hampath = os.path.join(maildir,".Ham/cur",ham)
 			print "- ", hampath
 			os.system("bogofilter -n < " + hampath)
 	else:
 		#Or else, use everything that isn't spam!
 		print "Registering ham messages from", os.path.join(maildir,"cur")
 		hamlist = os.listdir(os.path.join(maildir,"cur"))
 		for ham in hamlist:
 			hampath = os.path.join(maildir,"cur",ham)
 			print "- ", hampath
 			os.system("bogofilter -n < " + hampath)
 		maildirs = [os.path.join(maildir,dir) for dir in os.listdir(maildir)]
 		maildirs = filter(os.path.isdir, maildirs)
 		maildirs = filter(cleanhamdirs, maildirs)
 		for dir in maildirs:
 			print "Registering ham messages from", dir
 			hamlist = os.listdir(os.path.join(dir,"cur"))
 			for ham in hamlist:
 				hampath = os.path.join(dir,"cur",ham)
 				print "- ", hampath
 				os.system("bogofilter -n < " + hampath)

# So, everything exists, this must be an "updating run", easy!
# First, correct misdetected ham from the false-positives directory,
# and move it into the inbox.
print "Correcting ham messages from", os.path.join(maildir,".Spam.False-Positives")
hamlist = os.listdir(os.path.join(maildir,".Spam.False-Positives/cur"))
for ham in hamlist:
 	hampath = os.path.join(maildir,".Spam.False-Positives/cur",ham)
 	print "- ", hampath
 	os.system("bogofilter -Sn < " + hampath)
 	#Feed it back through procmail :)
 	os.system("/usr/bin/procmail -d $USER < " + hampath)
 	os.remove(hampath)

# Now, correct misdetected spam, and put it in the Spam maildir :)
print "Correcting spam messages from", os.path.join(maildir,".Spam.False-Negatives")
spamlist = os.listdir(os.path.join(maildir,".Spam.False-Negatives/cur"))
for spam in spamlist:
 	spampath = os.path.join(maildir,".Spam.False-Negatives/cur",spam)
 	print "- ", spampath
 	os.system("bogofilter -Ns < " + spampath)
 	#Don't bother procmailing it, put it in spam! :)
 	os.rename(spampath, os.path.join(maildir,".Spam/cur",spam))
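
BEWARE: the script above won't work on maildir folders or messages whose names contain spaces or other shell metacharacters. The version below wraps each path in double quotes, which copes with spaces and single quotes (but not with names containing double quotes, $ or backticks); use it instead (dmr)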
File: ~/bin/bogotrainer (fixed)
#!/usr/bin/python
import os, os.path

#Configuration entries. Not much ATM. More if needed.
bogodir = "~/.bogofilter/"
maildir = "~/.maildir/"
hams = ['.Spam', '.Trash', 'courierimaphieracl', 'courierimapkeywords']

#Leave everything below here unless you want to do some hacking :)
needdbs = 0
bogodir = os.path.expanduser(bogodir)
maildir = os.path.expanduser(maildir)

def cleanhamdirs(dir):
    for ham in hams:
        if (os.path.split(dir)[-1][:len(ham)] == ham): return 0
    if os.path.split(dir)[-1] in ['cur', 'tmp', 'new']: return 0
    return 1

if os.path.isdir(bogodir):
 	print "Bogofilter directory found"
 	#Newer versions of Bogofilter (0.14 and newer) use, by default, a
 	#combined wordlist.db, which replaces goodlist.db (aka hamlist.db)
 	#and badlist.db (aka spamlist.db). Original code is as follows:
 	#------------
 	#I'm just assuming if the spamlist.db exists, goodlist.db does too
 	#Program will die if goodlist.db doesn't exist anyway.
 	#if os.path.isfile(os.path.join(bogodir, "spamlist.db")):
 	#	print "Databases found"
 	#else:
 	#	print "Databases NOT found. Generating..."
 	#	needdbs = 1
 	#------------
 	#New code is as follows:
 	if os.path.isfile(os.path.join(bogodir, "wordlist.db")):
 		print "Database found"
 	else:
 		print "Database NOT found. Generating..."
 		needdbs = 1	
else:
 	print "Bogofilter directory NOT found. Generating..."
 	needdbs = 1

if needdbs:
 	print "Generating databases:"
 	print "Registering spam messages from", os.path.join(maildir,".Spam/cur")
 	spamlist = os.listdir(os.path.join(maildir,".Spam/cur"))
 	for spam in spamlist:
 		spampath = os.path.join(maildir,".Spam/cur/",spam)
 		print "- ", spampath
 		os.system('%s "%s"' % ("bogofilter -s <", spampath))
 	if os.path.isdir(os.path.join(maildir, ".Ham")):
 		#If a specific .Ham dir exists, use that.
 		print "Registering ham messages from", os.path.join(maildir,".Ham/cur")
 		hamlist = os.listdir(os.path.join(maildir,".Ham/cur"))
 		for ham in hamlist:
 			hampath = os.path.join(maildir,".Ham/cur",ham)
 			print "- ", hampath
	 		os.system('%s "%s"' % ("bogofilter -n <", hampath))
 	else:
 		#Or else, use everything that isn't spam!
 		print "Registering ham messages from", os.path.join(maildir,"cur")
 		hamlist = os.listdir(os.path.join(maildir,"cur"))
 		for ham in hamlist:
 			hampath = os.path.join(maildir,"cur",ham)
 			print "- ", hampath
	 		os.system('%s "%s"' % ("bogofilter -n <", hampath))
 		maildirs = [os.path.join(maildir,dir) for dir in os.listdir(maildir)]
 		maildirs = filter(os.path.isdir, maildirs)
 		maildirs = filter(cleanhamdirs, maildirs)
 		for dir in maildirs:
 			print "Registering ham messages from", dir
 			hamlist = os.listdir(os.path.join(dir,"cur"))
 			for ham in hamlist:
 				hampath = os.path.join(dir,"cur",ham)
 				print "- ", hampath
		 		os.system('%s "%s"' % ("bogofilter -n <", hampath))

# So, everything exists, this must be an "updating run", easy!
# First, correct misdetected ham from the false-positives directory,
# and move it into the inbox.
print "Correcting ham messages from", os.path.join(maildir,".Spam.False-Positives")
hamlist = os.listdir(os.path.join(maildir,".Spam.False-Positives/cur"))
for ham in hamlist:
 	hampath = os.path.join(maildir,".Spam.False-Positives/cur",ham)
 	print "- ", hampath
	os.system('%s "%s"' % ("bogofilter -Sn <", hampath))
 	#Feed it back through procmail :)
	os.system('%s "%s"' % ("/usr/bin/procmail -d $USER <", hampath))
 	os.remove(hampath)

# Now, correct misdetected spam, and put it in the Spam maildir :)
print "Correcting spam messages from", os.path.join(maildir,".Spam.False-Negatives")
spamlist = os.listdir(os.path.join(maildir,".Spam.False-Negatives/cur"))
for spam in spamlist:
 	spampath = os.path.join(maildir,".Spam.False-Negatives/cur",spam)
 	print "- ", spampath
	os.system('%s "%s"' % ("bogofilter -Ns <", spampath))
 	#Don't bother procmailing it, put it in spam! :)
 	os.rename(spampath, os.path.join(maildir,".Spam/cur",spam))
5. Now, make the script executable:
user@server $ chmod +x ~/bin/bogotrainer
6. If bogofilter has been trained before, the script won't overwrite the existing database (so it's cronjob-able), but it's a good idea to start fresh.
user@server $ rm -rf ~/.bogofilter
7. Run the script, and wait while it takes in all of your mail and builds its databases. Bogofilter is quite fast, so it shouldn't take too long, and you get to see its progress!
user@server $ ~/bin/bogotrainer
8. Add these recipes before all your other recipes:
File: ~/.procmailrc
 #Bogofilter Filtering Solution
 :0fw
 | bogofilter -u -e -p

 :0e
 { EXITCODE=75 HOST }

 :0:
 * ^X-Bogosity: Spam,
 .Spam/
9. Add this line to your crontab:
0 23 * * * ~/bin/bogotrainer >/dev/null 2>&1

This sets it to run once a day at 11pm; you can change the time if you like. Once a day is about right.

10. Done! Now you have two sub-directories under Spam which you can use to train bogofilter as you see fit, right from your mail client.

When you receive a mail that bogofilter moves to your spam directory but that isn't actually spam, move it into the False-Positives dir in your email client. You can either run the script immediately, or wait until the cronjob triggers. It retrains bogofilter correctly, then feeds the mail back through procmail for proper classification. If it happens again, don't ignore it: put the mail back in the False-Positives dir and run the script again until bogofilter learns it correctly!

When you receive a spam in your inbox, move it into the False-Negatives directory. The next time the script is run, it will retrain bogofilter to recognise that mail as spam and then move the mail into your .Spam maildir.
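
The two correction passes described above can be sketched without going through a shell at all, which sidesteps the file name quoting problem mentioned in step 4. This is only a sketch under assumptions, not a replacement for the script: it targets Python 3 rather than the Python 2 used in the guide, the function names and the `deliver_cmd` parameter are made up for illustration, and it deliberately differs from the original by deleting a false positive only if redelivery exits with status 0.

```python
import os
import shutil
import subprocess


def feed(cmd, path):
    """Run cmd (a list of arguments) with the message file on stdin."""
    with open(path, "rb") as msg:
        # No shell is involved, so the file name needs no quoting.
        return subprocess.call(list(cmd), stdin=msg)


def correct_false_negatives(maildir, train_cmd=("bogofilter", "-Ns")):
    """Retrain on missed spam, then move each message into .Spam/cur."""
    src = os.path.join(maildir, ".Spam.False-Negatives", "cur")
    dst = os.path.join(maildir, ".Spam", "cur")
    moved = []
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        feed(train_cmd, path)
        shutil.move(path, os.path.join(dst, name))
        moved.append(name)
    return moved


def correct_false_positives(maildir, deliver_cmd,
                            train_cmd=("bogofilter", "-Sn")):
    """Retrain on wrongly flagged ham, redeliver it, then delete the copy."""
    src = os.path.join(maildir, ".Spam.False-Positives", "cur")
    done = []
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        feed(train_cmd, path)
        # Assumption: only remove the message if redelivery succeeded.
        if feed(deliver_cmd, path) == 0:
            os.remove(path)
            done.append(name)
    return done
```

With the layout from this guide, a call might look like `correct_false_positives(os.path.expanduser("~/.maildir"), ["/usr/bin/procmail", "-d", os.environ["USER"]])`. Because no shell is involved, `$USER` has to be expanded in Python.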

When you feel that your bogofilter is 100% accurate (when it comes to false-positives, you don't want to lose any mail) you can edit your ~/.procmailrc so that when bogofilter detects a mail as spam, it moves it to /dev/null (deleting it). Use with caution! But with that method, you don't even have to look at the filth!

Options

You can change options, such as the classification into Ham, Unsure and Spam, by creating a configuration file that overrides the default values. You can view the variables and their values by executing bogofilter -QQ. The configuration file is specified in the variable config-file. It should contain the path ~/.bogofilter.cf by default.

It can be useful to lower the spam cutoff as spammers begin to vary their contents (e.g. by adding new words to the messages). In most cases the spam mails will be classified correctly after lowering the value, but you should keep training the database so that it recognizes the same or a similar mail with full certainty the next time. To change the spam cutoff, you will need to override the variable spam_cutoff. You should observe what spamicity values misclassified mails have and adjust the value to your needs. The default value only treats a mail as spam at a probability of 99%. It really depends on the mails you are receiving, but a probability of 90% or even 85% would give you better results in many cases.
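
To observe the spamicity of misclassified mails, one option is to read it back from the X-Bogosity header that bogofilter adds in passthrough mode. The sketch below is written for Python 3 and assumes the header carries a field of the form `spamicity=0.930000`; the helper names `spamicity` and `caught_by` are made up for this example, and the scores used to try it out are invented, not measurements.

```python
import email
import re

# Assumption: the header contains a field such as "spamicity=0.930000".
SPAMICITY = re.compile(r"spamicity=([0-9.]+)")


def spamicity(raw_message):
    """Return the spamicity recorded in X-Bogosity, or None if absent."""
    header = email.message_from_string(raw_message).get("X-Bogosity", "")
    match = SPAMICITY.search(header)
    return float(match.group(1)) if match else None


def caught_by(cutoff, scores):
    """Count how many scores would be treated as spam at this cutoff."""
    return sum(1 for score in scores if score >= cutoff)
```

Collecting the scores of everything that ended up in the False-Negatives directory and calling `caught_by` with a few candidate values (0.99, 0.90, 0.85) gives a rough idea of how much missed spam a lower spam_cutoff would have caught. Run the same check against known good mail before committing to a value.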

Conclusion

Well, I think that's about it for this. If there is anything I've forgotten, don't hesitate to drop me a PM. I will give out my email over PM if needed. I may look at updating and streamlining the script soon, so check back here in a little while.

Thanks and References

Thanks a lot to beowulf for creating this awesome guide, and all the other active participants on this thread (Proteus in particular). The community is what makes Gentoo thrive!

The sites I used researching this little project are as follows:

Retrieved from "http://www.gentoo-wiki.info/Bogofilter"

Last modified: Fri, 05 Sep 2008 10:07:00 +0000