SpamBayes Frequently Asked Questions
Contents

This is a work in progress. Please feel free to ask questions and/or provide answers. For help with SpamBayes, send email to the SpamBayes mailing list.

1 Overview

1.1 What is SpamBayes?

SpamBayes is a tool used to segregate unwanted mail (spam) from the mail you want (ham). Before SpamBayes can be your spam filter of choice you need to train it on representative samples of email you receive. After it's been trained, you use SpamBayes to classify new mail according to its spamminess and hamminess qualities. It's best to train on recent email, because your interests and the nature of what spam looks like change over time.

When SpamBayes filters your email, it compares each unclassified message against the information it saved from training and makes a decision about whether it thinks the message qualifies as ham or spam, or if it's unsure about how to classify the message. It then adds its classification to the message, either by adding a header (X-Spambayes-Classification: spam|ham|unsure), modifying the To: or Subject: headers, or adding a "Spam" field to the message. Depending on which SpamBayes application you are using, it may then filter this message for you, or you can set up your own filters (to file away suspected spam into its own mail folder, for example).

1.2 What is the license? How much does it cost? Can I pay anyway?

SpamBayes is free and open-source - there is no charge. The software is released under the PSF license. If you really feel that your life would be incomplete without giving something back to the project, you have a few options:

1.3 What online resources are available?

There are five mailing lists which support SpamBayes:

All the mailing lists are managed by Mailman, which uses pipermail to archive messages. There is no search capability, so your best bet is to use Google to search for stuff. For example, searching for site:mail.python.org sb_server -checkins would search for messages which mention sb_server but exclude check-in messages.

1.4 How do I subscribe/unsubscribe from the SpamBayes mailing lists?

To subscribe, visit the relevant page referenced in the previous Q&A and follow the directions there.

To unsubscribe, visit the same link. Scroll to the bottom of the page and enter your email address in the last field then click the "Unsubscribe or edit options" button. From there you can simply click the "Unsubscribe" button. A confirmation mail will be sent to your email address. To confirm the unsubscribe request, simply reply to that message.

1.5 What do I need to install SpamBayes?

Unless you want to run from the source code, all you need is the SpamBayes installer. At present, any Windows users using Outlook or using any other mail client and retrieving mail via POP3 can use the installer. At present, all other users must run from source.

If you want to run from source, you must have a recent version of Python installed on your computer, version 2.2 or later. (Don't ask about backporting it to earlier versions of Python. It's almost a certainty this won't happen.) If you need to install Python on your system, check the Python download page for the version appropriate to your computer.

You also need version 2.4.3 or above of the Python "email" package. If you're running Python 2.2.2 or above, then you already have this. If not, you can download it from the email-SIG and install it. Unpack the archive, cd to the email-2.4.3 directory and type "python setup.py install" (YMMV on different platforms). This will install it into your Python site-packages directory.
You'll also need to move aside the standard "email" library - go to your Python "Lib" directory and rename "email" to "email_old".

1.6 Is there a high level summary that shows how SpamBayes works?

There are eight main components to the SpamBayes system:

1.7 Where does all this stuff live?

The filter script is called sb_filter.py. The POP3 proxy lives in sb_server.py; to upload to the server use sb_upload.py. The IMAP filter lives in sb_imapfilter.py. These all live in the scripts directory - Windows users can find non-command-line versions in the windows directory. The Outlook plug-in lives in the Outlook2000 subdirectory - see the README.txt in that directory for more information on that.

As well as these components, there's also a whole pile of utility scripts, test harnesses and so on - see README.txt, README-DEVEL.txt and TESTING.txt in the SpamBayes distribution for more information.

2 Compatibility

2.1 Does SpamBayes work on Windows Vista?

SpamBayes does work on Vista, though you will have to tweak the permissions on:

    c:\program files\spambayes

More details are available here:

    http://mail.python.org/pipermail/spambayes-bugs/2007-January/004119.html

Note that you will probably have to execute the installer with elevated privileges. Right-clicking on the EXE and selecting "Run as Administrator" should work (and will be necessary even if you are logged in as an admin user).

2.2 Does SpamBayes work with Outlook Express?

Outlook Express isn't a version of Outlook, it's a completely separate program (from the same company). Because they give it away for free, Outlook Express is a really stripped down program, and it's extremely difficult to create a plugin for it.

You can use sb_server and/or sb_imapfilter with Outlook Express. Because Outlook Express does not let you filter on arbitrary headers (like X-Spambayes-Classification), sb_server must add the classification to the "To:" line, or the "Subject" line. The configuration page has options that let you do this. You can find detailed instructions about how to do this, step by step in this FAQ question.

Once you've set up sb_server, you also need to create rules (like any other rules) in Outlook Express, to take the appropriate action on mail based on its classification (move spam to a spam folder, for example). You do this by searching the "To:" or "Subject:" line for the classification - since you can't set the rule to look only at the start of the line, we recommend that you search for (e.g.) "spam," rather than simply "spam", so that you don't catch other messages, like ones from the SpamBayes mailing list. Even this can cause troubles if you get messages with subjects like "I get a lot of spam, do you?". In this case, you probably need to alter the tag that SpamBayes uses (to 'SBSpam' or something else), which you can do by editing the configuration file to include the header_spam_string option.

sb_server/sb_imapfilter aren't quite as 'transparent' as the Outlook plugin, but they're still quite easy to use/setup, and they use the same core, so the results will be the same. We are working on an alternative that should move us towards the Outlook simplicity.

2.3 What is ThunderBayes?

ThunderBayes (http://pieces.openpolitics.com/thunderbayes/) is an extension for the Thunderbird email client. It provides a toolbar button similar to Thunderbird's Junk button with which email can be classified as Spam or Ham. Clicking the button causes two things to happen: (1) it sends the source of the selected messages to SpamBayes to be classified and (2) it optionally moves the messages to a folder of your choice (this can be configured in the extension options). It includes a custom version of SpamBayes, and provides a simple preference page in the Thunderbird Account Settings where the SpamBayes POP3 proxy and message filters can be configured.

2.4 What clients will SpamBayes work with in general?

SpamBayes will work with most POP3 or IMAP compatible clients. How you implement depends on your local architecture.

Users with access to procmail can just write a recipe that invokes SpamBayes like this:

    :0fw
    | /opt/spambayes/sb_filter.py

Follow that with a recipe to check the results and take action:

    :0
    * ^X-Spambayes-Classification: spam
    ${MAILDIR}/spam

Emacs and XEmacs both come with VM, one of a choice of several Emacs-based mail packages. Emacs is extensible using Emacs Lisp or Pymacs. This extensibility allows you to easily segregate your incoming mail for training purposes. Here's one such example. If you place the following code in your ~/.vm file:

    ;; Save a copy of the current message as spam, label it, and train on it.
    (defun train-as-spam ()
      (interactive)
      (let ((vm-delete-after-saving nil))
        (vm-save-message (expand-file-name "~/tmp/newspam"))
        (vm-add-message-labels "trained" 1))
      (vm-pipe-message-to-command "sb_filter.py -s >/dev/null" nil))

    ;; Save a copy of the current message as ham, label it, and train on it.
    (defun train-as-nonspam ()
      (interactive)
      (let ((vm-delete-after-saving nil))
        (vm-save-message (expand-file-name "~/tmp/newham"))
        (vm-add-message-labels "trained" 1))
      (vm-pipe-message-to-command "sb_filter.py -g >/dev/null" nil))

    ;; Bind "ls" and "lh" in both the message and summary buffers.
    (define-key vm-mode-map "ls" 'train-as-spam)
    (define-key vm-summary-mode-map "ls" 'train-as-spam)
    (define-key vm-mode-map "lh" 'train-as-nonspam)
    (define-key vm-summary-mode-map "lh" 'train-as-nonspam)

Typing "ls" will save a copy of the current message to ~/tmp/newspam and "lh" will save a copy of the current message to ~/tmp/newham. You can then use those files later as arguments to hammie.py for training. Both commands also add a "trained" label to the message in question as a visual reminder that you've already added the message to your training database.

Users limited to POP3/IMAP communications to the server can use the POP3 proxy or IMAP filter which are part of the SpamBayes package.

2.5 Can I use SpamBayes with AOL Mail?

You can't use SpamBayes directly with the normal AOL Mail interface, but AOL's Mail system is reputed to allow IMAP access. If that proves to be correct you should be able to use the SpamBayes IMAP filter (sb_imapfilter.py) to scrub your AOL mailbox.
Should you try this, reports of your successes or failures on the spambayes@python.org mailing list would be appreciated.

2.6 How do I configure Eudora for use with SpamBayes?

Note: The following instructions have been verified using Eudora 5.1 under Windows. If anyone is using Eudora under Mac OS X please let us know if the configuration is the same as Windows.

Eudora does not allow configuring the server port through the normal options dialogue. However a large number of options are exposed in an initialization file (eudora.ini) read at startup. The contents of the initialization file are documented by clicking on Help->Topics and searching on EUDORA.INI (you may want to print this help page for future reference.) Depending on how you installed Eudora, eudora.ini is located either in the Eudora install directory or the user's setting directory, e.g.:

    C:\Documents and Settings\userid\ApplicationData\Qualcomm\Eudora\eudora.ini

2.7 How do I set up Eudora Filters to distribute mail to separate mailboxes?

Note: the following instructions have been verified using Eudora 6.0.2, Sponsored mode under Windows, and Eudora 6.1, Sponsored mode under Mac OS X.

2.8 Will SpamBayes work with Yahoo! Mail?

If you subscribe to Yahoo! Mail Plus, you have the option of accessing your Yahoo! mail via the POP3 protocol. SpamBayes has a POP3 proxy application called sb_server that will allow you to filter any POP3 mail account using a POP3 client of your choice (Outlook, Outlook Express, Netscape, Mozilla, Eudora, etc.).

If you use the free Yahoo! Mail service, then Yahoo no longer provides POP3 access. However, there is an open-source program called YahooPOPs! that can be used to provide a POP3 interface to the free Yahoo! Mail service. See the YahooPOPs! home page for more information.

3 Outlook Plugin

3.1 What version(s) of Outlook does the plugin work with?

To our knowledge, the current version of the plug-in should work with Windows 98 and above and Outlook 2000 or above. You may be able to get the plug-in to work with Windows 95 if you install the most recent version of Internet Explorer possible, but we are not certain about this.

The troubleshooting guide for the Outlook plugin contains the most up-to-date help for working around known problems.

3.2 Do I have to have Python installed to use SpamBayes with Outlook?

If you use the Outlook plugin binary installer there's no need to explicitly install Python.

3.3 Will SpamBayes work with Outlook connecting to an Exchange server?

Yes. The SpamBayes Outlook plug-in simply watches the folders that you have instructed it to for new mail. When new mail is received, Outlook informs SpamBayes, which then scores the message and performs the actions you have asked it to, depending on the message score. Thus it isn't involved in the delivery of mail, and so has no idea that it is coming from Exchange.

3.4 Can mail marked as spam automatically be marked as read?

Yes. You can find this on the filtering tab of the SpamBayes manager dialog. However, you should also see the envelope icon question.

3.5 Can I back up the Outlook database? Should I do this?

Yes you can do this, and it is a good idea if you don't keep copies of the mail that you have trained. This way, if your database becomes corrupted, you will be able to recover without losing all your data (if you do keep copies of mail that you train, you can simply recreate the database via the SpamBayes manager dialog).

To backup the databases, all you need to do is make a copy of the directory that SpamBayes keeps its data in. If you need to, you can drop the files back in to recover from a corrupted database, or for any other reason.

This directory is located in the "Application Data" directory. You can locate this directory by using the Show Data Folder button on the Advanced tab of the main SpamBayes manager dialog. If you need to locate it by hand, on Windows 2000/XP, it will probably be:

    C:\Documents and Settings\[username]\Application Data\Spambayes

With Windows NT, it will probably be:

    C:\WinNT\profiles\[username]\Application Data

On other versions of Windows, it will probably be:

    C:\Windows\Application Data\Spambayes

Note that the 'Application Data' folder may be hidden.

3.7 I get a message that says I need to "enable SpamBayes", or the enable button is greyed out.

To activate SpamBayes, you need to tick the "enable filtering" button in main SpamBayes dialog. Note that this button will initially be greyed out; you need to have done these things to enable the button:

3.8 How can I get rid of the envelope tray icon for spam?

This is a very difficult thing to do, because Outlook does not expose the hooks that are necessary to cleanly do this (feel free to write to Microsoft and tell them that they should correct this). This means that even if you have set SpamBayes to mark spam as read, the envelope tray icon will not vanish.

Although there is code available that provides a method to delete this icon, it doesn't let us determine whether there is other unread mail as well, which means that we do not know whether we should delete the icon or not. Until someone comes up with a clever solution for all of this, you'll have to put up with the little envelope, sorry.

Note that there is a feature request already open for this (RFE 774978), which you may add to if you have comments to make.

3.9 How can I configure SpamBayes to delete spam rather than moving it?

Sorry, but you can't. However, Outlook has an excellent "auto-archive" facility which can be used to the same effect - simply configure auto-archive to periodically delete your Spam folder. It is recommended that you configure auto-archive to keep at least a few days of Spam around, should the SpamBayes database become corrupt and require you to perform a full re-train.

3.10 Will "Show Spam Clues" notify a spammer that I opened their message?

We think not (but we don't have the source code to Outlook to check for sure). In general, there are two ways spammers can determine this; the first is via an automatic 'Read Receipt' (but this is unusual, as the "from" address is generally forged so the receipt goes nowhere useful). The second common trick is by sending HTML spam that references unique URLs (generally images) on the spammer's server. When this HTML message is rendered, the fetching of these URLs identifies to the spammer the associated email address.

As far as we are aware, SpamBayes does not generate a "read receipt" for "Show Spam Clues", even though the message is marked as read (and we don't know how to prevent it being marked as read). However if you are concerned about this, you are encouraged to configure Outlook's generation of receipts manually. Further, as "Show Spam Clues" does not attempt to render the HTML nor otherwise contact any of the URLs in the message, this will not register a hit on the spammer's server.

3.11 Why can't I set spam to be moved to the Deleted Items folder?

The problem with this is that you can also set SpamBayes to train all messages moved to the designated spam folder. If you set the deleted items folder as the spam folder (early versions of the plug-in allowed this), then all messages that you delete would be trained as spam.

To get this restriction removed, you'll have to convince the developers that there is a way to do this without confusing people - for example, if we let you choose the deleted items folder as the spam folder, only if the 'incremental training' option was off, people would get confused about why it sometimes works and sometimes doesn't.

Note that Outlook 2003 has a "Junk Mail" folder that has many of the deleted items folder's properties, and you can get SpamBayes to move spam to this folder. You may also find some good advice in the answer to the question about getting SpamBayes to delete spam.

3.12 Some of my mail is going missing!

The plug-in can not delete your mail - if it arrived in Outlook, then either it's still there, or something else removed it. Check that the mail isn't in your 'Spam' folder or your 'Unsure' folder, and that you are checking the correct folders (i.e. go to the 'Filtering' tab of the SpamBayes dialog, and click on 'Browse' next to the two movement options and check that you're moving to the folders you think that you are).

You can try to search for the mail, too, using Outlook's 'Advanced Find' (in the 'Tools' menu).
Even if you don't know the subject of the 'missing' mail, you can search for all messages that have arrived since a certain date or time.

Sometimes mail is in the Inbox (or another folder), but not visible. This, too, isn't SpamBayes's doing. You can set Outlook up to only show a subset of the messages in the folder, with a "View Filter". The Outlook documentation has more information about this (although the 'Advanced Find' will still show the message).

Note that if you are using sb_server to receive your mail, then (1) you're reading the wrong section of the FAQ <wink>, and (2) there is a slim chance that SpamBayes is at fault - please use the normal bug reporting procedures.

3.13 Help! I deleted the Unsure/Spam folder.

Firstly, if you haven't emptied your Deleted Items folder, you can probably find the unsure/spam folder in it and recover it that way. The little plus next to the Deleted Items folder will reveal it (if there's no plus, there's no folder to recover), and you can drag it to where it belongs.

Otherwise, you need to open up the SpamBayes Manager dialog, click the "Filtering" tab, and reenter the folder selection (you might need to create a new unsure/spam folder first - just like you'd create a folder normally). The dialog will probably have "<unknown folder>" in the place where the unsure/spam folder used to be. Just click Browse and select the correct folder.

This should always work. If you're really stuck, though, you can reset your configuration completely (you'll have to redo any setup, but not any training) by deleting the appropriate .ini file from your Data Directory. You want to remove the .ini ("Configuration file") that has the same name as your Outlook profile (for some, this will be something generic like "Outlook.ini", or "Outlook Internet Settings.ini"). Just remove or rename this file and your configuration will be reset. Then select the folders as you would normally.

3.14 How can I change the directory that SpamBayes stores my data in?

Instructions for doing this can be found in the "Configuration Guide". (You get to this by doing SpamBayes->Help->About SpamBayes, which opens up a browser window, then clicking the "Configuration Guide" link. The appropriate section is headed "Multiple Configuration Files" and is at the end of the document).

Basically, you need to create a file "default_configuration.ini", and put it either in the bin directory in the directory that SpamBayes was installed into, or in the default data directory (the backup question has instructions for finding this directory). Inside this file, you need to have a section "General", and an option "data_directory", which point to the new location that you wish to use. For example:

    [General]
    data_directory=c:\spambayes_data

3.15 The "Recover From Spam" button no longer appears.

First, check that Outlook isn't hiding it for you. If toolbars run out of space, Outlook automatically hides buttons for you (we have no control over which buttons are hidden). If this has happened there will be little down-arrows at the end of the toolbar - click those and the missing button will appear. To stop this happening, you have to move the SpamBayes toolbar to somewhere where there will always be room for the three buttons.

Next, check that you're in the right folder. This button only appears when you are in the 'unsure' and 'spam' folders. Open up the SpamBayes Manager dialog, and go to the Filtering tab. Click the browse button next to the 'unsure' and 'spam' folders and write down the hierarchy to get to the selected folder (for example, "Personal Folders->Inbox->Possible Junk"). Now, close the Manager dialog, and, in Outlook, navigate to the folders based on the hierarchy that you wrote down (to make sure that you aren't in another folder with the same name).

If neither of these works, then please submit a bug report, attaching a copy of your most recent log file.
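
Returning to the default_configuration.ini file from question 3.14: before restarting Outlook, it can be handy to confirm that the file you wrote is well-formed and that the option is in the right section. The sketch below uses Python's standard configparser module for that check. It is only an illustration and is not how SpamBayes itself loads the file; the helper name read_data_directory is made up for this example.

```python
import configparser

# Same content as the example in question 3.14.
SAMPLE = "[General]\ndata_directory=c:\\spambayes_data\n"


def read_data_directory(text):
    """Return the data_directory option from [General], or None if absent.

    Hypothetical helper for illustration only.
    """
    # Interpolation is disabled so that a '%' in a path is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.get("General", "data_directory", fallback=None)


if __name__ == "__main__":
    print(read_data_directory(SAMPLE))
```

If the function returns None, the section name or the option name is probably misspelled.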

3.16 How do I uninstall the plug-in?

Note that if you simply want to disable the plug-in for a while, you can do this by unticking the "Enable SpamBayes" box on the front page of the "Manager" dialog.

If you installed the plug-in from source, then you simply need to re-run the "addin.py" script, with the parameter "--unregister", and then delete the toolbar in Outlook.

Otherwise, you uninstall the plug-in like you uninstall (almost) any other Windows program, using the "Add/Remove Programs" Control Panel. You'll see the plug-in listed under "S" for "SpamBayes". This will remove all the program files that the plug-in installed, and stop the plug-in from working. It would be a good idea to do this while Outlook is not running.

Note that this does not remove your personal setup files (deliberately). This includes your databases and configuration files. This means that if you later reinstall the plug-in (or a later version), those files will still be there, ready to use. If you wish to remove these as well, you can remove them like you would delete any other file (move them to the Recycle Bin). The backup question explains where you can find these files.

Note that a bug with the plug-in means that the SpamBayes toolbar is not deleted on uninstall - although it will stop working. You can delete this yourself in Outlook: right-click on the toolbar, choose "Customize", then select the SpamBayes toolbar and click "Delete".

3.17 Is it OK to delete the messages in the Spam folder?

If SpamBayes identifies a message correctly then it is fine to delete it. After a message is identified, SpamBayes only uses the message if you need to train an unsure or incorrect identification.

It is also safe to delete any messages that you have already trained on using the "Delete as Spam" and "Recover from Spam" buttons. All the information that SpamBayes needs to know about trained messages is stored in a separate file.

The information in the non-Outlook version of this question also applies, and may be of interest.

4 Using SpamBayes

4.1 Does SpamBayes work with non-English languages?

SpamBayes was developed by English-speaking people and has therefore had very little testing with other languages. There are some anecdotal reports that it doesn't work as well with Western European languages. It might work very well with them if these default values are changed in the user's ini file (note that for Outlook users, this means the default_bayes_customize.ini file, rather than the one called Outlook.ini, or named after your profile):

    [Tokenizer]
    replace_nonascii_chars: True
    skip_max_word_size: 12

The first setting causes all non-ASCII characters to be replaced by a question mark. For non-English languages the setting should probably be False. The second setting causes all words longer than 12 characters to yield a "skip: X NNN" token instead of the word itself, where X is the first letter of the word and NNN is the word length. For languages like German, this can be especially troublesome, because an inordinate number of words will yield tokens like "skip: ? 17" because they are long and start with an accented character.

Asian languages will be particularly troublesome. The SpamBayes tokenizer splits the message into whitespace-separated tokens. (Many?/Most?/All?) Asian languages don't separate "words" with whitespace, so the entire body of a message will generate little other than "skip: ? NNN" tokens.

4.2 How do I train SpamBayes (web method)?

Follow the "Review messages" link and you'll see a list of the emails that the system has seen so far. Check the appropriate boxes and hit Train. The messages disappear and if you go back to the home page you'll see that the "Total emails trained" has increased.

Once you've done this on a few spams and a few hams, you'll find that the X-Spambayes-Classification header is getting it right most of the time.
The more you train it the more accurate it gets, but note that you should try to train it on about the same number of spams as hams. The SpamBayes wiki has some information about training that you may wish to read.

You can train it on lots of messages in one go by either using the sb_filter script as explained in the "Command-line training" section, or by giving messages to the web interface via the "Train" form on the Home page. You can train on individual messages (which is tedious), using mbox files or using Outlook Express dbx files.

4.3 How do I train SpamBayes (forward/bounce method)?

Alternatively, when you receive an incorrectly classified message, you can forward it to the SMTP proxy for training. If the message should have been classified as spam, forward or bounce the message to spambayes_spam@localhost, and if the message should have been classified as ham, forward it to spambayes_ham@localhost. You can still review the training through the web interface, if you wish to do so.

You should ensure that the "lookup message in cache" option is set to True/Yes before you use this.

Note that some mail clients (particularly Outlook Express) do not forward all headers when you bounce, forward or redirect mail. We do not recommend using the SMTP proxy with these clients.

4.4 How do I train SpamBayes (command line method)?

Given a pair of Unix mailbox format files (each message starts with a line which begins with 'From '), one containing nothing but spam and the other containing nothing but ham, you can train SpamBayes using a command like:

    python sb_mboxtrain.py -g ~/tmp/newham -s ~/tmp/newspam

The above command is OS-centric (e.g., UNIX, or Windows command prompt). You can also use the web interface for training as detailed above.

4.5 How do I train SpamBayes (Outlook plugin)?

Instructions about training the Outlook plugin can be found in the documentation for the plugin, and the 'Configuration Wizard' will attempt to guide you through an initial training process.

Basically what you need to do is move as much spam as you have into your spam folder, tell the plugin which folder that is and which folders contain examples of ham, and it will do the rest.

The plugin does not train on all incoming mail. However, if you use the "Delete as spam" and "Recover from spam" buttons, those messages will be (re)trained as necessary. If you have set it to use incremental training then it will also train on messages which are manually moved into the spam folder and those folders that you are 'watching'.

4.6 Do I need to keep spam after it has been trained? If so, for how long?

Once a message has been [correctly] trained there is no need to keep it around. However, SpamBayes' accuracy is dependent upon having a "sufficient" sample from which to make its decisions. Therefore, most users retain a fair amount of spam in the event that they may wish to rebuild the corpus from scratch.

Of course, this begs the question: "how much is enough?" That is where the "art" of SpamBayes meets the science. Some users keep as many as several thousand [recent] spam (as well as a similar number of ham). That is not to say that you won't have excellent results with a tenth (or less) of that number; since everyone's e-mail profile is different, the requirements for training are as well.

4.7 Why did SpamBayes mark this obvious spam "unsure"?

It may be obvious to you that the message is spam, but the classifier only works on the information it has been given. Maybe this is "new" (you've never seen this particular flavor of spam before), or maybe there aren't enough clues in the message which the system is aware of as strong spam clues.

You should look at the clues that SpamBayes generated, and that should give you an idea of the reason for the classification. Both the web interface and the Outlook plug-in let you view the clues that make up the message.
If you still can't figure out the reason why, you can ask the mailing list for advice - but make sure you include the spam clues/tokens listing in your message!

4.8 OK, I trained on that message, but it still thinks it's unsure.

It didn't, but you may need to train on a few more of this type of message to get it classified as "spam". The classification algorithm weights its results based on the number of times it has seen a particular clue, so that clues unique to this type of message may need a few more instances to become "convincing".

4.9 SpamBayes doesn't seem to catch much spam. What gives?

Initially, SpamBayes will not be able to distinguish spams from hams. With no training inputs, the classifier will simply mark everything unsure. Once you start training the classifier on a representative set of spams and hams it should very quickly begin to improve, however. If that's not the case, perhaps you have something misconfigured. Here are a couple things to check:

4.10 How do I start from scratch after messing up my training?

If you're using the Outlook plug-in, you can simply use the "Training" tab of the SpamBayes Manager, and tick the "Rebuild entire database" box.

Otherwise, because training from scratch is a very rare occurrence, and as deleting all your training information is something you don't want to do by accident, there isn't an option for this. However, you can quite simply do this manually. All the training data is stored in a file, usually called hammie.db, and if you delete (or rename) this, then you will start training from scratch. If you are using the web interface for the POP3 proxy, the configuration page tells you what this file is called (and where it is) down towards the bottom of the page.

4.11 How do I configure SpamBayes?

To configure the Outlook plugin, you should choose SpamBayes Manager from the SpamBayes button on the SpamBayes toolbar.

If you use the POP3 proxy or IMAP filter, then simply open a browser window to http://localhost:8880, click on the configuration link on the top right of the page that opens up, and fill in the details. With the POP3 proxy, when you need to select local port numbers to proxy on, if you are only proxying one server, then try 110 first. If that doesn't work, or you are proxying multiple servers, then try higher numbers, such as 8110, 8111, 8112, and so on.

If you're using the POP3 proxy, you'll also need to configure your email client to talk to the proxies instead of the real email servers. Change your equivalent of pop3.example.com to localhost (or to the name of the machine you're running the proxy on) in your email client's setup, and do the same with your equivalent of smtp.example.com.

Hit "Get new email" and look at the headers of the emails (send yourself an email if you don't have any!) - there should be an X-Spambayes-Classification header there. It probably says "unsure", if you haven't done any training yet.
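
If your mail client makes it awkward to view raw headers, the check just described can also be done with a few lines of Python on a saved copy of a message. This is a minimal sketch using the standard email package. The function name and the sample message are made up for illustration, and the split on ";" is a defensive assumption in case extra detail follows the classification word in the header value.

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"


def classification(raw_message):
    """Return the classification header value in lower case, or None.

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(raw_message)
    value = msg.get(HEADER)
    if value is None:
        # The message did not pass through SpamBayes (or the header was stripped).
        return None
    # Assumption: keep only the text before any ';' in the header value.
    return value.split(";")[0].strip().lower()


# A made-up message, as it might look after passing through the proxy untrained.
SAMPLE = (
    "From: someone@example.com\n"
    "To: you@example.com\n"
    "Subject: test\n"
    "X-Spambayes-Classification: unsure\n"
    "\n"
    "Hello.\n"
)

if __name__ == "__main__":
    print(classification(SAMPLE))
```

A result of None suggests the mail client is still talking to the real server rather than to the proxy.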
You should be able to create a mail folder called "Suspected spam" and set up a filtering rule that puts emails with an "X-Spambayes-Classification: spam" header into that folder.

Otherwise, the system is configured through a file called bayescustomize.ini or .spambayesrc. In here you can configure the name and type of your database, your ham and spam cutoffs, and so on. The default values for all the options, and the documentation for them, live in Options.py. To change an option, create a bayescustomize.ini and add the option to that - don't edit Options.py. This is in the 'standard' ini file format (originally created for Windows 3.1, I believe). You can find documentation on this format in the ConfigParser docs, but basically, it's just a text file: lines beginning with # are comments, sections start with a line like "[Section Name]", and options are set out within the appropriate section with lines like "opt = val" or "opt: val" (either is ok). Whitespace other than line endings is for the most part ignored, so you can make it look like whatever you like. You can see what a configuration file of all the defaults would look like if you execute the following Python command (the wrapping here is just for display - this should all be a single line):

python -c "from spambayes.Options import options ; print options.display()"

4.12 Now I know what the format looks like, but what options do I need to set?

This depends on exactly what you want to do, and which application you are intending to use. The easiest thing is to execute the following Python command (the wrapping here is just for display - this should all be a single line):

python -c "from spambayes.Options import options ; print options.display_full()"

This will print out a complete list of the options, including a description of each option and its default value.
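To make the ini format itself a little more concrete, here is a rough sketch that parses a small sample file with the configparser module from a current Python standard library. It is not how SpamBayes loads its options, and the sample values are made up; the option names are the ones this FAQ mentions elsewhere, and the html_ui section is assumed from the "-o html_ui:port:8881" switch.

```python
import configparser

# A made-up sample in the same format as bayescustomize.ini.
SAMPLE = """\
# Lines beginning with # are comments.
[Storage]
persistent_storage_file: hammie.db

[html_ui]
port = 8881
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# "opt = val" and "opt: val" are both accepted.
storage_file = parser.get("Storage", "persistent_storage_file")
ui_port = parser.getint("html_ui", "port")
print(storage_file, ui_port)
```

Anything you leave out of your own file simply keeps its default value.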
You can also look up a single section, if you know its name: python -c "from spambayes.Options import options ; print options.display_full('section_name')" Or just a single option: python -c "from spambayes.Options import options ; print options.display_full('section_name', 'option_name')" If you want a list of all the sections, you can use this command: python -c "from spambayes.Options import options ; print options.sections()" If you want a list of all the options, you can use this command: python -c "from spambayes.Options import options ; print options.options(prepend_section_name=False)" 4.13 Why is SpamBayes ignoring my configuration file?SpamBayes looks for your configuration file in three places - if it can't find it, then, obviously, your options will not be loaded. The first place that SpamBayes checks is the environment variable BAYESCUSTOMIZE. You can set this to the path of your configuration file, wherever it is, and it will be loaded. You can also specify more than one file, separated by the appropriate path separator for your platform. This is the recommended method of specifying the location of the file, unless you do so via a user interface (as provided by the POP3 proxy, the Outlook plugin, and the IMAP filter). If SpamBayes doesn't find anything in the BAYESCUSTOMIZE variable, then it checks the current working directory and your home directory for a bayescustomize.ini or .spambayesrc file (respectively). 4.14 Why don't short words or long words show up in the clues?Words less than 3 characters long are skipped, and words greater than 12 characters long are converted into a special 'long-word' token. These numbers (3 and 12) were determined by brute force testing, and produced the best overall results (including compared to no upper or lower limits). 4.15 I'm not a programmer, but want to help out - what can I do?Fantastic! There are four main ways to contribute (including programming):
Sadly, not much is done in the way of testing these days. Hopefully this will change, though, and if you're interested it's definitely an option. Check out the README-DEVEL for information about how to get started. This is the way to go if you have a new idea, too - even if you convince someone else to develop it, we'll expect you to put in time to test its effectiveness. Support is always helpful - especially since it can often save the developers from answering questions, which leaves more time to add features/remove bugs. Which leaves documentation - this is always in need of work. Take a look at what there is and see what you can improve, or ask on the list for advice about which files need updating most urgently. If you want to contribute to this, the easiest thing is to work on a current copy of the documentation (unless it's a new piece) and then submit it to the list or via the sourceforge patch system. One of the developers will go over it and check it in (although please be patient - sometimes it may take a while for them to have the time to go through it). 4.16 Is there anything else I should know?While SpamBayes does an excellent job of classifying incoming mail, it is only as good as the data on which it was trained. Here are some tips to help you create a good training set:
4.17 Can SpamBayes be used to perform n-way classification?

In theory, yes it can, though this has not yet been tried. There are a couple of other tools, POPFile and CRM114, that support it. A demonstration script which performs n-way classification is also in the contrib directory of the SpamBayes source.

4.18 How do I use a pickle for storage?

If you don't want to use one of the dbm methods for storage (if you only have dumbdbm, for example), or one of the SQL methods, you can use a pickle (of a giant in-memory Python dict). A pickle will be relatively small, but slower, compared to using one of the dbm storage methods.

To use a pickle, set the option "persistent_use_database" to False in your configuration file, in the section "Storage". You may also wish to change the name of the storage file (to end with "pck", for example), but this is not necessary - to do so, change the "persistent_storage_file" option (also in the "Storage" section). If you specify your database on the command line ("sb_server.py -d hammie.db", for example), then you should use the "-p" switch instead.

Note that if you have an existing database which is not a pickle, you cannot keep using it - this will cause errors. You need to either retrain from scratch, or use the sb_dbexpimp.py script to convert it to a pickle.

4.19 How can I access the web interface from a different machine than the one it is running on?

By default, the web interface rejects browser access unless the browser is running on the same machine as the interface - if it were open to anyone, then people who came across your machine would be able to change your settings and possibly read your mail. In some cases, however, you might want to open access up - if you use more than one machine to process mail, for example. You can specify IP addresses or ranges that you want to be allowed access (two or three machines, for example), via the web configuration. The option you are after is called Allowed remote connections.
You can also set the interface to use HTTP-AUTH, either Basic or Digest. 4.20 I've just installed SpamBayes, but when I run it I get an access denied error.If you use sb_server or sb_imapfilter, haven't set anything up, and run it and get an error that ends with socket.error: (10013, 'Permission denied'), then this probably means that port 8880, which SpamBayes is trying to use to present the web interface, is already taken on your machine. Try using sb_server.py -u 8881 -b (or sb_imapfilter.py -o html_ui:port:8881 -b), or another port that you know is free and available on your machine. 4.21 How do I set up SpamBayes and Outlook Express?
Everything should now be set up. Try doing a send/receive - mail should arrive as normal, but any mail that SpamBayes is unsure about will have 'unsure,' (1.0.x) or 'unsure@spambayes.invalid' (1.1.x) in the recipient list, and any mail that SpamBayes thinks is spam will have 'spam,' (1.0.x) or 'spam@spambayes.invalid' (1.1.x) in the recipient list. You can use Outlook Express's Rules Wizard to create rules that automatically move these messages to other folders, for example:
Mail will now be split between your Inbox, the Possible Junk folder, and the Junk Mail folder, depending on how it was classified. You do training by double-clicking the envelope icon and filling out the review page that opens.

Note that there is a flaw in this method: if you get mail from someone who has "unsure" or "spam" in their email address, those messages will also be moved. (The problem comes about because Outlook Express is so limited in the filtering it can do.) There is a way to work around this, so ask the mailing list if it's a problem. When the final 1.1 release is made, it will avoid this problem (instead of just "unsure", it adds "unsure@spambayes.invalid").

If you have any more queries, please look through the rest of this FAQ, and if you can't find the answer, ask the mailing list.

5 Known Problems & Workarounds

5.1 My database keeps getting corrupted.

Despite the efforts of the developers, there are still occasional problems with database corruption. Known potential causes include:
2. Interrupting SpamBayes in the midst of training (through a program or machine crash, for example).

If you experience consistent corruption, or can provide a set of steps that will consistently cause the database to be corrupted, please email the mailing list, describing your situation. Otherwise, you should simply retrain from scratch.

You may wish to change to an alternative database system to try and avoid these problems. If you are not sure which database systems you have available, and/or which one you are currently using, there is a script in the utilities folder called which_database.py that will display this information (Windows users should run it from a command prompt). Note that users of the pop3proxy_service cannot currently use which_database.py.

5.2 I get a "DBRunRecoveryError" message.

If you get a message that looks like:

DBRunRecoveryError: (-30982, 'DB_RUNRECOVERY: Fatal error, run database recovery -- fatal region error detected; run recovery')

this, sadly, means that your training database is corrupted, and you have no choice but to delete it and train again from scratch. We don't know what causes this to happen, but we are trying to fix it. If you find it happens reliably for you (i.e. the problem always comes back after deleting the database and retraining), please post a message saying as much to spambayes@python.org - no-one has yet found a case where we can reliably reproduce the problem, so tracking it down is proving very difficult.

Note that the "database recovery" that you are told to run does not apply. This is a message provided by the underlying bsddb database system, and the recovery it refers to cannot be used in this case.

If you don't want to risk it happening again, switch to using the pickle storage (web interface: Configuration / Advanced Configuration / Storage Options / Use database for storage: No) instead.
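If you configure SpamBayes through a file rather than the web interface, the equivalent change (going by the option names given in the pickle storage question, 4.18) would look something like the fragment below. Treat it as a sketch: the file name hammie.pck is only an example, and renaming the storage file is optional.

```ini
# bayescustomize.ini (or .spambayesrc)
[Storage]
# Use a pickle instead of a dbm database.
persistent_use_database = False
# Optional: rename the storage file so it is obviously a pickle.
persistent_storage_file = hammie.pck
```

Remember that an existing non-pickle database cannot simply be reused after this change; retrain from scratch or convert it with sb_dbexpimp.py.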
Note that users of the 1.0.x Outlook plug-in can only change their database type if they are running from source - however corruption is extremely rare with the plug-in, so this should not be a problem. Note, however, that there are two issues with this:
You may also wish to read the my database is corrupted FAQ. 5.3 Why does the spambayes@python.org mailing list get spam?A reasonably small amount of messages posted to spambayes@python.org are spam, bounce messages, out-of-office messages, and so on. People often wonder (and ask) why we don't filter these out, or require all posters to be subscribers to the mailing list, or moderate the list. No filtering is done because spam is often discussed on the list. For example, you can send in a spam message that you received, with the clues that it generated, and ask why it scored what it did. Or you might want to give an example of a new technique that spammers are using, and ask how well we think that SpamBayes will handle it. It's very difficult to distinguish between this mail and legitimate mail, and we don't want any false positives (this is a small amount of traffic, after all). We don't require all posters to be subscribers because this is hard on one-time posters, who probably start three-quarters of all threads, and account for about half the messages. These people often want help with setting up SpamBayes, and are sometimes not particularly comfortable with computers, so the fewer loops to jump through to get help, the better. It would also defeat the "submit bug report" ability of sb_server (which ensures that we get enough information), which is also likely to appear in the Outlook plug-in. We don't moderate the list, because we're too busy to give timely answers to everyone as it is, let alone continue developing and testing. If you understand the time commitment involved in moderation, and can do it 24 hours a day, 7 days a week (we're an international bunch around here), then please let us know, and we can give this a shot. None of this should bother you, though. Why? Because if you set SpamBayes up, and train it appropriately, you'll see almost none of these messages. 
Just about all the developers do this, and we don't see any of these messages (often the first time is when someone quotes it asking why it got to the list), unless we're checking through caught spam. There's a higher than normal chance of a false positive (those messages about spam), but there's a reasonable chance you don't care about missing those anyway, and, at least in our experience, SpamBayes does a pretty good job at correctly classifying those too (given good training).

5.4 I already have a POP3 proxy so how can I use sb_server?

The solution here is to chain the proxies together. SpamBayes (sb_server) doesn't really care where in the chain it is, although some of the other proxies (often anti-virus software like Norton Anti-Virus or AVG) sometimes do.

The easiest solution is to leave your other proxy set up exactly as it was before SpamBayes. Then look in your email client to see what port it is using (it'll probably be connecting to "localhost" or "127.0.0.1"), and set SpamBayes to collect mail from localhost on that port, rather than from your mail server, and forward to localhost (on any free port). Finally, point your email client at the port SpamBayes forwards to. This means that mail arrives at your mail server, then goes through your other proxy, then through SpamBayes, then arrives at your mail client. This has been found to work with AVG, for example.

Some proxies, however, may force your mail client (e.g. Outlook Express) to get mail from a particular place (IIRC, some flavours of Norton do this). In this case, you need to leave your mail client set up as it is, and change the settings of your other proxy instead. So get the proxy to get mail from localhost (on any free port) and have SpamBayes get mail from the mail server and forward to localhost (on the port you set up the other proxy to collect from). This means that mail arrives at your mail server, then goes through SpamBayes, then through your other proxy, then arrives at your mail client.

You should be able to chain more than two proxies together with a similar process, if necessary.

How do you know which one you should use?
Trial and error, basically. I'd suggest trying the first solution first, as it is the most straightforward, but if you find that your mail client keeps 'magically' changing back to the original settings, you'll probably need to use the second one. If you can't manage to get either one working, be sure to email the mailing list asking for help - with as many details as possible, including what you have already tried.

5.5 When viewing the sb_server review page, my computer becomes unresponsive.

Are you using ZoneAlarm? There is a known issue with version 5.x of ZoneAlarm (specifically its TrueVector service) that causes problems with the SpamBayes review page. We're told that the ZoneAlarm technical support people recommend a clean uninstall of ZoneAlarm and reverting to version 4.5.594.000 of ZoneAlarm, which works fine with SpamBayes. We're also told that ZoneAlarm are aware of the issue with their 5.x release and were working on it, but that it might be the next release before they have it solved.

5.6 I get an error message "No filterable messages are selected".

This applies to the Outlook plug-in only. SpamBayes only lets you train on messages that have been received (these are the only messages that should be trained on). This means that you cannot train on sent messages, drafts, notes, calendar items, tasks, and so on.

To check whether a message has been received, SpamBayes checks some of the Outlook properties for the message. Very rarely, these checks give the wrong answer: the message has been received, but SpamBayes does not believe it has. The best move here is to simply move the message yourself. If this is a recurring problem, please add comments to the appropriate SourceForge tracker.

Note that one cause of this problem is that with some versions of Outlook and Outlook Express, moving mail from Outlook Express to Outlook will strip the mail of all Internet headers, which means the messages cannot be filtered or trained on.
However, this is not a problem with SpamBayes - you can either work around the export/import problem, or simply not use those messages for training (we do not recommend pre-training in bulk, in any case).

5.7 Messages don't move after clicking until I change folder

This applies to users of the Outlook plug-in and Exchange: what you will see is that if you click the "Delete As Spam" or "Recover From Spam" button the message won't move until you refresh the Outlook display in some manner (change folder, select another item, etc). Actually, the message is moved immediately - it's the Outlook display that doesn't get updated, so it just looks like the message hasn't moved.

We believe that this is an Outlook problem that we are triggering, but we do not know of any general way to work around it. Please let us know if you have more information than we do! If you are using Exchange 2003 and Outlook 2003 you can enable 'cached mode' and that typically fixes it.

We suspect that this applies to particular combinations of Outlook and Exchange versions. It seems to be newer versions of Outlook (earliest seen so far is Outlook 2002 SP2) that are affected. For the moment, it appears that this is a problem with Outlook, and that the most likely solution will be in an update to Outlook from Microsoft.

Please don't report this as a new bug. However, if you have any additional information about this situation, we would love to hear it.

5.8 After installing SpamBayes, Outlook crashes and then asks for the plug-in to be disabled.

Are you using an Athlon 64 or Core 2 Duo with DEP? There are issues with DEP and Outlook with a SpamBayes-based plug-in. Listing Outlook as a safe application on these processors should "solve" the problem.

6 Development

6.1 Why don't you implement cool tokenizer trick X?

Have you run your tokenizer trick against a large set of test messages to see if it actually works? Many times what seems like a good idea turns out not to help much, and sometimes even hurts.
If you have a good idea, you've run it against a batch of messages and can prove that it helps, paste the code for your technique and the proof to the mailing list. If you're not a coder, but are really keen on your idea, post a feature request on the project page, and wait for someone else to code it for you (but make sure you do some testing when it's done). Otherwise, you will likely get a message from Tim Peters about why you need to test your idea :) Note that as a general rule, we've found that with the tokenizer, "stupid beats smart" - that is, very specialized tokenizer behavior usually produces worse results than a more general approach that just generates tokens and throws them at the classifier. See also the file NEWTRICKS.txt in the source distribution - we're filing neat ideas here, and also check out the wiki. If you're interested in trying out other people's cool ideas, as well as your own, then check out the current experimental options (these start with "x-", and are available via the web interface on the Experimental Configuration page) and give us some feedback about how they work for you. 6.2 Are there plans to develop a server-side SpamBayes solution?The problem with a server-side solution is that everyone has a different idea of what is spam - that's the whole strength of the bayesian-style filtering concept. If you are certain that all of your users would agree on what is spam and what is not, then this might work for you, but otherwise you really have to have individual databases for each user. Either way, you should be able to modify SpamBayes easily enough to fit into your setup. Some people have in fact done this and have been kind enough to donate notes about how they have gone about it. If you also do this but in some other way, please let us know so that we can add to the information. 6.3 Forget tokenizing words - you should use character n-grams!This was quite carefully tested. 
Character 3-grams gave five times as many false positives, and twice as many false negatives, as splitting on whitespace (words). Character 5-grams came fairly close to words with false positives, but the number of false negatives was worse than with 3-grams. N-grams also create many more unique tokens, which means much slower operation. In addition, it's much harder to figure out why a message scored as it did with n-grams. On the other hand, words are easy to understand.

There was, however, one area where n-grams were much better: detecting spam in Asian languages. Since a 'word' in an Asian language message ends up being an entire line, words don't work very well at all.

6.4 Why do you force all tokens into lower case?

This was very carefully considered. Folding letters to lower case does hide information (and we're not really sure what it does to non-English languages), but on the plus side, it reduces the size of the database. In the end, testing with case folding resulted in no change in the false positive rate, and a small reduction in the false negative rate, so that's what we do. There is one exception: we retain case in subject lines, because testing showed an improvement if we did that.

6.5 Why can't I bounce spam back to the sender?

Most spammers these days don't accept incoming email, or (worse) forge the From and sender addresses, so it's unlikely that bouncing would do any good, and it may well do some innocent party much harm.

6.6 Why don't you add whitelisting/blacklisting to SpamBayes?

The main reason is that for the most part SpamBayes doesn't need it! As long as you keep training on unsure or mis-classified mail, SpamBayes will learn what you consider good mail without needing any specific lists. In addition, tokens are generated from email addresses, so an automatic 'whitelist' (of sorts) is generated, as is a similar blacklist.
Whitelists and blacklists are problematic anyway, because 'spoofing' (pretending you are someone else) is reasonably simple, and also very common. So, more often than not, they'll lead to incorrect results. However, there are some commercial products based on SpamBayes that offer whitelisting - see the related page for more information. Also, blacklisting is really a server side responsibility. SpamBayes is a content filter - it looks at what is inside the "envelope". Blacklisting, DNS based spam handling like rejecting mail without valid origin or from a known spam source is really the job of the mail server. In an ideal environment such mail will be rejected before it reaches you as it deals with what's "written on" the "envelope". Applying content based filtering on the server is complex, as everyone's feeling about content differs - this is why it is a client end role that tools like SpamBayes fill. If you really need whitelisting, consider implementing rules in your mailer to intercept the messages before they're passed to SpamBayes. Open Source software is developed by people scratching their own itches (and it can't be otherwise, since nobody is paid to endure things they don't want to do). While few, if any, of the SpamBayes developers would object to adding whitelist gimmicks, none of the existing developers have an interest in implementing them, typically because they'd be a net loss for them (and so wouldn't use them). Developers may have a different view of this than most users - because they work on open source software, they know a lot of other open source developers, and their email addresses are all over the web. As a result, they get a lot of spam, and especially viruses, claiming to be sent from people they know (including direct coworkers, bosses, and the company president). We even get viruses and spam claiming to come from ourselves (but don't remember sending them <wink>). 
So what SpamBayes does now is exactly right for us: the person a thing claims to come from is a clue, but just one clue, and is tossed in the pot with all the other clues. If someone contributed code to do it, it would probably get added. Note that there are increasing problems trying to access the address book, because every Outlook service pack makes that harder to do (accessing the Windows address book is most associated with viruses). There are three main things that need to be done:
If you are interested in implementing whitelisting, take a look at some comments from Mark.

6.7 What do I need to do to update the FAQ?

If you're not a SpamBayes developer, simply send your corrections or proposed questions and answers to the SpamBayes developers mailing list. If you are a developer, you need a recent version of Docutils, and the tools/html.py script from that distribution must be in a directory on your PATH.