Filtering messages using the Bayes algorithm

Author: Pablo Fischer
Date: July 30 2005

What is it about?

Nowadays it is very common to receive tons of messages in our mailboxes, and sadly 50% (or more) of these messages are spam. There are currently many options for filtering spam in mailboxes, and most of these funny scripts are based on probability algorithms; the most common of them is the Bayesian algorithm.

But do we have these options/scripts in our blogs? Blogs? Yes: nowadays spammers attack our comments, selling us cheap viagra, offering to show us Britney Spears nude, and sending many other messages of that kind.

That’s why I decided to search Google for Bayesian scripts written in PHP. I’ve been testing a class that Loïc d'Anterroches wrote. Now that I’m sure it works, I’m implementing it in Jaws.

How to mix PHPNaiveBayesianFilter with Jaws?

First of all, let me describe the SQL schema (idea) of the Bayes implementation in Jaws. It consists of four tables:

  • Bayes_Categories: This will hold two categories: Spam and NoSpam.
  • Bayes_References: This will hold the complete text of each message for further classification. The messages here can be deleted later to save space in the database.
  • Bayes_WordFreqs: This will hold the most common words (if not all of them) of each message that has been added to the references, together with their classification (spam or nospam).
  • Bayes_SourceReferences: This is the table that lets Jaws look up the classification of a message quickly. It will hold the reference source name and id (for example, Comments and 32) and also the id of the matching Bayes_References entry, so that later one can classify the message and know whether it is spam or not. How this is going to work is explained below.
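
To make the idea a bit more concrete, here is a rough sketch of how the four tables could be laid out. Only the table names come from this proposal; every column name and type is my own assumption for illustration, and the real Jaws schema may end up looking quite different.

```php
<?php
// Hypothetical column layout for the four tables. Only the table names
// come from the proposal; all columns and types are assumptions.
$schema = [
    'Bayes_Categories' => [
        'id'   => 'INTEGER PRIMARY KEY',
        'name' => 'VARCHAR(32) NOT NULL',         // 'spam' or 'nospam'
    ],
    'Bayes_References' => [
        'id'      => 'INTEGER PRIMARY KEY',
        'content' => 'TEXT NOT NULL',             // full message text, may be purged later
    ],
    'Bayes_WordFreqs' => [
        'word'        => 'VARCHAR(64) NOT NULL',
        'category_id' => 'INTEGER NOT NULL',      // points at Bayes_Categories.id
        'freq'        => 'INTEGER NOT NULL',      // times the word was seen in that category
    ],
    'Bayes_SourceReferences' => [
        'source_name'  => 'VARCHAR(64) NOT NULL', // e.g. 'Comments'
        'source_id'    => 'INTEGER NOT NULL',     // e.g. 32
        'reference_id' => 'INTEGER',              // Bayes_References.id, NULL once purged
        'category_id'  => 'INTEGER NOT NULL',     // current classification
    ],
];

// Turn one table description into a CREATE TABLE statement.
function createTableSql(string $table, array $columns): string
{
    $defs = [];
    foreach ($columns as $name => $type) {
        $defs[] = "  $name $type";
    }
    return "CREATE TABLE $table (\n" . implode(",\n", $defs) . "\n);";
}

foreach ($schema as $table => $columns) {
    echo createTableSql($table, $columns), "\n\n";
}
```

Keeping the classification in Bayes_SourceReferences (and not only next to the stored text) is what would allow the reference messages to be deleted later without losing the spam/nospam verdict.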

How is it going to work?

I think the best way to present a proposal is to explain a problem and how Jaws is going to solve it:

You receive in your Blog a comment that offers you cheap \/ | /\ 4 R /\. 

How is it going to be fixed? Easy! But first, remember that in 0.6 (under development right now) comments (for Blog, for example) are managed by a new API named JawsCommentsAPI, which deals with the relations between entries (of Blog, for example) and their comment entries.

So, here is how we are going to fix it:

  1. First we save the comment as we normally do: an INSERT.
  2. If the query was successful, we get the id of the new comment and build the BayesSpam object.
  3. Once we have the entry id and the BayesSpam object, we only need to ‘add it’: $bayes->Add($message, 'Comments', $id);
  4. BayesSpam then takes the message, the source name (Comments) and the id (suppose it is #32), adds the reference message to Bayes_References and runs the algorithm to guess whether $message is spam or nospam.
  5. Once it knows the classification, it adds a new entry to Bayes_SourceReferences, saving the reference text id, the source name (Comments), the source id (#32) and the classification (spam or nospam).
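
The steps above can be sketched in a few lines of PHP. This is only an in-memory toy, assuming plain arrays in place of the Bayes_* tables, a very naive tokenizer, and a simple naive Bayes score with add-one smoothing and equal priors. It is not the PHPNaiveBayesianFilter API, and apart from Add() every method and property name here is made up for the example.

```php
<?php
// In-memory sketch of the Add() flow; arrays stand in for the Bayes_* tables.
class BayesSpam
{
    public $references = [];        // Bayes_References: reference id => message text
    public $sourceReferences = [];  // Bayes_SourceReferences rows
    public $wordFreqs = ['spam' => [], 'nospam' => []]; // Bayes_WordFreqs

    // Split a message into lowercase words (a deliberately naive tokenizer).
    public static function tokenize(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Record word counts for a message whose category is already known.
    public function train(string $text, string $category): void
    {
        foreach (self::tokenize($text) as $word) {
            $this->wordFreqs[$category][$word] = ($this->wordFreqs[$category][$word] ?? 0) + 1;
        }
    }

    // Naive Bayes guess with add-one smoothing and equal priors. The extra
    // +1 in the denominator keeps it non-zero before any training.
    public function classify(string $text): string
    {
        $vocab = count($this->wordFreqs['spam'] + $this->wordFreqs['nospam']);
        $scores = [];
        foreach ($this->wordFreqs as $category => $freqs) {
            $total = array_sum($freqs);
            $scores[$category] = 0.0;
            foreach (self::tokenize($text) as $word) {
                $scores[$category] += log((($freqs[$word] ?? 0) + 1) / ($total + $vocab + 1));
            }
        }
        return $scores['spam'] > $scores['nospam'] ? 'spam' : 'nospam';
    }

    // Steps 4 and 5: store the reference, classify it, link it to its source.
    public function Add(string $message, string $sourceName, int $sourceId): string
    {
        $referenceId = count($this->references) + 1;
        $this->references[$referenceId] = $message;
        $category = $this->classify($message);
        $this->sourceReferences[] = [
            'reference_id' => $referenceId,
            'source_name'  => $sourceName,
            'source_id'    => $sourceId,
            'category'     => $category,
        ];
        return $category;
    }
}

$bayes = new BayesSpam();
$bayes->train('buy cheap viagra now', 'spam');
$bayes->train('great post thanks for sharing', 'nospam');

$id = 32; // pretend this is the id returned by the comment INSERT
echo $bayes->Add('cheap viagra offers', 'Comments', $id), "\n"; // prints "spam"
```

With so little training data the guess is obviously fragile; the point is only to show where the reference text, the classification and the source link end up.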

Using it sounds great, but how can I train it?

We know that BayesSpam will sometimes classify your messages as spam or nospam wrongly; that’s why Jaws will have a UI (User Interface) to train it and decide which messages are spam and which are not. But remember that over time this training will become less and less necessary.

Also, the ‘history’ messages (Bayes_References) can be removed regularly once you have classified your messages, or automatically (once you trust Bayes). Why can they be removed? Because we only need to know which ‘words’ are classified as spam and which are not, so keeping the messages themselves is not necessary.
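
Here is a small sketch of why the stored text can go away after training. It assumes the word counts live in a simple array (standing in for Bayes_WordFreqs) and that correcting a wrong guess means moving the message's words from one category to the other; the function name updateWordFreqs() is hypothetical.

```php
<?php
// Word counts per category stand in for Bayes_WordFreqs.
$wordFreqs  = ['spam' => [], 'nospam' => []];
// Stored message texts stand in for Bayes_References.
$references = [1 => 'win a free prize today'];

// Add ($delta = +1) or remove ($delta = -1) a message's words in one category.
function updateWordFreqs(array &$wordFreqs, string $text, string $category, int $delta): void
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $count = ($wordFreqs[$category][$word] ?? 0) + $delta;
        if ($count > 0) {
            $wordFreqs[$category][$word] = $count;
        } else {
            unset($wordFreqs[$category][$word]); // drop words that fall back to zero
        }
    }
}

// The filter's first (wrong) guess filed the message under nospam.
updateWordFreqs($wordFreqs, $references[1], 'nospam', +1);

// The user corrects it in the training UI: move the words over to spam.
updateWordFreqs($wordFreqs, $references[1], 'nospam', -1);
updateWordFreqs($wordFreqs, $references[1], 'spam', +1);

// The word counts now carry everything the filter needs, so the text can go.
unset($references[1]);

print_r($wordFreqs['spam']);
```

Note that a correction needs the original text one last time, so in this sketch a message should only be deleted after you are happy with its classification.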

But how can I know if a comment is classified as spam?

Remember that Bayes will have the Bayes_SourceReferences table, which holds the classification of the reference message along with the source name and id, so to know whether a comment is spam you can do:

foreach ($comments as $comment) {
   // Assumes each $comment is an array that carries its own 'id'.
   if (!$bayes->isSpam('Comments', $comment['id'])) {
       // Because this is not spam, we can print it :-)
   }
}
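
Behind that call, isSpam() would only need a lookup by source name and id. A minimal sketch, assuming the rows of Bayes_SourceReferences are available as an array with made-up column names, and written as a plain function so it runs on its own:

```php
<?php
// Rows as they might look in Bayes_SourceReferences (column names assumed).
$sourceReferences = [
    ['source_name' => 'Comments', 'source_id' => 31, 'category' => 'nospam'],
    ['source_name' => 'Comments', 'source_id' => 32, 'category' => 'spam'],
];

// True when the given source entry is classified as spam.
// Unknown entries are treated as not spam here, which is a policy choice.
function isSpam(array $sourceReferences, string $sourceName, int $sourceId): bool
{
    foreach ($sourceReferences as $row) {
        if ($row['source_name'] === $sourceName && $row['source_id'] === $sourceId) {
            return $row['category'] === 'spam';
        }
    }
    return false;
}

$comments = [
    ['id' => 31, 'text' => 'Nice post!'],
    ['id' => 32, 'text' => 'cheap pills'],
];
foreach ($comments as $comment) {
    if (!isSpam($sourceReferences, 'Comments', $comment['id'])) {
        echo $comment['text'], "\n"; // only "Nice post!" is printed
    }
}
```

In a real database this would be a single SELECT on source name and id, so no message text has to be read or re-classified when a page is displayed.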

What’s next?

  • Trust levels: You can configure BayesSpam to set the lifetime of messages, so you can ‘delete’ all your messages under Bayes_References when you have 20, 30… n new messages in the queue.
  • Mix Bayes with networking so you can block comments by hostname, IP, netmask, etc.
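
The trust-level idea could be as simple as the sketch below. The function name purgeReferences() and the $maxQueued setting are hypothetical, and an array again stands in for Bayes_References.

```php
<?php
// Delete every stored reference text once the queue reaches $maxQueued
// messages; returns how many were removed. $maxQueued plays the role of
// the hypothetical 'trust level' setting.
function purgeReferences(array &$references, int $maxQueued): int
{
    $queued = count($references);
    if ($queued < $maxQueued) {
        return 0; // not enough messages yet, keep them for training
    }
    $references = [];
    return $queued;
}

$references = [];
$removed = 0;
for ($i = 1; $i <= 20; $i++) {
    $references[$i] = "message $i";
    $removed = purgeReferences($references, 20);
}
echo $removed, "\n"; // prints 20: the queue was emptied on the 20th message
```

A lower threshold means less disk space used but also less time to correct a wrong guess before the text is gone.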
 
  Last modified: 2007/11/02 16:27