Waleed Eissa

Reputation: 10533

Calculating the probability of a token being spam in a Bayesian spam filter

I recently wrote a Bayesian spam filter. I used Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references to create my own filter.

I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10000 tokens in total but only 1500 unique tokens, 1500 is used as ngood), whereas in my implementation I used the number of posts, as mentioned in Paul Graham's article. This makes me wonder which of these is better for calculating the probability (a sketch showing where the value plugs in follows the list):

  1. Post count (as mentioned in Paul Graham's article)
  2. Total unique token count (as used in the CodeProject implementation)
  3. Total token count
  4. Total included token count (i.e. those tokens with b + g >= 5)
  5. Total unique included token count
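For reference, here is a minimal sketch of Graham's per-token formula from "A Plan for Spam", written so that the option chosen above is simply the ngood/nbad value passed in; the function and variable names are mine, not from either implementation:

def token_spam_prob(g, b, ngood, nbad):
    # g, b: occurrences of the token in the ham and spam corpora
    # ngood, nbad: the denominators in question (post count, unique
    # token count, total token count, etc.)
    g2 = 2 * g                    # Graham doubles ham counts to bias
                                  # against false positives
    if g2 + b < 5:                # too rare to be statistically useful
        return None
    good = min(1.0, g2 / ngood)
    bad = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad / (good + bad)))

Note that the denominators only enter through the two ratios, so what matters is the relative size of ngood and nbad rather than their absolute scale, at least until the min(1, ...) clamps kick in.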

Upvotes: 8

Views: 2120

Answers (4)

Tony Meyer

Reputation: 10157

In general, most filters have moved past the algorithms outlined in Graham's paper. My suggestion would be to get the SpamBayes source and read the comments in spambayes/classifier.py (particularly) and spambayes/tokenizer.py (especially at the top). There's a lot of history there about the early experiments that were done to evaluate decisions like this.

FWIW, in the current SpamBayes code, the probability is calculated as follows, where spamcount and hamcount are the numbers of messages in which the token has been seen (however many times it appears in each), and nham and nspam are the total numbers of messages:

hamratio = hamcount / nham            # fraction of ham messages containing the token
spamratio = spamcount / nspam         # fraction of spam messages containing the token
prob = spamratio / (hamratio + spamratio)
S = options["Classifier", "unknown_word_strength"]
StimesX = S * options["Classifier", "unknown_word_prob"]
n = hamcount + spamcount
prob = (StimesX + n * prob) / (S + n) # smooth toward the unknown-word prior

unknown_word_strength is (by default) 0.45, and unknown_word_prob is (by default) 0.5.
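To experiment with this outside SpamBayes, where options is the classifier's option table, here is a self-contained sketch with those two defaults inlined; the function name is mine:

def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    # s = unknown_word_strength, x = unknown_word_prob (SpamBayes defaults)
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # smooth toward the prior x; rare tokens (small n) stay near 0.5
    return (s * x + n * prob) / (s + n)

# e.g. a token seen in 40 of 1200 spam messages and 3 of 1500 ham messages:
print(spamprob(3, 40, 1500, 1200))   # ~0.94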

Upvotes: 1

Yuval F

Reputation: 20621

This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, meaning the total token count, for calculating the probability. Please see the paper for the exact calculations.
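For illustration, here is the textbook per-token estimate the multinomial model builds on (multinomial naive Bayes with Laplace smoothing); this is the standard form, not code taken from the paper, and the names are mine:

def multinomial_token_prob(token_count, total_tokens, vocab_size):
    # P(token | class) with add-one smoothing; the denominator is the
    # TOTAL token count of the class, not the post or unique-token count
    return (token_count + 1) / (total_tokens + vocab_size)

# e.g. a token seen 40 times among 10000 spam tokens, with 1500 unique tokens:
print(multinomial_token_prob(40, 10000, 1500))   # ~0.0036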

Upvotes: 2

Jeff Martin

Reputation: 11022

You may want to look at POPFile, a time-tested Perl implementation. It does a very good job. I am pretty sure it is open source, so you could see what formula they use.

Upvotes: 0

Simeon Pilgrim

Reputation: 26043

Can you alter your code to use the other methods? Then you could test with a different data set and post the results.

Upvotes: 0
