Christian

Reputation: 609

Concept of Naive Bayes for demonstration purposes, how to calculate word probabilities

I need to demonstrate the Bayesian spam filter in school.

To do this I want to write a small Java application with GUI (this isn't a problem).

I just want to make sure that I really grasped the concept of the filter before starting to write my code. So I will describe what I am going to build and how I will program it, and would be really grateful if you could give a "thumbs up" or "thumbs down".

Remember: It is for a short presentation, just to demonstrate. It does not have to be performant or anything like that ;)

I imagine the program having 2 textareas.

In the first I want to enter a text, for example

"The quick brown fox jumps over the lazy dog"

I then want to have two buttons under this field with "good" or "bad".

When I hit one of the buttons, the program counts the appearances of each word in each category.

So for example, when I enter the following texts:

"hello you viagra" | bad

"hello how are you" | good

"hello drugs viagra" | bad

For words I do not know, I assume a probability of 0.5.

My "database" then looks like this:

<word>, <# times word appeared in bad message>

hello, 2
you, 1
viagra, 2
how, 0
are, 0
drugs, 1
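
If it helps to make this concrete, here is a minimal Java sketch of that counting step (the class and method names such as WordCounter and train are just illustrative, not part of the question):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the counting step; names are illustrative.
    public class WordCounter {
        private final Map<String, Integer> badCounts = new HashMap<>();
        private final Map<String, Integer> goodCounts = new HashMap<>();

        // Called when the "good" or "bad" button is pressed.
        public void train(String text, boolean isBad) {
            Map<String, Integer> counts = isBad ? badCounts : goodCounts;
            for (String word : text.toLowerCase().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        public int badCount(String word)  { return badCounts.getOrDefault(word, 0); }
        public int goodCount(String word) { return goodCounts.getOrDefault(word, 0); }
    }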

In the second textarea I then want to enter a text to evaluate if it is "good" or "bad".

So for example:

"hello how is the viagra selling"

The algorithm then takes the whole text apart and looks up, for every word, its probability of appearing in a "bad" message.

This is now where I'm stuck:

If I calculate the probability of a word appearing in a bad message as (# times it appeared in bad messages) / (# times it appeared in all messages), the above text would have 0 probability of being in either category, because:

"how" only ever appeared in a good message, so its "bad" probability is 0, while "viagra" only ever appeared in bad messages, so its "good" probability is 0.

When I now multiply the single probabilities, this gives 0 in both cases.

Could you please explain how I calculate the probability for a single word to be "good" or "bad"?

Best regards and many thanks in advance

Upvotes: 1

Views: 98

Answers (1)

Artem Sobolev

Reputation: 6079

For unseen words you would like to do Laplace smoothing. What does that mean? Having a count of zero for some word is counterintuitive, since it implies that the probability of this word is 0, which is false for any word you can imagine :-) Thus you want to add a small but positive probability to every word.
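For instance, assuming add-one smoothing (a = 1) and the counts from the question: the two bad messages contain 6 word occurrences in total and the training set has 6 distinct words, so p(how|bad) = (0 + 1) / (6 + 1 * 6) = 1/12 instead of 0, and p(viagra|bad) = (2 + 1) / 12 = 1/4. (The exact form of the normalizer is derived in UPD 2 below.)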

Also, consider using logarithms. Long messages will have many words with probability < 1. When you multiply lots of small floating-point numbers on a computer you can easily run into numerical issues (underflow). In order to overcome this, you may note that:

log (p1 * ... * pn) = log p1 + ... + log pn

So we traded n multiplications of small numbers for n additions of relatively big (and negative) ones. Then you can exponentiate the result to obtain a probability estimate.
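
A minimal Java sketch of this trick (the probability values are made up purely for illustration):

    public class LogTrick {
        public static void main(String[] args) {
            // Example per-word probabilities (values are illustrative).
            double[] wordProbs = {0.25, 0.083, 0.083};

            // Multiply many small probabilities by summing their logarithms instead.
            double logSum = 0.0;
            for (double p : wordProbs) {
                logSum += Math.log(p);   // log(p1 * ... * pn) = log p1 + ... + log pn
            }

            // Exponentiate at the very end to get back a (possibly tiny) probability.
            System.out.println(Math.exp(logSum));
        }
    }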

UPD: Actually, it's an interesting subtopic for your demo. It shows a drawback of NB (outputting zero probabilities) and a way one can fix it. And it's not an ad-hoc patch, but a result of applying the Bayesian approach (it's equivalent to adding a prior).

UPD 2: Didn't notice it the first time, but it looks like you got the Naive Bayes concept wrong, especially the Bayesian part of it.

Essentially, NB consists of 2 components:

  1. We use Bayes' rule for the posterior distribution over class labels. This gives us p(class|X) = p(X|class) p(class) / p(X), where p(X) is the same for all classes, so it doesn't have any influence on the ranking of the classes. Another way to say the same thing is that p(class|X) is proportional to p(X|class) p(class) (up to a constant). As you may have guessed already, that's where the Bayes comes from.
  2. The formula above does not make any model assumptions; it's a law of probability theory. However, it's too hard to apply directly, since p(X|class) denotes the probability of encountering message X in a class, and there is no way we would have enough data to estimate the probability of every possible message. So here comes our model assumption: we say that the words of a message are independent (which is obviously wrong, hence the method is Naive). This leads us to p(X|class) = p(x1|class) * ... * p(xn|class), where n is the number of words in X (see the sketch right after this list).
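
Here is a minimal Java sketch of that decision rule (all names are illustrative; the per-word probabilities are assumed to come from a smoothed estimate like the one described next, and working in log space as suggested above):

    import java.util.Map;

    // Sketch of the NB rule: p(class|X) is proportional to p(class) * p(x1|class) * ... * p(xn|class).
    public class NaiveBayesRule {

        // Unnormalized log-posterior for one class.
        static double logScore(String[] words, double classPrior, Map<String, Double> wordProb) {
            double score = Math.log(classPrior);                // log p(class)
            for (String w : words) {
                // + log p(x_i|class); the crude fallback stands in for words missing from the map,
                // which a proper Laplace-smoothed estimate would handle anyway.
                score += Math.log(wordProb.getOrDefault(w, 1e-6));
            }
            return score;
        }

        // Compare the two unnormalized scores; p(X) is the same for both classes and cancels out.
        static boolean isBad(String[] words, double pBad, Map<String, Double> probBad,
                             double pGood, Map<String, Double> probGood) {
            return logScore(words, pBad, probBad) > logScore(words, pGood, probGood);
        }
    }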

Now we need to somehow estimate the probabilities p(x|class). Here x is not a whole message, but just a single word. Intuitively, the probability of getting some word from a given class is equal to the number of occurrences of that word in that class divided by the total size of the class: #(word, class) / #(class) (or, we could use Bayes' rule once again: p(x|class) = p(x, class) / p(class)).

Accordingly, since p(x|class) is a distribution over the xs, we need it to sum to 1. Thus, if we apply Laplace smoothing by saying p(x|class) = (#(x, class) + a) / Z, where Z is a normalizing constant, we need to enforce the following constraint: sum_x p(x|class) = 1, or, equivalently, sum_x (#(x, class) + a) = Z. This gives us Z = #(class) + a * N, where N is the number of distinct words in the vocabulary (just the number of words, not their occurrences!).
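
A minimal Java sketch of this smoothed estimate (names and parameters are illustrative):

    import java.util.Map;

    // Sketch of the Laplace-smoothed estimate p(x|class) = (#(x, class) + a) / (#(class) + a * N).
    public class SmoothedEstimate {

        /**
         * @param word        the word x
         * @param classCounts word -> number of occurrences in this class, i.e. #(x, class)
         * @param classSize   total number of word occurrences in this class, i.e. #(class)
         * @param vocabSize   number of distinct words across all classes, i.e. N
         * @param a           smoothing constant, e.g. 1.0 for add-one smoothing
         */
        static double wordProbability(String word, Map<String, Integer> classCounts,
                                      int classSize, int vocabSize, double a) {
            int count = classCounts.getOrDefault(word, 0);
            double z = classSize + a * vocabSize;   // normalizer Z = #(class) + a * N
            return (count + a) / z;                 // always > 0, sums to 1 over the vocabulary
        }
    }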

Upvotes: 2
