ricsh

Reputation: 63

Calculation of probabilities in Naive Bayes in C#

I'm working on a Naive Bayes solution in C# where there are two possible outcomes. I have found a small code sample, but was wondering if anyone would be able to explain the last line.

The analyzer finds the probability that a word belongs to one of two categories.

cat1count is the number of times the word is found in category 1 (if the word is found 2 times in category 1, cat1count is 2, so bw would be 2 / total words in category 1).

cat1total is the total number of words in category 1.

As I understand it, bw is the probability the word is in category 1 and gw is the probability the word is in category 2.

pw and fw are where I start to get a bit lost. The full source code can be found here.

        float bw = cat1count / cat1total;
        float gw = cat2count / cat2total;
        float pw = ((bw) / ((bw) + (gw)));
        float
            s = 1f,
            x = .5f,
            n = cat1count + cat2count;
        float fw = ((s * x) + (n * pw)) / (s + n);

What is fw? I understand what bw, gw, and pw are.

Upvotes: 2

Views: 1668

Answers (1)

TooTone

Reputation: 8126

This code is called once for each word w in the text (e.g. a tweet) being analyzed. All the variables are conditional probabilities estimated from frequencies.

bw is the probability that the word w is seen given that the text is a category 1 text

gw is the probability that the word w is seen given that the text is a category 2 text

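pw rescales the probability bw relative to gw so that rarely seen words are on a similar scale to frequently seen words (mathematically, the division bw / (bw + gw) indicates that pw is a conditional probability: the probability of category 1 given the word w, if both categories are assumed equally likely beforehand)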

fw simply shifts the scale of pw so that the result can't be zero (or one). So if, for example, pw=0 and n=10, fw = ((1 * 0.5) + (10 * 0)) / (1 + 10) ≈ 0.045. (In general, a good way to understand this code is to play around with some different numbers and see what happens.)
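To play around with the numbers, here is a minimal sketch of the fw line as a standalone function. The name Smooth is made up for illustration and is not from the linked source, and it uses double instead of float. Read this way, fw is a weighted average of the default guess x (weight s) and the estimate pw (weight n), so the more often a word has been seen, the closer fw stays to pw.

```csharp
using System;

// Hypothetical helper mirroring the fw line from the question:
// a weighted average of the default guess x (weight s) and the
// frequency-based estimate pw (weight n).
static double Smooth(double pw, double n, double s = 1.0, double x = 0.5)
{
    return (s * x + n * pw) / (s + n);
}

// pw = 0 after 10 sightings: nudged slightly away from zero.
Console.WriteLine(Smooth(0.0, 10));    // about 0.045
// pw = 1 after 10 sightings: nudged slightly below one.
Console.WriteLine(Smooth(1.0, 10));    // about 0.955
// pw = 0 after a single sighting: pulled much further towards x.
Console.WriteLine(Smooth(0.0, 1));     // 0.25
// pw = 0 after 1000 sightings: stays very close to pw.
Console.WriteLine(Smooth(0.0, 1000));  // about 0.0005
```

Increasing s makes the default guess x count for more; with s = 0 the function just returns pw.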

In Naive Bayes, as you may know, the conditional probabilities are multiplied together (in this case via the LogProbability function in the GitHub Analyzer.cs file you pointed me at). That means you're in trouble if you have a zero conditional probability anywhere in the multiplications, because the end result would be zero. So it's common practice to substitute a small number instead of zero, which is the purpose of fw.
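A small sketch of why the zero matters. The probabilities below are made-up values, and Product and SumOfLogs are illustrative helpers, not the LogProbability function from Analyzer.cs (whose exact implementation I'm not assuming here).

```csharp
using System;

// Made-up per-word probabilities for one text. The second word was
// never seen in this category, so its raw estimate is zero.
double[] raw = { 0.8, 0.0, 0.6 };
// The same words with a small smoothed value in place of the zero.
double[] smoothed = { 0.8, 0.045, 0.6 };

// Multiply the probabilities directly.
static double Product(double[] ps)
{
    double result = 1.0;
    foreach (double p in ps) result *= p;
    return result;
}

// Add the logarithms instead, which is the same product on a log scale.
static double SumOfLogs(double[] ps)
{
    double sum = 0.0;
    foreach (double p in ps) sum += Math.Log(p);
    return sum;
}

Console.WriteLine(Product(raw));         // 0: a single zero wipes out the other words
Console.WriteLine(Product(smoothed));    // small but nonzero
Console.WriteLine(SumOfLogs(smoothed));  // finite negative number
Console.WriteLine(double.IsNegativeInfinity(SumOfLogs(raw))); // True, since Math.Log(0) is negative infinity
```

Working with logs does not remove the problem on its own: a zero probability becomes negative infinity rather than zero, which is why the substitution is still needed.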

Upvotes: 1
