Reputation: 63
I'm working on a Naive Bayes solution in C# where there are two possible outcomes. I found a small code sample and was wondering if anyone could explain its last line.
The analyzer finds the probability that a word belongs to one of two categories.
cat1count is the number of times the word is found in category 1 (if the word is found 2 times in category 1, cat1count is 2 and bw would be 2 / total words in category 1).
cat1total is the total number of words in category 1.
As I understand it, bw is the probability the word is in category 1 and gw is the probability the word is in category 2.
pw and fw are where I start to get a bit lost. The full source code can be found here.
float bw = cat1count / cat1total;
float gw = cat2count / cat2total;
float pw = ((bw) / ((bw) + (gw)));
float s = 1f,
      x = .5f,
      n = cat1count + cat2count;
float fw = ((s * x) + (n * pw)) / (s + n);
What is fw? I understand what bw, gw, and pw are.
Upvotes: 2
Views: 1668
Reputation: 8126
This code is called over and over again, once for each word w in the text (e.g. a tweet) being analyzed. The variables bw, gw, pw, and fw are all conditional probabilities estimated from frequencies.
bw is the probability that the word w is seen given that the text is a category 1 text
gw is the probability that the word w is seen given that the text is a category 2 text
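pw rescales the probability bw so that rarely seen words are on a similar scale to frequently seen words (mathematically, dividing bw by bw + gw makes pw a conditional probability: the chance that the text is category 1 given that it contains w, assuming both categories are equally likely beforehand)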
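fw simply shifts the scale so that the result can't be zero (or one). It is a weighted average of the neutral guess x = 0.5 (with weight s) and pw (with weight n), so a word seen only a few times stays close to 0.5. So if, for example, pw=0 and n=10, fw = ((1 * 0.5) + (10 * 0)) / (1 + 10) ≈ 0.045. (In general, a good way to understand this code is to play around with some different numbers and see what happens.)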
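If you want to play with the numbers, here is a minimal sketch that wraps the lines from the question in a function. The name CalcFw and the sample counts are made up for illustration, and I'm assuming the counts are passed as floats so the divisions aren't integer divisions. It uses C# 9 top-level statements to stay short.

```csharp
using System;

// Hypothetical wrapper around the snippet from the question.
static float CalcFw(float cat1count, float cat1total, float cat2count, float cat2total)
{
    float bw = cat1count / cat1total;   // frequency of the word in category 1
    float gw = cat2count / cat2total;   // frequency of the word in category 2
    float pw = bw / (bw + gw);          // NaN (0/0) if the word is in neither category
    float s = 1f,                       // weight of the neutral guess
          x = .5f,                      // the neutral guess itself
          n = cat1count + cat2count;    // how often the word was seen overall
    return ((s * x) + (n * pw)) / (s + n);
}

// Seen 0 times in category 1, 10 times in category 2: pw = 0, fw is about 0.045.
Console.WriteLine(CalcFw(0f, 100f, 10f, 100f));

// Seen once in category 1, never in category 2: pw = 1, but fw is only 0.75.
Console.WriteLine(CalcFw(1f, 100f, 0f, 100f));

// Seen 10 times in category 1, never in category 2: pw = 1, fw is about 0.955.
Console.WriteLine(CalcFw(10f, 100f, 0f, 100f));
```

The last two calls have the same pw, but the word with more observations is allowed to move further away from 0.5.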
In Naive Bayes, as you may know, the conditional probabilities are multiplied together (in this case via the LogProbability function in the GitHub Analyzer.cs file you pointed me at), so you're in trouble if you have a zero conditional probability anywhere in the multiplications, because the end result would be zero. So, it's common practice to substitute a small number instead of zero, which is the purpose of fw.
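To see the zero problem concretely, here is a rough sketch. CombineInLogSpace is a hypothetical helper written for this answer, not the LogProbability function from Analyzer.cs, and the probabilities are invented. It multiplies per-word probabilities by summing their logarithms, which is a common way to do the product.

```csharp
using System;

// Hypothetical helper: product of probabilities, computed as exp(sum of logs).
static double CombineInLogSpace(double[] probs)
{
    double logSum = 0.0;
    foreach (double p in probs)
    {
        logSum += Math.Log(p);   // Math.Log(0) is negative infinity
    }
    return Math.Exp(logSum);     // Math.Exp(negative infinity) is 0
}

// Third word was never seen in this category, so its raw probability is 0.
double[] raw = { 0.9, 0.8, 0.0 };

// Same words, with the zero replaced by a small fw-style value.
double[] smoothed = { 0.9, 0.8, 0.045 };

Console.WriteLine(CombineInLogSpace(raw));       // 0: one zero wipes out everything
Console.WriteLine(CombineInLogSpace(smoothed));  // small but nonzero, about 0.0324
```

With the raw values, the two strong words count for nothing. With the smoothed value they still contribute to the result.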
Upvotes: 1