celsowm

Reputation: 404

Difficulty getting the correct posterior value in a Naive Bayes implementation

For studying purposes, I've tried to implement this "lesson" using Python, but without scikit-learn or anything similar.

My attempt is the following code:

import pandas, math

training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():
        word_frequency_per_labels.append([w,f,l])

word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l,math.prod(p)])

print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)

In the blog lesson, their results are:

[screenshot of the expected posterior values from the blog post]

But my results were:

[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]

So, what did I do wrong in my python implementation? How can I get the same results?

Thanks in advance

Upvotes: 2

Views: 120

Answers (2)

pietroppeter

Reputation: 1473

The answer by @nick is correct and should be awarded the bounty.

Here is an alternative implementation (from scratch, not using pandas) that also supports normalization of the probabilities and words that are not in the training set:

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set

def tokenize(text: str):
    return [word.lower() for word in text.split()]

def normalize(result: Dict[str, float]):
    total = sum([v for v in result.values()])
    for k in result.keys():
        result[k] /= total

@dataclass
class Model:
    labels: Set[str] = field(default_factory=set)
    words: Set[str] = field(default_factory=set)
    prob_labels: Dict[str,float] = field(default_factory=lambda: defaultdict(float)) # P(label)
    prob_words: Dict[str,Dict[str,float]] = field(default_factory=lambda: defaultdict(lambda: defaultdict(float)))  # P(word | label) as prob_words[label][word]

    
    def predict(self, text: str, norm=True) -> Dict[str, float]: # P(label | text) as model.predict(text)[label]
        # start from the prior P(label) and multiply by P(word | label) for every known word
        result = {label: self.prob_labels[label] for label in self.labels}
        for word in tokenize(text):
            for label in self.labels:
                if word in self.words:  # words never seen in training are simply skipped
                    result[label] *= self.prob_words[label][word]
        if norm:
            normalize(result)
        return result

    def train(self, data):
        # count how many texts carry each label and how often each word appears per label
        prob_words_denominator = defaultdict(int)
        for row in data:
            text = row[0]
            label = row[1].lower()
            self.labels.add(label)
            self.prob_labels[label] += 1.0
            for word in tokenize(text):
                self.words.add(word)
                self.prob_words[label][word] += 1.0
                prob_words_denominator[label] += 1.0
        # turn the counts into probabilities, applying Laplace smoothing to the word counts
        for label in self.labels:
            self.prob_labels[label] /= len(data)
            for word in self.words:
                self.prob_words[label][word] = (self.prob_words[label][word] + 1.0) / (prob_words_denominator[label] + len(self.words))
            
            
training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'

model = Model()
model.train(training_data)
print(model.predict(text_to_predict, norm=False))
print(model.predict(text_to_predict))
print(model.predict("none of these words is in training data"))

output:

{'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
{'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}
{'sports': 0.6, 'not sports': 0.4}
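
The first line is the unnormalized posterior (prior times the smoothed word likelihoods). The second line normalizes those values so they sum to 1, e.g. 2.7648e-05 / (2.7648e-05 + 5.7175e-06) ≈ 0.829 for sports. The last line shows that when none of the words were seen in training, the prediction falls back to the priors, 3/5 vs 2/5.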

Upvotes: 2

nick

Reputation: 1350

You haven't multiplied by the priors p(Sports) = 3/5 and p(Not sports) = 2/5. Scaling your answers by these priors will get you to the correct result. Everything else looks good.

For example, your math.prod(p) calculation computes p(a|Sports) x p(very|Sports) x p(close|Sports) x p(game|Sports), but this leaves out the term p(Sports). Adding it in (and doing the same for the "Not sports" case) fixes things.

In code this can be achieved by:

prior = (data_frame.label == l).mean()  # fraction of training rows with label l, i.e. p(l)
results.append([l, prior * math.prod(p)])
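
In context, the question's final loop with the prior included would look roughly like this (a sketch reusing the variable names from the question's code):

results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    prior = (data_frame.label == l).mean()  # p(l): 3/5 for sports, 2/5 for not sports
    results.append([l, prior * math.prod(p)])  # posterior is proportional to p(l) * product of p(w|l)

That scales the question's numbers to 0.6 x 4.608e-05 ≈ 2.765e-05 for sports and 0.4 x 1.429e-05 ≈ 5.72e-06 for not sports, matching the unnormalized output in the other answer.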

Upvotes: 2
