Oliver

Reputation: 128

Custom Naive Bayes Implementation vs sklearn.naive_bayes for multimodal data

I am currently playing around with some classification using the Naive Bayes algorithm. It is normally assumed that p(x|C_i) is Gaussian, so I would expect this approach to perform poorly when that assumption is not met, e.g. when the data distribution is a mixture of Gaussians. To check this, I implemented a very basic test, once using GaussianNB from sklearn and once with my own implementation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import GaussianNB

def gaussian(x, mu, sig):
    return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.)))

x = np.arange(0, 10, 0.1)

MULTIMODAL = False # toggle for unimodal / multimodal

def class1_proba(x):
    if MULTIMODAL:
        return gaussian(x, 5, 2)
    else:
        return gaussian(x, 1, 2)

def class2_proba(x):
    if MULTIMODAL:
        return 1 * gaussian(x, 9, 2)  + 1 * gaussian(x, 1, 2)
    else:
        return gaussian(x, 9, 2)

class1 = class1_proba(x)
class2 = class2_proba(x)

data_x = np.expand_dims(np.random.rand(10000) * 10, axis=1)
data_y = np.squeeze(np.argmax(np.asarray([class1_proba(data_x), class2_proba(data_x)]), axis=0))

model = GaussianNB()
model.fit(data_x, data_y)
predicted = model.predict(data_x)
print("NB:")
print(accuracy_score(data_y, predicted))

pC1 = data_y[data_y == 0].shape[0] / float(data_y.shape[0])
pC2 = data_y[data_y == 1].shape[0] / float(data_y.shape[0])

mean1 = np.mean(data_x[data_y == 0])
mean2 = np.mean(data_x[data_y == 1])

var1 = np.var(data_x[data_y == 0])
var2 = np.var(data_x[data_y == 1])

def get_prediction(x):
    p1 = pC1 * gaussian(x, mean1, var1)
    p2 = pC2 * gaussian(x, mean2, var2)
    if p1 > p2:
        return 0
    else:
        return 1

custom_prediction = []
for i in range(data_x.shape[0]):
    custom_prediction.append(get_prediction(data_x[i]))

custom_prediction = np.asarray(custom_prediction)
print("Custom:")
print(accuracy_score(data_y, custom_prediction))

plt.plot(x, class1)
plt.plot(x, class2)
# plt.scatter(data_x, data_y)
plt.scatter(data_x, custom_prediction)
plt.show()

When each p(x|C_i) follows a single Gaussian, both methods give very good accuracy (99 %), as expected. However, when p(x|C_i) is changed to be more complex, my simple method fails (which I would also expect), but GaussianNB still gives very good results (90 % accuracy).

It would be great if you could give me your thoughts on this. Why does GaussianNB still work when the data is multimodal? Is it suited for this? In the documentation / code I could not find anything ... Do I have a stupid bug in my implementation? Why does it give a different result?

Thanks a lot!

Upvotes: 2

Views: 203

Answers (1)

Sardor Abdirayimov

Reputation: 109

There is indeed some secret sauce in the scikit-learn implementation.

Log probability vs Ordinary Probability:

They are using log probabilities, whereas your code works with ordinary probabilities.

I have modified your gaussian function to be a properly normalized density (note that sig is the variance you pass in), which gives the same results as GaussianNB in both the unimodal and multimodal cases:

def gaussian(x, mu, sig):
    # normalized Gaussian density; sig is treated as the variance here
    return (1 / (np.sqrt(2 * np.pi * sig))) * np.exp(-np.power(x - mu, 2.) / (2 * sig))

Working with log probabilities prevents numerical underflow and improves stability.
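
For illustration, a log-space version of your decision rule could look like the sketch below (it reuses pC1, pC2, mean1, mean2, var1 and var2 from your script; log_gaussian and get_prediction_log are names I made up, and the last argument is treated as the variance):

import numpy as np

def log_gaussian(x, mu, var):
    # log of the normalized Gaussian density; var is the variance
    return -0.5 * np.log(2 * np.pi * var) - np.power(x - mu, 2.) / (2 * var)

def get_prediction_log(x):
    # compare the joint log probabilities log p(C_i) + log p(x | C_i)
    logp1 = np.log(pC1) + log_gaussian(x, mean1, var1)
    logp2 = np.log(pC2) + log_gaussian(x, mean2, var2)
    return 0 if logp1 > logp2 else 1

Since the log is monotonic, the argmax over the joint log probabilities is the same as the argmax over the probabilities, but products of very small numbers become sums of moderate ones.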

There are other minor things inside the scikit-learn implementation:

For more information, check out the _partial_fit method of the GaussianNB class; it uses an epsilon variable to smooth the per-class variances.
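
Roughly speaking (this is a sketch of the idea, not scikit-learn's exact code), a small epsilon proportional to the largest feature variance is added to every per-class variance so the densities never collapse:

import numpy as np

# Sketch of the variance smoothing idea: add a tiny epsilon, proportional to
# the largest feature variance, to every per-class variance.
var_smoothing = 1e-9  # the default of GaussianNB's var_smoothing parameter
epsilon = var_smoothing * np.var(data_x, axis=0).max()
var1_smoothed = var1 + epsilon
var2_smoothed = var2 + epsilon

Here data_x, var1 and var2 are the variables from your script.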

Upvotes: 0
