Reputation: 128
I am currently playing around with some classification tasks using the Naive Bayes algorithm. For Gaussian Naive Bayes it is assumed that p(x|C_i) is Gaussian. Under this assumption I would expect the approach to perform poorly when the assumption is not met, e.g. when the class-conditional distribution is a mixture of Gaussians. To test this I implemented a very basic experiment, once using GaussianNB from sklearn and once with my own implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
def gaussian(x, mu, sig):
    return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.)))
x = np.arange(0, 10, 0.1)
MULTIMODAL = False # toggle for unimodal / multimodal
def class1_proba(x):
    if MULTIMODAL:
        return gaussian(x, 5, 2)
    else:
        return gaussian(x, 1, 2)

def class2_proba(x):
    if MULTIMODAL:
        return 1 * gaussian(x, 9, 2) + 1 * gaussian(x, 1, 2)
    else:
        return gaussian(x, 9, 2)
class1 = class1_proba(x)
class2 = class2_proba(x)
data_x = np.expand_dims(np.random.rand(10000) * 10, axis=1)
data_y = np.squeeze(np.argmax(np.asarray([class1_proba(data_x), class2_proba(data_x)]), axis=0))
model = GaussianNB()
model.fit(data_x, data_y)
predicted = model.predict(data_x)
print("NB:")
print(accuracy_score(data_y, predicted))
pC1 = data_y[data_y == 0].shape[0] / float(data_y.shape[0])
pC2 = data_y[data_y == 1].shape[0] / float(data_y.shape[0])
mean1 = np.mean(data_x[data_y == 0])
mean2 = np.mean(data_x[data_y == 1])
var1 = np.var(data_x[data_y == 0])
var2 = np.var(data_x[data_y == 1])
def get_prediction(x):
    p1 = pC1 * gaussian(x, mean1, var1)
    p2 = pC2 * gaussian(x, mean2, var2)
    if p1 > p2:
        return 0
    else:
        return 1
custom_prediction = []
for i in range(data_x.shape[0]):
    custom_prediction.append(get_prediction(data_x[i]))
custom_prediction = np.asarray(custom_prediction)
print("Custom:")
print(accuracy_score(data_y, custom_prediction))
plt.plot(x, class1)
plt.plot(x, class2)
# plt.scatter(data_x, data_y)
plt.scatter(data_x, custom_prediction)
plt.show()
When each p(x|C_i) follows a single Gaussian, both methods give very good accuracy (99 %), as expected. However, when p(x|C_i) is changed to be more complex (the multimodal case), my simple method fails (which I would also expect), but GaussianNB still gives very good results (90 % accuracy).
It would be great if you could give me your thoughts on this. Why does GaussianNB still work when the data is multimodal? Is it suited for this? In the docs / code I could not find anything. Do I have a stupid bug in my implementation? Why does it give a different result?
Thanks a lot!
Upvotes: 2
Views: 203
Reputation: 109
There is indeed some secret sauce in the scikit-learn implementation.
They work with log probabilities, whereas your code works with ordinary probabilities. On top of that, your gaussian function is not a normalized density, and you pass it the variances var1 and var2 where it expects a standard deviation.
I have modified your gaussian function to be the full normalized Gaussian density (with sig now interpreted as the variance), and it then returns the same results in both the unimodal and the multimodal case:
def gaussian(x, mu, sig):
    return (1 / (np.sqrt(2 * np.pi * sig))) * np.exp(-np.power(x - mu, 2.) / (2 * sig))
Working with log probabilities prevents numerical underflow and improves stability. Read more here.
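As a rough sketch of what that means in practice (my own illustration, not the scikit-learn code), the decision rule of your get_prediction can be computed entirely in log space; norm.logpdf from scipy.stats is just one convenient way to get the log density, and pC1, pC2, mean1, mean2, var1, var2 are the variables from your script:

import numpy as np
from scipy.stats import norm

def get_prediction_log(x):
    # compare log p(C_i) + log p(x | C_i) instead of multiplying raw probabilities,
    # so tiny densities in the tails no longer underflow to zero;
    # norm.logpdf expects a standard deviation, hence the sqrt of the variance
    logp1 = np.log(pC1) + norm.logpdf(x, loc=mean1, scale=np.sqrt(var1))
    logp2 = np.log(pC2) + norm.logpdf(x, loc=mean2, scale=np.sqrt(var2))
    return 0 if logp1 > logp2 else 1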
For more information, check out the _partial_fit method of the GaussianNB class. Note how they use the epsilon variable.
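Roughly, that epsilon amounts to the variance smoothing sketched below (this is my reading of _partial_fit, so treat the exact constant and expression as an assumption rather than the exact scikit-learn code): a small value proportional to the largest feature variance is added to every per-class variance, which keeps a Gaussian from collapsing when a class has near-zero variance.

import numpy as np

var_smoothing = 1e-9  # default of GaussianNB's var_smoothing parameter
# epsilon is proportional to the largest feature variance of the training data
epsilon = var_smoothing * np.var(data_x, axis=0).max()

# add it to the per-class variances from your script before evaluating the densities
var1_smoothed = var1 + epsilon
var2_smoothed = var2 + epsilon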
Upvotes: 0