Giorgos Myrianthous

Reputation: 39820

sklearn: Naive Bayes classifier gives low accuracy

I have a dataset which includes 200,000 labelled training examples. Each training example has 10 features, both continuous and discrete. I'm trying to use Python's sklearn package to train the model and make predictions, but I'm running into some trouble (and have some questions too).

First let me write the code which I have written so far:

from sklearn.naive_bayes import GaussianNB
# data contains the 200,000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)

The problem is that I get really low accuracy (too many misclassified labels), around 20%. However, I'm not sure whether the problem lies with the data (e.g. more data is needed, or something else) or with the code.

Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

Furthermore, in machine learning we know that the dataset should be split into training and validation/testing sets. Does sklearn do this automatically, or should I fit the model on the training set and then call predict on the validation set?

Any thoughts or suggestions will be much appreciated.

Upvotes: 3

Views: 5001

Answers (1)

lejlot

Reputation: 66805

The problem is that I get really low accuracy (too many misclassified labels), around 20%. However, I'm not sure whether the problem lies with the data (e.g. more data is needed, or something else) or with the code.

This is not a big error for Naive Bayes; it is an extremely simple classifier and you should not expect it to be strong, so more data probably won't help. Your Gaussian estimators are probably already quite good; the naive independence assumption itself is the problem. Use a stronger model. You can start with a Random Forest, since it is very easy to use even for non-experts in the field.
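As a sketch, a Random Forest is almost a drop-in replacement for `GaussianNB` (synthetic data stands in here for the real 200,000 x 10 arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset of 10 mixed features
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 10))
targets = (data[:, 0] + data[:, 1] > 0).astype(int)

# A forest of decision trees handles mixed continuous/discrete
# features and non-linear interactions with very little tuning.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data, targets)
predicted = clf.predict(data)
```

Note that, as discussed below, accuracy should be measured on a held-out set rather than on the training data used here.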

Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

No, it is not. You should use different distributions for the discrete features; however, scikit-learn does not support mixing distributions in a single Naive Bayes estimator, so you would have to do this manually. As said before, change your model.
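If you do want to stay with Naive Bayes, one manual way to mix distributions is to fit a separate model per feature type and combine their log-posteriors. This is only a sketch, under the assumption that the discrete features are binary: `GaussianNB` covers the continuous columns, `BernoulliNB` the binary ones, and one copy of the class log-prior (counted by both models) is subtracted before taking the argmax.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Synthetic example: 3 continuous and 2 binary features
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
X_cont = rng.normal(loc=y[:, None], size=(n, 3))
X_disc = (rng.random((n, 2)) < 0.3 + 0.4 * y[:, None]).astype(int)

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_disc, y)

# Under the naive independence assumption the joint posterior factorises
# across feature groups. Each model's predict_log_proba already includes
# the class prior once, so subtract one copy before adding them.
log_prior = np.log(gnb.class_prior_)
combined = (gnb.predict_log_proba(X_cont)
            + bnb.predict_log_proba(X_disc)
            - log_prior)
pred = gnb.classes_[np.argmax(combined, axis=1)]
```

Both models must be fitted on the same labels so their empirical class priors agree; otherwise the subtraction above is not valid.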

Furthermore, in machine learning we know that the dataset should be split into training and validation/testing sets. Does sklearn do this automatically, or should I fit the model on the training set and then call predict on the validation set?

Nothing is done automatically in this manner; you need to do it yourself (scikit-learn has lots of tools for this: see the cross-validation utilities in `sklearn.model_selection`).
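A sketch of the usual tools, again with synthetic data standing in for the real arrays: `train_test_split` for a single hold-out split, and `cross_val_score` for k-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
targets = (data[:, 0] > 0).astype(int)

# Single hold-out split: fit on the training part, score on the rest
X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.2, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
holdout_acc = gnb.score(X_test, y_test)

# 5-fold cross-validation gives a less noisy estimate of generalisation
cv_scores = cross_val_score(GaussianNB(), data, targets, cv=5)
```

Scoring on the same data the model was fitted on (as in the question's `gnb.predict(data)`) measures training accuracy, which overestimates real performance.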

Upvotes: 7
