Giorgos Myrianthous

Reputation: 39820

sklearn: Naive Bayes classifier gives low accuracy

I have a dataset which includes 200,000 labelled training examples. Each training example has 10 features, both continuous and discrete. I'm trying to use Python's sklearn package to train the model and make predictions, but I'm running into some trouble (and have some questions too).

First let me write the code which I have written so far:

from sklearn.naive_bayes import GaussianNB
# data contains the 200,000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)

The problem is that I get really low accuracy (too many misclassified labels), around 20%. However, I'm not sure whether the problem lies with the data (e.g. more data is needed, or something else) or with the code.

Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

Furthermore, in machine learning we know that the dataset should be split into training and validation/testing sets. Does sklearn do this automatically, or should I fit the model on the training set and then call predict on the validation set?

Any thoughts or suggestions will be much appreciated.

Upvotes: 3

Views: 5001

Answers (1)

lejlot

Reputation: 66805

The problem is that I get really low accuracy (too many misclassified labels), around 20%. However, I'm not sure whether the problem lies with the data (e.g. more data is needed, or something else) or with the code.

This is not a big error for Naive Bayes; it is an extremely simple classifier and you should not expect it to be strong, so more data probably won't help. Your Gaussian estimators are probably already quite good; the naive independence assumption itself is the problem. Use a stronger model. You can start with a Random Forest, since it is very easy to use even for non-experts in the field.
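As a sketch, a Random Forest is almost a drop-in replacement for `GaussianNB` (synthetic data stands in here for the real 200,000 x 10 arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset of 10 mixed features
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 10))
targets = (data[:, 0] + data[:, 1] > 0).astype(int)

# A forest of decision trees handles mixed continuous/discrete
# features and non-linear interactions with very little tuning.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data, targets)
predicted = clf.predict(data)
```

Note that, as discussed below, accuracy should be measured on a held-out set rather than on the training data used here.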

Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

No, it is not. You should use different distributions for the discrete features; however, scikit-learn does not support mixing distributions in a single Naive Bayes estimator, so you would have to do this manually. As said before, change your model.
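If you do want to stay with Naive Bayes, one manual way to mix distributions is to fit a separate model per feature type and combine their log-posteriors. This is only a sketch, under the assumption that the discrete features are binary: `GaussianNB` covers the continuous columns, `BernoulliNB` the binary ones, and one copy of the class log-prior (counted by both models) is subtracted before taking the argmax.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

# Synthetic example: 3 continuous and 2 binary features
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
X_cont = rng.normal(loc=y[:, None], size=(n, 3))
X_disc = (rng.random((n, 2)) < 0.3 + 0.4 * y[:, None]).astype(int)

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_disc, y)

# Under the naive independence assumption the joint posterior factorises
# across feature groups. Each model's predict_log_proba already includes
# the class prior once, so subtract one copy before adding them.
log_prior = np.log(gnb.class_prior_)
combined = (gnb.predict_log_proba(X_cont)
            + bnb.predict_log_proba(X_disc)
            - log_prior)
pred = gnb.classes_[np.argmax(combined, axis=1)]
```

Both models must be fitted on the same labels so their empirical class priors agree; otherwise the subtraction above is not valid.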

Furthermore, in machine learning we know that the dataset should be split into training and validation/testing sets. Does sklearn do this automatically, or should I fit the model on the training set and then call predict on the validation set?

Nothing is done automatically in this manner; you need to do it yourself (scikit-learn has lots of tools for this: see the cross-validation utilities in `sklearn.model_selection`).
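A sketch of the usual tools, again with synthetic data standing in for the real arrays: `train_test_split` for a single hold-out split, and `cross_val_score` for k-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
targets = (data[:, 0] > 0).astype(int)

# Single hold-out split: fit on the training part, score on the rest
X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.2, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
holdout_acc = gnb.score(X_test, y_test)

# 5-fold cross-validation gives a less noisy estimate of generalisation
cv_scores = cross_val_score(GaussianNB(), data, targets, cv=5)
```

Scoring on the same data the model was fitted on (as in the question's `gnb.predict(data)`) measures training accuracy, which overestimates real performance.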

Upvotes: 7
