Nikola

Reputation: 890

Classifier fit and predict on the same data give different results

I am training a classifier using sklearn and I am doing something wrong. In the code below I put exactly the same values for training and predicting and the results are not the same. How does this happen?

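import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(X, y)

predicted = text_clf.predict(X)

# Compares the sets of distinct labels, not the per-sample predictions
print(set(np.asarray(y)) == set(predicted))  # gives False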

X is a list of unicode strings and y is a list of numeric labels (1 and 0).

Upvotes: 1

Views: 1054

Answers (1)

Prune

Reputation: 77837

In general, no: predictions on the training data will not match the training labels exactly. Unless you use a method that keeps training until it reaches 100% training accuracy, the fit will be less than perfect.
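As a rough illustration, here is a minimal sketch using the same pipeline as in the question. The toy documents and labels are made up for this example, and accuracy_score is used for a per-sample comparison, which the question's code does not do: comparing set(y) with set(predicted) only checks which distinct labels occur. The first two documents are deliberately identical but labelled differently, so the model cannot get both right.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not from the question). Documents 0 and 1 are
# identical but carry different labels, so a perfect fit is impossible.
X = ["great movie", "great movie", "terrible film", "awful film", "great acting"]
y = np.array([1, 0, 0, 0, 1])

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(X, y)
predicted = text_clf.predict(X)

# Per-sample comparison: fraction of training samples predicted correctly.
train_accuracy = accuracy_score(y, predicted)
# Indices of the training samples the model got wrong.
misclassified = np.flatnonzero(predicted != y)

print("training accuracy:", train_accuracy)
print("misclassified indices:", misclassified)
```

Printing the misclassified indices on your own X and y should show exactly which training samples the model disagrees with.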

Forcing 100% training accuracy in ML generally causes over-fitting: the model becomes so specifically adapted to the training set that it gives unacceptably poor performance on any later (e.g. real-world) input.
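To make the over-fitting point concrete, here is a small sketch under stated assumptions. It swaps in a DecisionTreeClassifier with no depth limit (not the MultinomialNB from the question), because an unconstrained tree keeps splitting until it reproduces the training labels. The data is synthetic and the labels are pure noise, so there is nothing real to learn and a perfect training score can only come from memorization.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration: random features and random
# labels, so features carry no information about the labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(200, 5))
y_test = rng.integers(0, 2, size=200)

# No depth limit: the tree grows until every training sample is classified
# correctly, i.e. it is forced to 100% training accuracy.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two scores is the thing to look at: the training score says nothing about how the model behaves on data it has not seen.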

If you require 100% accuracy, then Machine Learning is altogether the wrong paradigm for your problem. You need deterministic classification, not an adaptive heuristic.

Upvotes: 4
