Reputation: 890
I am training a classifier with sklearn and I am doing something wrong. In the code below I use exactly the same data for training and for prediction, yet the predictions do not match the training labels. How can this happen?
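import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
text_clf = text_clf.fit(X, y)     # train on X, y
predicted = text_clf.predict(X)   # predict on the same X
print(set(np.asarray(y)) == set(predicted))  # gives False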
X is a list of unicode strings and y is a list of numeric labels (1 and 0).
Upvotes: 1
Views: 1054
Reputation: 77837
In general, no: the predictions on the training set will not equal the training labels. Unless you use a method that stops only once it reaches 100% training accuracy, the fit will be less than perfect.
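To make this concrete, here is a minimal sketch that fits the same kind of pipeline and measures training accuracy element by element with `accuracy_score` (the set comparison in the question only checks which label values occur, not whether each prediction matches its label). The toy texts and labels are hypothetical and deliberately contain identical texts with conflicting labels, which is one simple situation where a perfect fit is impossible for any deterministic classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (an assumption, not the asker's data):
# each text appears three times, and the labels within each group conflict.
X = ["great film", "great film", "great film",
     "awful film", "awful film", "awful film"]
y = [1, 1, 0, 0, 0, 1]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(X, y)
predicted = text_clf.predict(X)

# Fraction of training examples whose prediction equals its label.
train_accuracy = accuracy_score(y, predicted)
print("training accuracy:", train_accuracy)
```

Identical inputs always receive identical predictions, so at most two of the three labels in each group can be matched and the training accuracy here cannot exceed 4/6.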
Forcing 100% accuracy in ML generally causes over-fitting: the model becomes so specifically adapted to the training set that it performs unacceptably poorly on any later (e.g. real-world) input.
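As a rough illustration of that trade-off, the sketch below uses a 1-nearest-neighbour rule, which simply memorizes the training set, on synthetic noisy data. Everything here is an assumption made for the example: the data generator, the 30% label-noise rate, the seed, and the helper names `make_data`, `predict_nearest` and `accuracy`. It is written in plain Python so it runs without extra dependencies.

```python
import random

def make_data(n, rng, noise=0.3):
    """Return n (x, label) pairs; label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # simulate a mislabelled example
        data.append((x, label))
    return data

def predict_nearest(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(train, data):
    """Fraction of points in `data` whose label the memorizing model reproduces."""
    correct = sum(predict_nearest(train, x) == label for x, label in data)
    return correct / len(data)

rng = random.Random(0)       # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)   # fresh points the model has never seen

train_accuracy = accuracy(train, train)
test_accuracy = accuracy(train, test)
print("training accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

Each training point is its own nearest neighbour, so the training accuracy is perfect by construction (assuming no two training points share the same x), while the accuracy on the fresh points is lower because the model has also memorized the mislabelled examples.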
If you require 100% accuracy, then Machine Learning is altogether the wrong paradigm for your problem. You need deterministic classification, not an adaptive heuristic.
Upvotes: 4