Reputation: 815
I'm training a corpus consisting of 200000 reviews into positive and negative reviews using a Naive Bayes model, and I noticed that performing TF-IDF actually reduced the accuracy (while testing on test set of 50000 reviews) by about 2%. So I was wondering if TF-IDF has any underlying assumptions on the data or model that it works with, i.e. any cases where accuracy is reduced by the use of it?
Upvotes: 2
Views: 4781
Reputation: 1317
The IDF component of TF*IDF can harm your classification accuracy in some cases.
Let suppose the following artificial, easy classification task, made for the sake of illustration:
Suppose now that in Class A, you have 100 000 examples and in class B, 1000 examples.
What will happen to TFIDF? The inverse document frequency of corn will be very low (because it is found in almost all documents), and the feature 'corn' will get a very small TFIDF, which is the weight of the feature used by the classifier. Obviously, 'corn' was THE best feature for this classification task. This is an example where TFIDF may reduce your classification accuracy. In more general terms:
Upvotes: 11
Reputation: 547
You can heuristically determine whether the usage of IDF on your training data decreases your predictive accuracy by performing grid search as appropriate. For example, if you are working in sklearn, and you want to determine whether IDF decreases the predictive accuracy of your model, you can perform a grid search on the use_idf parameter of the TfidfVectorizer. As an example, this code would implement the gridsearch algorithm on the selection of IDF for classification with SGDClassifier (you must import all the objects being instantiated first):
# import all objects first
X = # your training data
y = # your labels
pipeline = Pipeline([('tfidf',TfidfVectorizer()),
('sgd',SGDClassifier())])
params = {'tfidf__use_idf':(False,True)}
gridsearch = GridSearch(pipeline,params)
gridsearch.fit(X,y)
print(gridsearch.best_params_)
The output would be either:
Parameters selected as the best fit:
{'tfidf__use_idf': False}
or
{'tfidf__use_idf': True}
Upvotes: 2
Reputation: 1187
TF-IDF as far as I understand is a feature. TF is term frequency i.e. frequency of occurence in a document. IDF is inverse document frequncy i.e frequency of documents in which the term occurs.
Here, the model is using the TF-IDF info in the training corpus to estimate the new documents. For a very simple example, Say a document with word bad has pretty high term frequency of word bad in training set will sentiment label as negative. So, any new document containing bad will be more likely to be negative.
For the accuracy you can manaually select training corpus which contains mostly used negative or positive words. This will boost the accuracy.
Upvotes: 0