Reputation: 39
I am working on a multilabel text classification problem with 10 labels. The dataset is small: ~7000 items and ~7500 labels in total. I am using Python's scikit-learn, and something strange came up in the results. As a baseline I started out with CountVectorizer, planning to move to TfidfVectorizer, which I thought would work better. But it doesn't: with CountVectorizer I get an f1-score about 0.1 higher (0.76 vs 0.65).
I cannot wrap my head around why this could be the case. There are 10 categories, and one is called "miscellaneous". This one in particular gets a much lower performance with tf-idf.
Does anyone know when tf-idf can perform worse than raw counts?
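A minimal sketch of the comparison I am running (the texts, labels, and classifier here are placeholders, not my real data; only the vectorizer changes between the two runs):

```python
# Hypothetical setup: same multilabel classifier, swapping only the vectorizer.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the real ~7000-item corpus.
texts = ["cheap flights to rome", "train schedule berlin",
         "hotel booking paris", "flight and hotel deal rome",
         "bus and train tickets berlin", "paris hotel near louvre"] * 5
labels = [{"flight"}, {"train"}, {"hotel"}, {"flight", "hotel"},
          {"train"}, {"hotel"}] * 5

# Binarize the label sets into a 0/1 indicator matrix for multilabel training.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

for vec in (CountVectorizer(), TfidfVectorizer()):
    clf = make_pipeline(vec, OneVsRestClassifier(LogisticRegression()))
    clf.fit(texts, Y)
    score = f1_score(Y, clf.predict(texts), average="micro")
    print(type(vec).__name__, round(score, 3))
```

(In the real experiment I of course evaluate on held-out data, not the training set as in this toy sketch.)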
Upvotes: 0
Views: 1640
Reputation: 800
There is no reason why IDF should add information for a classification task. It works well for search and ranking, but classification needs to capture similarity within a class, not the singularity of individual documents.
IDF is meant to spot what makes one sample singular versus the rest of the corpus; what you are looking for is what distinguishes one cluster from the other clusters. IDF flattens the intra-cluster TF similarity.
Upvotes: 1
Reputation: 800
The question is, why not? They are simply two different solutions.
What is your dataset, how many words does it contain, how are the items labelled, and how do you extract your features? CountVectorizer simply counts the words; if that does a good job, so be it.
Upvotes: 1