Reputation: 5202
As the title states: is a CountVectorizer the same as a TfidfVectorizer with use_idf=False? If not, why not?
Does this also mean that adding the TfidfTransformer here is redundant?
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Raw term counts from the corpus
vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)
# With use_idf=False, this step only L2-normalises the counts
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)
Upvotes: 13
Views: 16867
Reputation: 61
As larsmans said, TfidfVectorizer(use_idf=False, norm=None, ...) is supposed to behave the same as CountVectorizer.
In the current version (0.14.1), there's a bug where TfidfVectorizer(binary=True, ...) silently leaves binary=False, which can throw you off during a grid search for the best parameters. (CountVectorizer, in contrast, sets the binary flag correctly.) This appears to be fixed in later (post-0.14.1) releases.
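If you want to check whether the flag is actually honoured in your installed version, one quick sanity check (the toy document below is just an illustrative assumption) is to look at the maximum value produced for a document with a repeated token:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo foo bar"]  # "foo" repeated, so an honoured binary flag caps it at 1
print(CountVectorizer(binary=True).fit_transform(docs).max())   # 1
tv = TfidfVectorizer(binary=True, use_idf=False, norm=None)
print(tv.fit_transform(docs).max())  # 1.0 if binary is applied, 2.0 if it is silently dropped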
Upvotes: 1
Reputation: 363817
No, they're not the same. TfidfVectorizer normalizes its results, i.e. each vector in its output has norm 1:
>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).A
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).A
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])
This is done so that dot products on the rows are cosine similarities. TfidfVectorizer can also use logarithmically discounted frequencies when given the option sublinear_tf=True.
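As a small illustrative check (the two documents are the same toy corpus as above), the dot product of two normalised rows matches the cosine similarity computed from the raw count vectors; sublinear_tf=True would additionally replace each term frequency tf with 1 + log(tf):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["foo bar baz", "foo bar quux"]
tf = TfidfVectorizer(use_idf=False).fit_transform(docs).A   # L2-normalised term frequencies
counts = CountVectorizer().fit_transform(docs)              # raw counts

print(np.dot(tf[0], tf[1]))             # ~0.6667
print(cosine_similarity(counts)[0, 1])  # same value, from the raw counts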
To make TfidfVectorizer behave like CountVectorizer, give it the constructor options use_idf=False, norm=None.
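A minimal sketch (reusing the toy documents above) to confirm that the values then line up with CountVectorizer, the only remaining difference being that TfidfVectorizer returns floats:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]
counts = CountVectorizer().fit_transform(docs).A
tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).A
print(np.array_equal(counts, tf))  # True: same values, just a float dtype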
Upvotes: 33