Olivier_s_j

Reputation: 5202

Is a CountVectorizer the same as a TfidfVectorizer with use_idf=False?

As the title states: is a CountVectorizer the same as a TfidfVectorizer with use_idf=False? If not, why not?

Does this also mean that adding the TfidfTransformer here is redundant?

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)

Upvotes: 13

Views: 16867

Answers (2)

Rolf H Nelson

Reputation: 61

As larsmans said, TfidfVectorizer(use_idf=False, norm=None, ...) is supposed to behave the same as CountVectorizer.

In the current version (0.14.1), TfidfVectorizer has a bug: passing binary=True silently leaves binary=False, which can throw you off during a grid search for the best parameters. (CountVectorizer, in contrast, sets the binary flag correctly.) This appears to be fixed in post-0.14.1 versions.

Upvotes: 1

Fred Foo

Reputation: 363817

No, they're not the same. TfidfVectorizer normalizes its results, i.e. each row of its output has Euclidean norm 1:

>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).A
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).A
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])

This is done so that dot products on the rows are cosine similarities. TfidfVectorizer can also use logarithmically discounted frequencies when given the option sublinear_tf=True.
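A quick sketch of that property: because the rows are unit-norm, a plain dot product between two rows already equals their cosine similarity, with no extra division needed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]
X = TfidfVectorizer(use_idf=False).fit_transform(docs).toarray()

# Each row has norm 1, so the dot product of two rows
# is already their cosine similarity.
dot = X[0] @ X[1]
cos = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(dot, cos)
```

Here the two documents share two of their three terms, so both values come out to 2/3.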

To make TfidfVectorizer behave as CountVectorizer, give it the constructor options use_idf=False, norm=None.
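A small check of that equivalence (note the scikit-learn keyword is norm=None; the one remaining difference is that TfidfVectorizer returns floats where CountVectorizer returns integer counts):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]

counts = CountVectorizer().fit_transform(docs).toarray()
tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()

# Same values, different dtype (int vs. float).
print(np.array_equal(counts, tf))
```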

Upvotes: 33
