variable

Reputation: 9736

What is the difference between CountVectorizer+TfidfTransformer and TfidfVectorizer

To apply an ML algorithm to text, the text has to be represented numerically. Some ways to do this with sklearn are:

  1. CountVectorizer

  2. CountVectorizer + TfidfTransformer

  3. TfidfVectorizer

What is the difference between CountVectorizer+TfidfTransformer and TfidfVectorizer?

Upvotes: 1

Views: 2480

Answers (3)

Chris Coffee

Reputation: 89

The following code demonstrates what the documentation quoted by mbatchkarov means by "followed by": you multiply the word-count matrix by the IDF weights and then L2-normalize each row.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

corpus = ['apple banana orange onion corn',
          'banana banana orange pineapple coffee',
          'orange lemon lime orange',
          'lime vodka gin orange apple apple',
          'potato potato tomato pineapple',
          'coffee']

tf = CountVectorizer()
idf = TfidfTransformer()

# Step 1: raw term counts; step 2: fit the IDF weights on those counts
tf_ft = tf.fit_transform(corpus)
idf.fit(tf_ft)

# Recover the vocabulary in column order (vocabulary_ maps term -> column index)
vocab = [ti[0] for ti in sorted(tf.vocabulary_.items(),
                                key=lambda x: x[1])]

# Multiply the counts by the IDF weights, then L2-normalize each row
tf = pd.DataFrame(tf_ft.toarray(), columns=vocab)
idf = pd.Series(idf.idf_, index=vocab)
tfidf_manual = tf * idf
tfidf_manual /= np.sqrt(np.sum(np.square(tfidf_manual.values),
                               axis=1,
                               keepdims=True))

# TfidfVectorizer does all of the above in a single call
tfidf_function = pd.DataFrame(TfidfVectorizer()
                              .fit_transform(corpus)
                              .toarray(),
                              columns=vocab)

assert np.allclose(tfidf_manual, tfidf_function)

tfidf_manual

Upvotes: 0

Naveen Kumar

Reputation: 424

With TfidfTransformer you systematically compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the Tf-idf scores.

With TfidfVectorizer, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, the IDF values, and the Tf-idf scores, all on the same dataset.

So now you may be wondering why you should use more steps than necessary when you can get everything done in one. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:

  • If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
  • If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
  • If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
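The first bullet is the interesting one: in the two-step pipeline you get the intermediate count matrix as a reusable object. A minimal sketch, on a made-up two-document corpus, of keeping the counts around for a separate task:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus for illustration
docs = ["the cat sat", "the dog sat on the mat"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)                   # step 1: raw term counts

# The intermediate count matrix can be reused for other tasks,
# e.g. corpus-wide term totals
total_counts = counts.sum(axis=0)

tfidf = TfidfTransformer().fit_transform(counts)  # step 2: tf-idf weighting
```

With TfidfVectorizer alone you would only get the final `tfidf` matrix and would have to vectorize a second time to recover the counts.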

Reference: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YHybLOhKhPY

Upvotes: 0

mbatchkarov

Reputation: 16109

None, see the top of the documentation page:

sklearn.feature_extraction.text.TfidfVectorizer
...
Equivalent to CountVectorizer followed by TfidfTransformer. 
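The equivalence is easy to verify directly; a minimal sketch on a made-up corpus, using default parameters for both pipelines:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

# Hypothetical corpus for illustration
docs = ["apple banana", "banana orange orange"]

# Two-step pipeline: counts, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs))

# One-step pipeline
one_step = TfidfVectorizer().fit_transform(docs)

assert np.allclose(two_step.toarray(), one_step.toarray())
```

The outputs diverge only if you pass different parameters to the two halves (e.g. a custom tokenizer to CountVectorizer but not to TfidfVectorizer).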

Upvotes: 1
