Reputation: 9736
To apply an ML algorithm to text, the text has to be represented numerically. Some ways to do this using sklearn are:
CountVectorizer
CountVectorizer + TfidfTransformer
TfidfVectorizer
What is the difference between CountVectorizer+TfidfTransformer and TfidfVectorizer?
Upvotes: 1
Views: 2480
Reputation: 89
The following code demonstrates what the documentation, per mbatchkarov, means by "follow up": you multiply the term counts from CountVectorizer by the IDF weights from TfidfTransformer and then L2-normalize each row.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

corpus = ['apple banana orange onion corn',
          'banana banana orange pineapple coffee',
          'orange lemon lime orange',
          'lime vodka gin orange apple apple',
          'potato potato tomato pineapple',
          'coffee']

# Step 1: raw term counts. Step 2: fit the IDF weights on those counts.
count_vec = CountVectorizer()
tfidf_tr = TfidfTransformer()
counts = count_vec.fit_transform(corpus)
tfidf_tr.fit(counts)

# Vocabulary terms ordered by their column index in the count matrix.
vocab = [term for term, _ in sorted(count_vec.vocabulary_.items(),
                                    key=lambda x: x[1])]
tf = pd.DataFrame(counts.toarray(), columns=vocab)
idf = pd.Series(tfidf_tr.idf_, index=vocab)

# Multiply counts by IDF column-wise, then L2-normalize each row.
tfidf_manual = tf * idf
tfidf_manual /= np.sqrt(np.sum(np.square(tfidf_manual.values),
                               axis=1,
                               keepdims=True))

# The same matrix computed in one step by TfidfVectorizer.
tfidf_function = pd.DataFrame(TfidfVectorizer()
                              .fit_transform(corpus)
                              .toarray(),
                              columns=vocab)
assert np.allclose(tfidf_manual, tfidf_function)
tfidf_manual
Upvotes: 0
Reputation: 424
With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.
With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.
So you may be wondering why you would use more steps than necessary if you can get everything done in one step. There are cases where you want TfidfTransformer rather than TfidfVectorizer, and they are not always obvious. The reference below gives a general guideline.
Reference: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YHybLOhKhPY
Upvotes: 0
Reputation: 16109
None, see the top of the documentation page:
sklearn.feature_extraction.text.TfidfVectorizer
...
Equivalent to CountVectorizer followed by TfidfTransformer.
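If you want to check that statement yourself, a small sketch like the following should do it, assuming default parameters on both sides. The two document lists are made up for illustration. It fits both variants on the same training documents and compares their output on unseen ones.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)
from sklearn.pipeline import make_pipeline

# Hypothetical documents, for illustration only.
train_docs = ['apple banana orange',
              'banana coffee',
              'orange lemon lime orange']
new_docs = ['banana lemon', 'coffee coffee apple']

# Two-step variant: counts first, then tf-idf weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
two_step.fit(train_docs)

# One-step variant.
one_step = TfidfVectorizer()
one_step.fit(train_docs)

a = two_step.transform(new_docs).toarray()
b = one_step.transform(new_docs).toarray()
same = np.allclose(a, b)
print(same)
```

Both variants learn the vocabulary and IDF weights from the training documents only, so the comparison also covers documents outside the training set.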
Upvotes: 1