Reputation: 521
So I have the following table, where each row is a document and each column is a word together with its number of occurrences in that document.
|doc|apple|banana|cat|
|---|---|---|---|
|1|2|0|0|
|2|0|0|2|
|3|0|2|0|
Is there any method to convert this count-vectorized table into a tf-idf representation?
Edit: here is my solution. Let me know if it is correct.
```python
import numpy as np

def get_tfidf(df_tfidf):
    total_docs = df_tfidf.shape[0]

    # Term frequency:
    # (number of times term w appears in a document) /
    # (total number of terms in the document)
    total_words_doc = df_tfidf.sum(axis=1)
    tf = df_tfidf.values / total_words_doc.values[:, None]

    # Inverse document frequency:
    # log_e(total number of documents / number of documents containing term w)
    docs_with_term = df_tfidf.astype(bool).sum(axis=0)
    idf = np.log(total_docs / docs_with_term)

    return tf * idf.values[None, :]
```
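As a quick sanity check, the same formulas can be applied by hand to the 3×3 table from the question. This is a minimal sketch assuming the common TF convention of term count divided by the total term count of the document:

```python
import numpy as np
import pandas as pd

# Count table from the question (doc id as the index)
df = pd.DataFrame(
    {"apple": [2, 0, 0], "banana": [0, 0, 2], "cat": [0, 2, 0]},
    index=[1, 2, 3],
)

total_docs = df.shape[0]                            # 3 documents
tf = df.values / df.values.sum(axis=1)[:, None]     # count / total terms in doc
docs_with_term = (df.values > 0).sum(axis=0)        # document frequency per term
idf = np.log(total_docs / docs_with_term)           # plain (unsmoothed) idf
tf_idf = tf * idf[None, :]

print(np.round(tf_idf, 4))
```

Each word occurs in exactly one of the three documents and is the only word in it, so every nonzero entry equals 1 × ln(3) ≈ 1.0986.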
Upvotes: 0
Views: 1351
Reputation: 11
Use TfidfTransformer. Using TfidfVectorizer on counts can give undesired results. To see the difference, let's change the term-frequency matrix you provided:
|doc|apple|banana|cat|
|---|---|---|---|
|1|2|5|0|
|2|0|0|3|
|3|0|6|4|
Using TfidfTransformer
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.DataFrame({'apple': [2, 0, 0], 'banana': [5, 0, 6], 'cat': [0, 3, 4]},
                  index=pd.Index([1, 2, 3], name='doc'))

transformer = TfidfTransformer()
tfidf_trans = transformer.fit_transform(df)
tfidf_trans_df = pd.DataFrame(tfidf_trans.toarray(), index=df.index, columns=df.columns)
print(tfidf_trans_df)
```
Output:
|doc|apple|banana|cat|
|---|---|---|---|
|1|0.465494|0.885051|0.0|
|2|0.0|0.0|1.0|
|3|0.0|0.832050|0.5547|
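These values can be reproduced with plain numpy, assuming scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), where idf = ln((1 + n) / (1 + df)) + 1 and each row is L2-normalized:

```python
import numpy as np

counts = np.array([[2., 5., 0.],
                   [0., 0., 3.],
                   [0., 6., 4.]])
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # [1, 2, 2]
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed idf (sklearn default)
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(np.round(tfidf, 6))
```

The result matches the TfidfTransformer output above row for row.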
Using TfidfVectorizer
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_vect = vectorizer.fit_transform(df)
tfidf_vect_df = pd.DataFrame(tfidf_vect.toarray(), index=df.index, columns=df.columns)
print(tfidf_vect_df)
```
Output:
|doc|apple|banana|cat|
|---|---|---|---|
|1|1.0|0.0|0.0|
|2|0.0|1.0|0.0|
|3|0.0|0.0|1.0|
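The identity matrix appears because TfidfVectorizer expects an iterable of raw text documents, and iterating over a DataFrame yields its column labels, not its rows. A short demonstration:

```python
import pandas as pd

df = pd.DataFrame({"apple": [2, 0, 0], "banana": [5, 0, 6], "cat": [0, 3, 4]},
                  index=[1, 2, 3])

# Iterating a DataFrame yields its column labels, not its rows
print(list(df))  # ['apple', 'banana', 'cat']
```

So TfidfVectorizer sees three "documents", each consisting of a single distinct word, which is why every row of its output is a one-hot vector.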
Please refer to https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting for a detailed explanation.
Upvotes: 1
Reputation: 2897
Suppose you have a count-vectorized table as a pandas.DataFrame like this:
```python
import pandas as pd

data = [[1, 2, 0, 0], [2, 0, 0, 2], [3, 0, 2, 0]]
df = pd.DataFrame(data, columns=['doc', 'apple', 'banana', 'cat'])
print(df)
```
Output:
doc apple banana cat
0 1 2 0 0
1 2 0 0 2
2 3 0 2 0
Then you can use sklearn.feature_extraction.text.TfidfVectorizer
to get the tf-idf vector like this:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df)
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(df1)
```
Output:
apple banana cat doc
0 0.0 0.0 0.0 1.0
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0
Upvotes: 0