Reputation: 1289
I want to calculate tf-idf for the documents below. I'm using Python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence', 'This is the second sentence', 'This is the third sentence']})
First, I thought I would need to get word_count for each row. So I wrote a simple function:
def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt
And then, I applied it to each row.
df['word_count'] = df['sent'].apply(word_count)
But now I'm lost. I know there's an easy method to calculate tf-idf if I use GraphLab, but I want to stick with an open-source option. Both scikit-learn and gensim look overwhelming. What's the simplest solution to get tf-idf?
Upvotes: 47
Views: 78227
Reputation: 2108
Two straightforward solutions using TfidfVectorizer from sklearn.
a) If your corpus is a pandas.Series:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
_X = vectorizer.fit_transform(corpus)
# use get_feature_names_out() for the column labels: vectorizer.vocabulary_ is a
# term -> column-index dict, so iterating its keys does not match the column order
X = pd.DataFrame(_X.todense(), index=corpus.index, columns=vectorizer.get_feature_names_out())
X.head()
b) If your corpus is a list:
vectorizer = TfidfVectorizer()
_X = vectorizer.fit_transform(corpus)
# same column-order caveat as above
X = pd.DataFrame(_X.todense(), columns=vectorizer.get_feature_names_out())
X.head()
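A quick usage sketch against the question's df (a minimal example; get_feature_names_out assumes scikit-learn 1.0+):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence', 'This is the second sentence', 'This is the third sentence']})

vectorizer = TfidfVectorizer()
X = pd.DataFrame(vectorizer.fit_transform(df['sent']).todense(),
                 index=df.index,
                 columns=vectorizer.get_feature_names_out())
print(X.head())  # one row per document, one column per vocabulary term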
Upvotes: 0
Reputation: 19
I think Christian Perone's example is the most straightforward example of how to use CountVectorizer and TfidfTransformer. This is directly from his webpage (note that it was written for Python 2 and an older scikit-learn, so it won't run as-is today). But I am also benefiting from the answers here.
https://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
#[[0 1 1 1]
#[0 2 1 0]]
Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is responsible for calculating the tf-idf weights for our term frequency matrix:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]
Note that I’ve specified the norm as L2. This is optional (the default is already the L2 norm), but I’ve added the parameter to make it explicit that it’s going to use the L2 norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let’s transform the freq_term_matrix to the tf-idf weight matrix:
--- I had to make the following changes for Python 3. Note also that .vocabulary_ now includes the word "the"; I have not found or built a solution for that... yet (though see the sketch after the code below). ---
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.", "We can see the shining sun, the bright sun."]

count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print("Vocabulary:")
print(count_vectorizer.vocabulary_)
# unlike the blog's old scikit-learn output, vocabulary_ here also contains
# 'the' and 'is'; use get_feature_names_out() for terms in column order
# (list(vocabulary_) gives dict-key order, which does not match the matrix columns)
Vocab = list(count_vectorizer.get_feature_names_out())
print(Vocab)

freq_term_matrix = count_vectorizer.transform(test_set)
print(freq_term_matrix.todense())
count_array = freq_term_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=Vocab)
print(df)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print("IDF:")
print(tfidf.idf_)
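To complete the walkthrough, the transform step turns the count matrix into tf-idf weights, and passing stop_words='english' to CountVectorizer is one likely fix for "the" showing up in the vocabulary (a minimal sketch; the exact output will differ from the blog's old scikit-learn version):

# transform the term-frequency matrix into the tf-idf weight matrix
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())

# one way to keep "the" (and other English stop words) out of the vocabulary
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer.fit_transform(train_set)
print(count_vectorizer.vocabulary_)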
Upvotes: 0
Reputation: 337
A simple solution is to use texthero:
import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])
In [5]: df.head()
Out[5]:
docId sent tfidf
0 1 This is the first sentence [0.3816141458138271, 0.6461289150464732, 0.381...
1 2 This is the second sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
2 3 This is the third sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
Upvotes: 6
Reputation: 72
I found a slightly different method using CountVectorizer from sklearn. (References: an "Ultraviolet Analysis" word-frequency example for CountVectorizer, and Usman Malik's tweet-scraping tutorial for preprocessing/cleaning text.) I won't be covering preprocessing in this answer. Basically, what you want to do is import CountVectorizer and fit your data to the CountVectorizer object, which will let you access the .vocabulary_.items() attribute, which gives you the vocabulary of your dataset (the unique words present and their indices, given any limiting parameters you pass into CountVectorizer, like max feature number, etc.).
Then you're going to use TfidfTransformer to generate tf-idf weights for the terms in a similar manner.
I am coding in a Jupyter notebook using pandas and the PyCharm IDE.
Here is a code snippet:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features=5000, stop_words='english', min_df=.01, max_df=.90)
#%%
# use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
# raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
# useful debug: get an idea of the item list you generated
list(countVec.vocabulary_.items())
#%%
# convert to bag of words; transform() returns a sparse matrix of token counts
countVec_count = countVec.transform(tweetsFrameWords["Text"])
#%%
# make an array from the number of occurrences of each term across all documents
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
# make a new data frame with columns term and occurrences, meaning word and number of occurrences
# (get_feature_names_out() is the scikit-learn 1.0+ spelling; older versions use get_feature_names())
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)
# sort by number of word occurrences, most -> least; without ascending=False it defaults to ascending
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)
#%%
# now convert to a more useful ranking system: tf-idf weights
# TfidfTransformer scales raw word counts to a weighted ranking
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()
# fit and transform the counts using the transformer object
tweetWeights = tweetTransformer.fit_transform(countVec_count)
# follow a similar process to the occurrences data frame, but with mean term weights
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
# now that we have tf-idf weights, make a dataframe with weights and terms
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
Upvotes: 0
Reputation: 2399
The scikit-learn implementation is really easy:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify. See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
The output of fit_transform is a sparse matrix; if you want to visualize it, you can call x.toarray():
In [44]: x.toarray()
Out[44]:
array([[ 0.64612892, 0.38161415, 0. , 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0.64612892, 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0. , 0.38161415, 0.38161415,
0.64612892, 0.38161415]])
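If you want the columns labeled with their terms, a small follow-up sketch (get_feature_names_out assumes scikit-learn 1.0+):

import pandas as pd

# label each column of the dense tf-idf matrix with its vocabulary term
print(pd.DataFrame(x.toarray(), columns=v.get_feature_names_out()))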
Upvotes: 65