Learner

Reputation: 691

Calculate Tf-Idf Scores in pandas?

I want to calculate tf and idf separately from the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

I want to calculate these using the Tf-Idf formulas directly, not the sklearn library.

After tokenization, I used this for the TF calculation:

tf = df.sent.apply(pd.value_counts).fillna(0) 

but this gives me counts, whereas I want the ratio (count / total number of words in the document).

For Idf I tried: df[df['sent'] > 0] / (1 + len(df['sent']))

but it doesn't seem to work. I want both Tf and Idf as a pandas Series.
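To be concrete, this is the kind of ratio I mean for TF (a sketch with plain whitespace tokenization, not code I have verified for my real data):

```python
import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# value_counts(normalize=True) returns count / total tokens per document
tf = df.sent.apply(lambda s: pd.Series(s.split()).value_counts(normalize=True)).fillna(0)
```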

Edit

For tokenization I used df['sent'] = df['sent'].apply(word_tokenize). I got idf scores with:

tfidf = TfidfVectorizer()
feature_array = tfidf.fit_transform(df['sent'])
d=(dict(zip(tfidf.get_feature_names(), tfidf.idf_)))

How can I get the tf scores separately?

Upvotes: 2

Views: 5218

Answers (3)

AliceG

Reputation: 21

I think I had the same issue as you.

I wanted to use TfidfVectorizer, but its default tf-idf definition is not the standard one (it effectively computes tf-idf = tf + tf*idf instead of the usual tf-idf = tf*idf).
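If I read the sklearn docs right, this comes from the "+1" in its default idf: with smooth_idf=True it uses idf = ln((1 + n) / (1 + df)) + 1, and that +1 contributes a plain tf term to the product. A small check (assuming a sklearn version that exposes the idf_ attribute):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

# Default settings: smooth_idf=True, so idf = ln((1 + n) / (1 + df)) + 1
vec = TfidfVectorizer()
vec.fit(docs)

n = len(docs)
df_first = 1  # 'first' appears in exactly one document
manual_idf = np.log((1 + n) / (1 + df_first)) + 1

sklearn_idf = vec.idf_[vec.vocabulary_['first']]
```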

For TF: the term "frequency" here generally just means the raw count, which you can get with CountVectorizer() from sklearn. Log-transform and normalize it afterwards if needed.

For me, the option using plain numpy was much slower in processing time (> 50 times).

Upvotes: 0

T. Ray

Reputation: 641

You'll need to do a little more work to compute this.

import numpy as np
import pandas as pd

df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence', 
                        'This is the second sentence',
                        'This is the third sentence']})

# Tokenize and generate count vectors
word_vec = df.sent.apply(lambda s: pd.Series(s.split()).value_counts()).fillna(0)

# Compute term frequencies
tf = word_vec.divide(np.sum(word_vec, axis=1), axis=0)

# Compute inverse document frequencies
idf = np.log10(len(tf) / word_vec[word_vec > 0].count()) 

# Compute TF-IDF vectors
tfidf = np.multiply(tf, idf.to_frame().T)

print(tfidf)

    is  the     first  This  sentence    second     third
0  0.0  0.0  0.095424   0.0       0.0  0.000000  0.000000
1  0.0  0.0  0.000000   0.0       0.0  0.095424  0.000000
2  0.0  0.0  0.000000   0.0       0.0  0.000000  0.095424

Depending on your situation, you may want to normalize:

# L2 (Euclidean) norm: square root of the sum of squares of each row
l2_norm = np.sqrt(np.sum(np.square(tfidf), axis=1))

# Normalized TF-IDF vectors
tfidf_norm = tfidf.divide(l2_norm, axis=0)

print(tfidf_norm)

    is  the  first  This  sentence  second  third
0  0.0  0.0    1.0   0.0       0.0     0.0    0.0
1  0.0  0.0    0.0   0.0       0.0     1.0    0.0
2  0.0  0.0    0.0   0.0       0.0     0.0    1.0

(In this toy example each document has only one distinguishing word, so every normalized vector ends up with a single 1.0.)

Upvotes: 3

Ezer K

Reputation: 3739

Here is my solution:

first tokenize, for convenience as a separate column:

df['tokens'] = [x.lower().split() for x in df.sent.values] 

then TF as you did, but with the normalize parameter (you need a lambda because value_counts is applied to each row's token list):

tf = df.tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)

then IDF (one per word in vocabulary):

import numpy as np

idf = pd.Series([np.log10(float(df.shape[0]) / len([x for x in df.tokens.values if token in x]))
                 for token in tf.columns])
idf.index = tf.columns

then if you want TFIDF:

tfidf = tf.copy()
for col in tfidf.columns:
    tfidf[col] = tfidf[col]*idf[col]
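(the column loop is equivalent to a single broadcasted multiply, e.g. tf.mul(idf, axis=1) — a self-contained sketch:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})
df['tokens'] = [x.lower().split() for x in df.sent.values]

tf = df.tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)
idf = pd.Series([np.log10(float(df.shape[0]) / len([x for x in df.tokens.values if token in x]))
                 for token in tf.columns])
idf.index = tf.columns

# One broadcasted multiply instead of looping over columns
tfidf = tf.mul(idf, axis=1)
```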

Upvotes: 1
