Milan

Reputation: 21

How to get TF-IDF value of a word from all set of documents?

I need the TF-IDF value for a word that appears in a number of documents, not only in a single, specific document.

For example, consider this corpus:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]

I want to get the TF-IDF value for the word 'first', which is in documents 1 and 4. The TF-IDF value is calculated per document, so in this case I get two scores, one for each individual document. However, I need a single score for the word 'first' considering all documents at the same time.

Is there any way I can get the TF-IDF score of a word across the whole set of documents? Is there any other method or technique that can help me solve this problem?

Upvotes: 2

Views: 5358

Answers (3)

Soumya

Reputation: 431

Thanks @maaniB for the answer.

@Milan - Maybe you can try the method/code below to get a single TF-IDF value per word.

The easier way would be to get the feature names, take the column-wise sum of the sparse array, and create a DataFrame out of it.

Code as follows:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?']


df = pd.DataFrame({"texts": mylist})
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])


# use get_feature_names() on scikit-learn < 1.0
word_lst = tfidf_vectorizer.get_feature_names_out()
# sum each word's tf-idf scores over all documents
count_lst = tfidf_separate.toarray().sum(axis=0)

vocab_df = pd.DataFrame(zip(word_lst, count_lst),
                        columns=["vocab", "tfidf_value"])

print(vocab_df)

       vocab  tfidf_value
0        and     0.521203
1       blue     0.407798
2        cow     0.407798
3   document     1.761324
4      first     1.209230
5         is     1.620686
6         it     0.407798
7        one     0.521203
8     second     0.785317
9        the     1.426368
10     third     0.521203
11      this     1.426368
12       why     0.407798

Hope it helps !!

Upvotes: 2

ygorg

Reputation: 770

tl;dr

Tf-Idf is not made to weight words. You cannot compute the Tf-Idf of a word. You can compute the frequency of a word in a corpus.

What is TfIdf

Tf-Idf computes a score for a word with respect to a document! It gives high scores to words that are frequent in (TF) and particular to (IDF) a document. TF-IDF's goal is to compute similarity between documents, not to weight words.

The solution given by maaniB is essentially just the normalized frequency of words. Depending on what you need to accomplish, you should find another metric to weigh words (plain frequency is generally a good start).

We can see that Tf-Idf gives a better score to 'cow' in doc 5 because 'cow' is particular to this document, but this distinction is lost in maaniB's solution.
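As a rough sketch of the frequency baseline suggested above (plain corpus-level frequency rather than Tf-Idf), using only the standard library on a lightly normalized version of the question's corpus:

```python
from collections import Counter

# question's corpus, lowercased with punctuation stripped for simplicity
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
    'is this the second cow why is it blue',
]

# count every token across the whole corpus at once
counts = Counter(word for doc in corpus for word in doc.split())
total = sum(counts.values())

# one corpus-level weight per word: its normalized frequency
freqs = {word: n / total for word, n in counts.items()}
print(counts['first'])  # 'first' occurs twice in the corpus
print(freqs['first'])
```

This gives exactly one number per word for the whole corpus, which is what the question asks for, just without the document-specificity that Idf provides.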

Example

For example, we will compare the Tf-Idf of 'cow' and 'is'. The TF-IDF formula (without logs) is Tf * N / Df, where N is the number of documents, Tf the frequency of the word in a document, and Df the number of documents in which the word appears.

'is' appears in every document, so its Df is 5. It appears once in documents 1, 2, 3 and 4 (Tf = 1) and twice in doc 5 (Tf = 2). So the TF-IDF of 'is' in docs 1, 2, 3 and 4 is 1 * 5 / 5 = 1, and in doc 5 it is 2 * 5 / 5 = 2.

'cow' appears only in the 5th document, so its Df is 1. It appears once in document 5, so its Tf is 1. So the TF-IDF of 'cow' in doc 5 is 1 * 5 / 1 = 5, and in every other doc it is 0 * 5 / 1 = 0.

In conclusion, 'is' is very frequent in doc 5 (it appears twice) but not particular to doc 5 (it appears in every document), so its Tf-Idf is lower than that of 'cow', which appears only once, but in only one document!
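The worked example above can be sketched in a few lines of plain Python. This implements the simplified log-free formula Tf * N / Df from this answer (not scikit-learn's smoothed, normalized variant), on a lightly normalized copy of the question's corpus:

```python
# question's corpus, lowercased with punctuation stripped for simplicity
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
    'is this the second cow why is it blue',
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # number of documents

def tf_idf(word, doc_index):
    tf = docs[doc_index].count(word)       # occurrences in that document
    df = sum(word in doc for doc in docs)  # documents containing the word
    return tf * N / df

print(tf_idf('is', 0))   # 1 * 5 / 5 = 1.0
print(tf_idf('is', 4))   # 2 * 5 / 5 = 2.0
print(tf_idf('cow', 4))  # 1 * 5 / 1 = 5.0
print(tf_idf('cow', 0))  # 0 * 5 / 1 = 0.0
```

Note that the score always depends on both a word and a document, which is why there is no single "Tf-Idf of a word".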

Upvotes: 1

maaniB

Reputation: 605

I think you can join your documents and recalculate the TF-IDF score.

I think your current implementation is:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

mylist = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]
df = pd.DataFrame({"texts": mylist})
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
tfidf_separate = tfidf_vectorizer.fit_transform(df["texts"])

df_tfidf = pd.DataFrame(
    tfidf_separate.toarray(),
    # use get_feature_names() on scikit-learn < 1.0
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=df.index,
)
df_tfidf
        and      blue       cow  document     first        is        it       one    second       the     third      this       why
0  0.000000  0.000000  0.000000  0.501885  0.604615  0.357096  0.000000  0.000000  0.000000  0.357096  0.000000  0.357096  0.000000
1  0.000000  0.000000  0.000000  0.757554  0.000000  0.269503  0.000000  0.000000  0.456308  0.269503  0.000000  0.269503  0.000000
2  0.521203  0.000000  0.000000  0.000000  0.000000  0.248356  0.000000  0.521203  0.000000  0.248356  0.521203  0.248356  0.000000
3  0.000000  0.000000  0.000000  0.501885  0.604615  0.357096  0.000000  0.000000  0.000000  0.357096  0.000000  0.357096  0.000000
4  0.000000  0.407798  0.407798  0.000000  0.000000  0.388636  0.407798  0.000000  0.329009  0.194318  0.000000  0.194318  0.407798

If you join your documents:

total = [' '.join(mylist)]
df2 = pd.DataFrame({"texts": total})
tfidf_total = tfidf_vectorizer.fit_transform(df2["texts"])
df_tfidf2 = pd.DataFrame(
    tfidf_total.toarray(),
    # use get_feature_names() on scikit-learn < 1.0
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=df2.index,
)
df_tfidf2
       and     blue      cow  document   first      is       it      one  second      the    third     this      why
0  0.09245  0.09245  0.09245    0.3698  0.1849  0.5547  0.09245  0.09245  0.1849  0.46225  0.09245  0.46225  0.09245

Upvotes: 1
