Peter

Reputation: 355

Python tfidf returning same values regardless of idf

I am trying to build a small program that calculates TF-IDF in Python. There are two very nice tutorials which I have used (the main code comes from here, and another function from Kaggle).

import nltk
import string
import os
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords # Import the stop word list
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = 'my/path'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def review_to_words(raw_review):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one string separated by spaces,
    #    and return the result.
    return " ".join(meaningful_words)



for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        with open(file_path, 'r') as shakes:
            text = shakes.read()
        token_dict[file] = review_to_words(text)

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())


test_string = 'this sentence has unseen text such as computer but also king  lord lord  this this and that lord juliet'  # test string
response = tfidf.transform([test_string])

feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

The code seems to work fine, but then I look at the results:

thi  -  0.612372435696
text  -  0.204124145232
sentenc  -  0.204124145232
lord  -  0.612372435696
king  -  0.204124145232
juliet  -  0.204124145232
ha  -  0.204124145232
comput  -  0.204124145232

The IDFs seem to be the same for all the words, because the TF-IDF values are all just n * 0.204. I have checked tfidf.idf_ and this does seem to be the case.
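This is roughly how I checked (a sketch of the check, run against the fitted vectorizer above):

# Print each feature together with its learned IDF weight
for term, idf in zip(tfidf.get_feature_names(), tfidf.idf_):
    print(term, '-', idf)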

Is there something in the method that I have not implemented correctly? Do you know why the idf_ values are all the same?

Upvotes: 0

Views: 1303

Answers (2)

satojkovic

Reputation: 699

The inverse document frequency of a term t is calculated as follows.

idf(t) = log(N / df_t)

N is the total number of documents and df_t is the number of documents where the term t appears.

In this case, your program has only one document (the test_string variable). Therefore, both N and df_t equal 1. As a result, the IDF is the same for every term.
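To see this concretely, here is a minimal sketch using scikit-learn's TfidfVectorizer (note that scikit-learn uses a smoothed variant of the formula above, but the effect is the same):

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a corpus that contains only one document
single_doc = ['this sentence has unseen text such as computer']
vec = TfidfVectorizer()
vec.fit(single_doc)

# Every term occurs in the one and only document, so N = df_t = 1
# and every IDF weight comes out identical (1.0 with the smoothed formula).
print(vec.idf_)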

Upvotes: 1

Rabbit

Reputation: 866

Since you provided a list containing one document, all terms' IDFs will have an equal 'binary frequency'.

IDF is the inverse of the term frequency over the set of documents (or just inverse document frequency). Most, if not all, IDF formulas check only for a term's presence in a document, so it does not matter how many times it appears per document.

Try feeding in a list with 3 distinct documents, for instance; that way the IDFs will not all be the same.
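For example (a minimal sketch with three made-up documents):

from sklearn.feature_extraction.text import TfidfVectorizer

# Three distinct documents, so document frequencies can differ
docs = [
    'the king and the lord',
    'the lord is mighty',
    'juliet loves romeo',
]
vec = TfidfVectorizer()
vec.fit(docs)

# 'lord' appears in 2 of 3 documents, 'juliet' in only 1,
# so their IDF weights now differ.
for term, idf in zip(vec.get_feature_names(), vec.idf_):
    print(term, '-', idf)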

Upvotes: 1
