Reputation: 355
I am trying to build a small program that calculates TF-IDF in Python. I have used two very nice tutorials (I have code from here and another function from Kaggle).
import nltk
import string
import os
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords  # Import the stop word list
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = 'my/path'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def review_to_words(raw_review):
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)

# Read every file under path; each file becomes one document
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        with open(file_path, 'r') as shakes:
            text = shakes.read()
        token_dict[file] = review_to_words(text)

# The IDF weights are learned here, from the documents passed to fit_transform
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

str = 'this sentence has unseen text such as computer but also king lord lord this this and that lord juliet'  # test string
response = tfidf.transform([str])
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = tfidf.get_feature_names_out()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
The code seems to work fine, but then I looked at the results:
thi - 0.612372435696
text - 0.204124145232
sentenc - 0.204124145232
lord - 0.612372435696
king - 0.204124145232
juliet - 0.204124145232
ha - 0.204124145232
comput - 0.204124145232
The IDFs seem to be the same for all the words, because the TF-IDF values are just n*0.204. I have checked tfidf.idf_ and this seems to be the case.
Is there something in the method that I have not implemented correctly? Do you know why the idf_ values are the same?
Upvotes: 0
Views: 1303
Reputation: 699
The inverse document frequency of a term t is calculated as follows:
idf_t = log(N / df_t)
where N is the total number of documents and df_t is the number of documents in which the term t appears.
In this case, your program has one document (the str variable). Therefore, both N and df_t equal 1, and as a result the IDF is the same for all terms.
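To make the point concrete, here is a minimal pure-Python sketch rather than scikit-learn's own code. It assumes the usual definition log(N / df_t), plus the smoothed variant log((1 + N) / (1 + df_t)) + 1 that scikit-learn documents as its default. The function names, the one-document corpus, and the term counts (read off the output in the question) are illustrative assumptions.

```python
import math

def plain_idf(term, documents):
    """Textbook IDF: log(N / df_t), for documents given as token lists."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n_docs / df)

def smoothed_idf(term, documents):
    """Smoothed IDF: log((1 + N) / (1 + df_t)) + 1."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Hypothetical corpus with a single document: every term has df_t = N = 1.
corpus = [["lord", "king", "lord", "juliet"]]
vocabulary = sorted(set(corpus[0]))
plain = {t: plain_idf(t, corpus) for t in vocabulary}
smooth = {t: smoothed_idf(t, corpus) for t in vocabulary}
print(plain)   # every value is 0.0
print(smooth)  # every value is 1.0

# With identical IDFs, TF-IDF reduces to L2-normalized term counts.
# Counts taken from the question's output: two terms occur 3 times, six occur once.
counts = [3, 3, 1, 1, 1, 1, 1, 1]
norm = math.sqrt(sum(c * c for c in counts))  # sqrt(24)
weights = [c / norm for c in counts]
print(round(weights[0], 3), round(weights[2], 3))  # 0.612 0.204
```

If the IDFs really are all equal, this would account for the n*0.204 pattern: each weight is just the term count divided by sqrt(24), assuming the vectorizer's default L2 normalization.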
Upvotes: 1
Reputation: 866
Since you provided a list containing one document, all terms' IDFs will have an equal 'binary frequency'.
IDF is the inverse of a term's frequency across the set of documents (that is, inverse document frequency). Most, if not all, IDF formulas only check for term presence in a document, so it does not matter how many times the term appears per document.
Try feeding a list with, for instance, three distinct documents; that way the IDFs will not all be the same.
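As a rough illustration of that suggestion, the sketch below counts document frequency by presence only and computes log(N / df_t) over three made-up documents. The documents and variable names are assumptions for the example, and the formula is the unsmoothed textbook one, not necessarily what a given library uses.

```python
import math

# Three hypothetical documents; "lord" is repeated within the first one.
documents = [
    "lord lord lord king",
    "king juliet",
    "juliet computer lord",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency counts presence only: set() collapses repeats in a document.
df = {}
for tokens in tokenized:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

idf = {term: math.log(n_docs / count) for term, count in df.items()}
for term in sorted(idf):
    print(term, df[term], round(idf[term], 3))
```

Here "lord" appears four times overall but in only two documents, so its document frequency is 2, while "computer" appears in a single document and gets a higher IDF. With more than one document the IDFs are no longer forced to be identical.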
Upvotes: 1