Reputation: 717
I am trying to replicate a TFIDF example from this video: Using TF-IDF to convert unstructured text to useful features
As far as I can tell, the code is the same as in the example, except that I use .items() (Python 3) instead of .iteritems() (Python 2):
docA = "the cat sat on my face"
docB = "the dog sat on my bed"
bowA = docA.split(" ")
bowB = docB.split(" ")
wordSet = set(bowA).union(set(bowB))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1

import pandas as pd
bag = pd.DataFrame([wordDictA, wordDictB])
print(bag)
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
def computeIDF(docList):
    import math
    N = len(docList)
    # Count the number of docs that contain word w
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

idfs = computeIDF([wordDictA, wordDictB])
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TF = pd.DataFrame([tfidfBowA, tfidfBowB])
print(TF)
The resulting table should look something like this, where the common words (on, my, sat, the) all have a score of 0:
bed cat dog face my on sat the
0 0.000000 0.115525 0.000000 0.115525 0.000000 0.000000 0.000000 0.000000
1 0.115525 0.000000 0.115525 0.000000 0.000000 0.000000 0.000000 0.000000
But instead my resulting dataframe looks like this, with all words having the same score, except those occurring in only one of the documents (bed/dog, cat/face):
bed cat dog face my on sat the
0 0.000000 0.020833 0.000000 0.020833 0.020833 0.020833 0.020833 0.020833
1 0.020833 0.000000 0.020833 0.000000 0.020833 0.020833 0.020833 0.020833
If I print(idfs), I get:
{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}
Here, the words that are included in both docs have the value 0, which should then weigh down their importance, as they are common to all docs. Before the computeTFIDF function is applied, the data (tfBowA) looks like this:
{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}
Since the function will multiply the two numbers, "my" (with an IDF of 0) should come out as 0, and "cat" (with an IDF of 0.6931) should come out as 0.6931 * 0.1666 = 0.115, as per the example. Instead, I get 0.020833 for every word that is present in the doc. Is something other than the items/iteritems syntax difference between Python 2 and 3 messing up my code?
Upvotes: 1
Views: 424
Reputation: 9081
In the second-to-last part, before casting to a DataFrame, change these two lines:

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

to:

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

To compute the TF-IDF, you have to call computeTFIDF() instead of computeTF().
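Incidentally, that also explains the mysterious 0.020833: computeTF divides every value of its first argument by the length of its second argument, so passing idfs (8 entries) as bow just divides each TF value of 0.1666 by 8 instead of multiplying it by an IDF weight. A quick check of the arithmetic:

```python
import math

tf_my = 1 / 6.0              # TF of "my" in a 6-word document

# What the typo computes: computeTF(tfBowA, idfs) divides each TF
# value by len(idfs) == 8 (the vocabulary size), not by an IDF weight.
wrong = tf_my / 8
print(round(wrong, 6))       # 0.020833 -- the value in the broken output

# What computeTFIDF computes: TF times IDF.
idf_cat = math.log(2 / 1.0)  # "cat" appears in 1 of the 2 docs
right = (1 / 6.0) * idf_cat
print(round(right, 6))       # 0.115525 -- the expected value
```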
Output
tfidfBowA
{'bed': 0.0,
'cat': 0.11552453009332421,
'dog': 0.0,
'face': 0.11552453009332421,
'my': 0.0,
'on': 0.0,
'sat': 0.0,
'the': 0.0}
tfidfBowB
{'bed': 0.11552453009332421,
'cat': 0.0,
'dog': 0.11552453009332421,
'face': 0.0,
'my': 0.0,
'on': 0.0,
'sat': 0.0,
'the': 0.0}
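For completeness, here is a self-contained sketch of the whole pipeline with that one fix applied (same logic as the question's code, just condensed with dict comprehensions; you can pass the two result dicts to pd.DataFrame as in the question to get the table):

```python
import math

docA = "the cat sat on my face"
docB = "the dog sat on my bed"
bowA, bowB = docA.split(" "), docB.split(" ")
wordSet = set(bowA).union(bowB)

# Raw term counts per document
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1

def computeTF(wordDict, bow):
    # Term frequency: count divided by document length
    return {word: count / float(len(bow)) for word, count in wordDict.items()}

def computeIDF(docList):
    # Inverse document frequency: log(N / number of docs containing the word)
    N = len(docList)
    docFreq = dict.fromkeys(docList[0], 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                docFreq[word] += 1
    return {word: math.log(N / float(n)) for word, n in docFreq.items()}

def computeTFIDF(tfBow, idfs):
    return {word: val * idfs[word] for word, val in tfBow.items()}

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
idfs = computeIDF([wordDictA, wordDictB])

# The fix: computeTFIDF, not computeTF
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

print(tfidfBowA["cat"])   # 0.11552453009332421
print(tfidfBowA["the"])   # 0.0
```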
Hope that helps!
Upvotes: 1