Trying to replicate TFIDF example, multiplication returns wrong number

Question

I am trying to replicate a TFIDF example from this video: Using TF-IDF to convert unstructured text to useful features

As far as I can tell, my code is the same as in the example, except that I use .items() (Python 3) instead of .iteritems() (Python 2):

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")

wordSet= set(bowA).union(set(bowB))

wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

for word in bowA:
        wordDictA[word]+=1

for word in bowB:
        wordDictB[word]+=1

import pandas as pd

bag = pd.DataFrame([wordDictA, wordDictB])

print(bag)

def computeTF(wordDict,bow):
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
                tfDict[word] = count / float(bowCount)
        return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

def computeIDF(docList):
        import math
        idfDict = {}
        N = len(docList)
        #Count N of docs that contain word w
        idfDict = dict.fromkeys(docList[0].keys(),0)
        for doc in docList:
                for word, val in doc.items():
                        if val > 0:
                                idfDict[word] +=1
        for word, val in idfDict.items():
                idfDict[word] = math.log(N/ float(val))
        return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow,idfs):
        tfidf = {}
        for word, val in tfBow.items():
                tfidf[word] = val * idfs[word]
        return tfidf

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TF = pd.DataFrame([tfidfBowA, tfidfBowB])

print(TF)

The resulting table should look something like this, where the common words (on, my, sat, the) all have a score of 0:

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000   
1  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000  0.000000

But instead my resulting dataframe looks like this: every word that appears in a document gets the same score, and only the words absent from that document (bed/dog in the first row, cat/face in the second) score 0:

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833  0.020833   
1  0.020833  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833

If I print(idfs), I get:

{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}

Here, the words that appear in both docs have the value 0, which is then used to weigh down their importance, as they are common to all docs. Before the computeTFIDF function is used, the term frequencies for docA (tfBowA) look like this:

{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}

Since the function multiplies the two numbers, "my" (with an idf of 0) should be 0, and "cat" (with an idf of 0.6931) should be 0.6931 * 0.1666 = 0.1155, as per the example. Instead, I get 0.020833 for every word that is present in the doc. Is there something other than the items/iteritems syntax difference between Python 2 and 3 that is messing up my code?

Vivek Kalyanarangan · Accepted Answer

Near the end of your script, just before building the final DataFrame, change these two lines:

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

to:

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

To compute TF-IDF, you have to call computeTFIDF() instead of computeTF().
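To see where 0.020833 comes from, here is a small sketch of what the mistaken call does. It reuses computeTF from the question; tfBowA is written out by hand, and the idfs values are placeholders, because computeTF only uses the length of its second argument. Each term frequency gets divided by the 8 words in the vocabulary rather than multiplied by its idf.

```python
# Term frequencies for docA: each of its 6 words appears once, so tf = 1/6.
tfBowA = {"the": 1/6, "cat": 1/6, "sat": 1/6, "on": 1/6, "my": 1/6,
          "face": 1/6, "dog": 0.0, "bed": 0.0}

# Placeholder values: computeTF never reads them, it only takes len(idfs) == 8.
idfs = dict.fromkeys(tfBowA, 0.0)

def computeTF(wordDict, bow):
    # Same function as in the question: divides every value by len(bow).
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

# The mistaken call: idfs is treated as the "bag of words".
wrong = computeTF(tfBowA, idfs)

print(round(wrong["my"], 6))  # 0.020833, i.e. (1/6) / 8
print(wrong["dog"])           # 0.0, because 0 / 8 is still 0
```

So this has nothing to do with Python 2 versus Python 3; the idf values are never used in that call.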

Output

tfidfBowA
{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

tfidfBowB
{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}
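For reference, a compact version of the whole pipeline with the fix applied might look like the sketch below. It leaves out pandas so it runs with the standard library only, and countWords is a helper name I made up to replace the two counting loops; the other names follow the question.

```python
import math

docA = "the cat sat on my face"
docB = "the dog sat on my bed"
bowA, bowB = docA.split(" "), docB.split(" ")
wordSet = set(bowA) | set(bowB)

def countWords(bow):
    # Hypothetical helper: raw counts over the shared vocabulary.
    counts = dict.fromkeys(wordSet, 0)
    for word in bow:
        counts[word] += 1
    return counts

def computeTF(wordDict, bow):
    # Term frequency: count divided by the number of words in the document.
    return {word: count / len(bow) for word, count in wordDict.items()}

def computeIDF(docList):
    # Inverse document frequency: log(N / number of docs containing the word).
    N = len(docList)
    idfDict = {}
    for word in docList[0]:
        docsWithWord = sum(1 for doc in docList if doc[word] > 0)
        idfDict[word] = math.log(N / docsWithWord)
    return idfDict

def computeTFIDF(tfBow, idfs):
    # TF-IDF: term frequency multiplied by the word's idf.
    return {word: val * idfs[word] for word, val in tfBow.items()}

wordDictA, wordDictB = countWords(bowA), countWords(bowB)
idfs = computeIDF([wordDictA, wordDictB])

# The fix: call computeTFIDF here, not computeTF.
tfidfBowA = computeTFIDF(computeTF(wordDictA, bowA), idfs)
tfidfBowB = computeTFIDF(computeTF(wordDictB, bowB), idfs)

for word in sorted(wordSet):
    print(word, round(tfidfBowA[word], 6), round(tfidfBowB[word], 6))
```

Words shared by both documents come out as 0, and the words unique to one document get (1/6) * log(2), which matches the 0.115525 in your expected table.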

Hope that helps!
