Ullsokk
Ullsokk

Reputation: 717

Trying to replicate TFIDF example, multiplication returns wrong number

I am trying to replicate a TFIDF example from this video: Using TF-IDF to convert unstructured text to useful features

As far as I can tell, the code is the same as in the example, except for me using .items (python 3) instead of .iteritems (python 2):

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")

wordSet= set(bowA).union(set(bowB))

wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

for word in bowA:
        wordDictA[word]+=1

for word in bowB:
        wordDictB[word]+=1

import pandas as pd

bag = pd.DataFrame([wordDictA, wordDictB])

print(bag)

def computeTF(wordDict,bow):
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
                tfDict[word] = count / float(bowCount)
        return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

def computeIDF(docList):
        import math
        idfDict = {}
        N = len(docList)
        #Count N of docs that contain word w
        idfDict = dict.fromkeys(docList[0].keys(),0)
        for doc in docList:
                for word, val in doc.items():
                        if val > 0:
                                idfDict[word] +=1
        for word, val in idfDict.items():
                idfDict[word] = math.log(N/ float(val))
        return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow,idfs):
        tfidf = {}
        for word, val in tfBow.items():
                tfidf[word] = val * idfs[word]
        return tfidf

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TF = pd.DataFrame([tfidfBowA, tfidfBowB])

print(TF)

The resulting table should look something like this, where the common words(on, my, sat, the) all have a score of 0:

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000   
1  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000  0.000000 

But instead my resulting dataframe looks like this, with all words having the same score, except those just occuring in on on of the documents (bed\dog,cat\face):

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833  0.020833   
1  0.020833  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833 

if I print(idfs) I get

{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}

Here, the words that are included in both docs have the value 0, which will then be used to weigh down their importance, as they are common to all docs. Before the computeTFIDF function is used, the data looks like this:

{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}

Since the function will multiply the two numbers, "my" (with an idfs of 0) should be 0, and "dog" (with a idfs of 0.6931) should be (0,6931*0,1666 = 0,11), as per the example. Instead, I get the number 0.02083 for all but the words not present in the doc. Is there something other than the syntax for iter\iteritems between python 2 and 3 that is messing up my code?

Upvotes: 1

Views: 424

Answers (1)

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

In the second last part before casting to df, change these two lines -

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TO -

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

For computing Tfidf, you have to call the function computeTFIDF() instead of computeTF()

Output

tfidfBowA
{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

tfidfBowB
{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

Hope that helps!

Upvotes: 1

Related Questions