lvcasco

Reputation: 45

Right way to calculate the cosine similarity of two word-frequency-dictionaries in python?

I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a set.

Example:

line_tokenized = ['Karl', 'Donald', 'Ifwerson']

query_tokenized = ['Donald', 'Trump']

word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']

Now I have to create a dictionary each for the line and the query, containing word-frequency pairs. I thought about something like this:

line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

But the cosine similarity won't be calculated properly, as the key-value pairs are unordered. I came across OrderedDict(), but I don't understand how to implement some things, as its elements are stored as tuples.

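So my questions are:

How can I set the key-value pairs and have access to them afterwards?
How can I increment the value of a certain key?
Or is there an easier way to do this?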

Upvotes: 3

Views: 4693

Answers (2)

willeM_ Van Onsem

Reputation: 476557

You do not need to order the dictionaries to compute the cosine similarity; a simple lookup by key is sufficient:

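import math

def cosine_dic(dic1, dic2):
    """Cosine similarity of two word -> frequency dictionaries."""
    numerator = 0
    dena = 0
    for key1, val1 in dic1.items():
        # Words missing from dic2 contribute 0 to the dot product
        numerator += val1 * dic2.get(key1, 0.0)
        dena += val1 * val1
    denb = 0
    for val2 in dic2.values():
        denb += val2 * val2
    if dena == 0 or denb == 0:
        # An empty or all-zero vector has no direction; return 0.0 instead of dividing by zero
        return 0.0
    return numerator / math.sqrt(dena * denb)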

The .get(key1, 0.0) call looks up the key and returns 0.0 if it is not present. As a result, neither dic1 nor dic2 needs to store entries whose value is 0.
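To illustrate the point about zero entries, here is a minimal sketch comparing the dense dictionaries from the question with sparse ones that omit the zeros. The helper name cosine_sparse and the choice to return 0.0 for an empty vector are assumptions made for this example, not part of the answer above.

```python
import math

def cosine_sparse(a, b):
    """Cosine similarity of two word -> count dicts; missing keys count as 0."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty or all-zero vector gets similarity 0
    return dot / (norm_a * norm_b)

# Dense dictionaries as written in the question (zeros stored explicitly)
dense_line = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
dense_query = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

# Sparse dictionaries: only the words that actually occur
sparse_line = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1}
sparse_query = {'Donald': 1, 'Trump': 1}

dense_result = cosine_sparse(dense_line, dense_query)
sparse_result = cosine_sparse(sparse_line, sparse_query)
print(dense_result, sparse_result)
```

Both calls should give the same value, 1/sqrt(6), because a zero entry adds nothing to either the dot product or the norms.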

To answer your additional questions:

How can I set the key-value pairs and have access to them afterwards?

You simply state:

dic[key] = value

How can I increment the value of a certain key?

If you know for sure that the key is already part of the dictionary:

dic[key] += 1

otherwise you can use:

dic[key] = dic.get(key, 0) + 1

Or is there an easier way to do this?

You can use collections.Counter, which is a dict subclass that counts items for you and returns 0 for keys it has not seen.
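As a rough sketch of the Counter approach, the token lists from the question can be counted directly, with no need to build the word set or fill in zeros first. The function name cosine_counts is hypothetical, and returning 0.0 for an empty input is an assumption.

```python
import math
from collections import Counter

line_tokenized = ['Karl', 'Donald', 'Ifwerson']
query_tokenized = ['Donald', 'Trump']

# Counter builds the word -> frequency mapping in one step
line_counts = Counter(line_tokenized)
query_counts = Counter(query_tokenized)

def cosine_counts(a, b):
    """Cosine similarity of two Counters (hypothetical helper)."""
    # Indexing a Counter with a missing key yields 0 and does not insert the key
    dot = sum(n * b[word] for word, n in a.items())
    norm = math.sqrt(sum(n * n for n in a.values()) * sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0  # assumption: empty input gives 0.0

similarity = cosine_counts(line_counts, query_counts)
print(similarity)
```

Incrementing also needs no special case: query_counts['Trump'] += 1 works whether or not 'Trump' has been seen before.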

Upvotes: 3

bluesummers

Reputation: 12607

Using pandas and scipy:

import pandas as pd
from scipy.spatial.distance import cosine

line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

line_s = pd.Series(line_dict)
query_s = pd.Series(query_dict)

print(1 - cosine(line_s, query_s))

This code will output 0.40824829046386291

I didn't understand what you meant by "order" so I haven't dealt with that, but this code should be a good start for you.
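On the ordering concern: scipy's cosine compares its two inputs position by position and does not look at the Series index, so both Series need the same words in the same order. The snippet above satisfies this because both dictionaries list their keys identically. If the dictionaries are sparse or ordered differently, one possible way to align them is to reindex both onto a shared vocabulary. This is a sketch under that assumption; the name vocab is introduced only for illustration.

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Sparse dictionaries with different keys and a different key order
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1}
query_dict = {'Trump': 1, 'Donald': 1}

# Shared, sorted vocabulary so both vectors use the same positions
vocab = sorted(set(line_dict) | set(query_dict))

# reindex places each word at its vocabulary position; absent words become 0
line_s = pd.Series(line_dict).reindex(vocab, fill_value=0)
query_s = pd.Series(query_dict).reindex(vocab, fill_value=0)

similarity = 1 - cosine(line_s.to_numpy(), query_s.to_numpy())
print(similarity)
```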

Upvotes: 1
