Reputation: 45
I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a set.
Example:
line_tokenized = ['Karl', 'Donald', 'Ifwerson']
query_tokenized = ['Donald', 'Trump']
word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']
Now I have to create one dictionary each for the line and the query, containing word-frequency pairs. I thought about something like this:
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}
But the cosine similarity won't be calculated properly as the key-value pairs are unordered. I came across OrderedDict(), but I don't understand how to implement some things, as its elements are stored as tuples.
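So my questions are:
How can I set the key-value pairs and have access to them afterwards?
How can I increment the value of a certain key?
Or is there any other more easier way to do this?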
Upvotes: 3
Views: 4693
Reputation: 476557
You do not need to order the dictionary for cosine similarity; a simple lookup is sufficient:
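import math

def cosine_dic(dic1, dic2):
    numerator = 0
    dena = 0
    # Dot product and squared norm of dic1; keys missing from dic2 count as 0.
    for key1, val1 in dic1.items():
        numerator += val1 * dic2.get(key1, 0.0)
        dena += val1 * val1
    # Squared norm of dic2.
    denb = 0
    for val2 in dic2.values():
        denb += val2 * val2
    # Raises ZeroDivisionError if either dictionary is empty or all zeros.
    return numerator / math.sqrt(dena * denb)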
You simply use .get(key1, 0.0) to look up whether the element exists; if it does not, 0.0 is assumed. As a result, neither dic1 nor dic2 needs to store entries whose value is 0.
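As a quick check of that claim, here is a minimal sketch that runs the same calculation on the question's example with the zero entries left out. The guard for an all-zero vector is my own assumption (returning 0.0), not part of the function above.

```python
import math

def cosine_dic(dic1, dic2):
    """Cosine similarity of two sparse word-count dictionaries."""
    # Only keys of dic1 can contribute to the dot product;
    # keys absent from dic2 are treated as 0.
    numerator = sum(val * dic2.get(key, 0.0) for key, val in dic1.items())
    dena = sum(val * val for val in dic1.values())
    denb = sum(val * val for val in dic2.values())
    if dena == 0 or denb == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return numerator / math.sqrt(dena * denb)

# Sparse versions of the dictionaries from the question (no zero entries).
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1}
query_dict = {'Donald': 1, 'Trump': 1}

similarity = cosine_dic(line_dict, query_dict)
print(similarity)  # roughly 0.408, i.e. 1 / sqrt(6)
```

Only 'Donald' is shared, so the dot product is 1 and the norms are sqrt(3) and sqrt(2). The order of the keys never enters the calculation.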
To answer your additional questions:
How can I set the key-value pairs and have access to them afterwards?
You simply state:
dic[key] = value
How can I increment the value of a certain key?
If you know for sure that the key is already part of the dictionary:
dic[key] += 1
otherwise you can use:
dic[key] = dic.get(key,0)+1
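Put together, the .get pattern is enough to build a word-frequency dictionary from a token list. A small sketch follows; count_words is a hypothetical helper name, and the repeated 'Donald' is added only to show a count above 1.

```python
def count_words(tokens):
    """Build a word -> frequency dictionary from a list of tokens."""
    counts = {}
    for token in tokens:
        # .get returns 0 for a word seen for the first time.
        counts[token] = counts.get(token, 0) + 1
    return counts

line_counts = count_words(['Karl', 'Donald', 'Ifwerson', 'Donald'])
print(line_counts)
```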
Or is there any other more easier way to do this?
You can use a Counter (from the collections module), which is basically a dictionary with some added functionality.
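For illustration, a short sketch of how collections.Counter covers both the counting and the missing-key lookup. The token lists are the ones from the question; the variable names are my own.

```python
from collections import Counter

# Counter counts the tokens of an iterable directly.
line_counts = Counter(['Karl', 'Donald', 'Ifwerson'])
query_counts = Counter(['Donald', 'Trump'])

# Reading a missing key gives 0 instead of raising KeyError,
# and does not add the key to the Counter.
trump_in_line = line_counts['Trump']

# Incrementing also works for words that have not been seen yet.
counts = Counter()
for token in ['Donald', 'Trump', 'Donald']:
    counts[token] += 1

# Dot product of the two count vectors, as used in the numerator.
dot = sum(count * query_counts[word] for word, count in line_counts.items())
print(trump_in_line, counts, dot)
```

Because a Counter is a dict subclass, it can also be passed to a function that only relies on .items(), .values() and .get().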
Upvotes: 3
Reputation: 12607
Using pandas and scipy:
import pandas as pd
from scipy.spatial.distance import cosine
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}
line_s = pd.Series(line_dict)
query_s = pd.Series(query_dict)
print(1 - cosine(line_s, query_s))
This code will output 0.40824829046386291
I didn't understand what you meant by "order" so I haven't dealt with that, but this code should be a good start for you.
Upvotes: 1