Bear

Reputation: 5152

Calculate Euclidean distance between two vectors (bag of words) in Python

I use a dictionary to represent the word counts in an article.

For example, {"name": 2, "your": 10, "me": 20} represents that "name" appears twice, "your" appears 10 times and "me" appears 20 times.

So, is there a good way to calculate the Euclidean distance between these vectors? The difficulty is that the vectors have different lengths, and some vectors contain certain words while others do not.

I know I could write a long function to do this; I am just looking for a simpler and cleverer way. Thanks.

Edit: The objective is to measure the similarity between two articles and group them.

Upvotes: 4

Views: 9973

Answers (2)

G.Ahmed

Reputation: 146

You can also use cosine similarity between two vectors as in this link: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos.html
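As a rough sketch of what that could look like for word-count dictionaries: the function name `cosine_similarity` and the sample articles below are made up for illustration, and returning 0.0 for an empty article is just one possible convention. Words missing from one dictionary are treated as having a count of 0, so they contribute nothing to the dot product.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dicts (missing words count as 0)."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty articles
    return dot / (norm_a * norm_b)

# Hypothetical articles: doc2 has the same proportions as doc1 plus one extra word.
doc1 = {"name": 2, "your": 10, "me": 20}
doc2 = {"name": 1, "your": 5, "me": 10, "hello": 3}
print(cosine_similarity(doc1, doc2))
```

Because cosine similarity depends on the direction of the vectors and not their length, a long article and a short one with the same word proportions score close to 1, which may suit grouping articles better than a raw distance.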

Upvotes: 0

Blubber

Reputation: 2254

Something like:

math.sqrt(sum((a[k] - b[k])**2 for k in a.keys()))

Where a and b are dictionaries with the same keys. If you are going to compare these values between different pairs of vectors then you should make sure that each vector contains exactly the same words, otherwise your distance measure is going to mean nothing at all.

You could calculate the distance based on the intersection alone:

math.sqrt(sum((a[k] - b[k])**2 for k in set(a.keys()).intersection(set(b.keys()))))

Another option is to use the union and set unknown values to 0:

math.sqrt(sum((a.get(k, 0) - b.get(k, 0))**2 for k in set(a.keys()).union(set(b.keys()))))

But you have to carefully think about what it actually is that you are calculating.
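To make that caveat concrete, here is a small sketch comparing the intersection and union variants on the same pair of dictionaries. The helper names `euclidean_intersection` and `euclidean_union` and the two sample articles are assumptions for illustration only.

```python
import math

def euclidean_intersection(a, b):
    """Distance over words that appear in both dicts only."""
    keys = set(a) & set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def euclidean_union(a, b):
    """Distance over all words, treating a missing word as a count of 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

# Hypothetical articles that share only the word "name".
doc1 = {"name": 2, "your": 10}
doc2 = {"name": 2, "me": 20}

print(euclidean_intersection(doc1, doc2))  # 0.0: the one shared word has equal counts
print(euclidean_union(doc1, doc2))         # sqrt(10**2 + 20**2), about 22.36
```

The intersection variant reports these two articles as identical even though most of their words differ, while the union variant counts every unshared word as a difference. Which behaviour is appropriate depends on what the distance is meant to capture.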

Upvotes: 9
