Calculate Euclidean distance between two vectors (bag of words) in Python

Question

I use a dictionary to represent the word counts in an article.

For example, {"name": 2, "your": 10, "me": 20} represents that "name" appears twice, "your" appears 10 times and "me" appears 20 times.

So, is there a good way to calculate the Euclidean distance between these vectors? The difficulty is that the vectors have different lengths, and some vectors contain certain words while others do not.

I know I could write a long function to do this; I am just looking for a simpler and cleverer way. Thanks.

Edit: The objective is to measure the similarity between two articles and group them.

Blubber · Accepted Answer

Something like

math.sqrt(sum((a[k] - b[k])**2 for k in a.keys()))

Where a and b are dictionaries with the same keys. If you are going to compare these values between different pairs of vectors, then you should make sure that each vector contains exactly the same words; otherwise your distance measure is going to mean nothing at all.

You could calculate the distance based on the intersection alone:

math.sqrt(sum((a[k] - b[k])**2 for k in set(a.keys()).intersection(set(b.keys()))))

Another option is to use the union and treat words missing from either dictionary as having a count of 0:

math.sqrt(sum((a.get(k, 0) - b.get(k, 0))**2 for k in set(a.keys()).union(set(b.keys()))))

But you have to think carefully about what it is that you are actually calculating.
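To make the difference between the two options concrete, here is a minimal sketch that wraps both one-liners in functions and runs them on the same pair of dictionaries. The function names and the sample dictionaries doc1 and doc2 are made up for illustration; they are not from the question.

```python
import math

def euclidean_distance(a, b):
    """Distance over the union of words; a word missing from one dict counts as 0."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

def euclidean_distance_shared(a, b):
    """Distance over the intersection only; words unique to one dict are ignored."""
    words = set(a) & set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))

# Hypothetical word counts for two articles.
doc1 = {"name": 2, "your": 10, "me": 20}
doc2 = {"name": 5, "your": 6, "you": 12}

union_dist = euclidean_distance(doc1, doc2)          # sqrt(9 + 16 + 400 + 144)
shared_dist = euclidean_distance_shared(doc1, doc2)  # sqrt(9 + 16) = 5.0
```

On this data the intersection version gives 5.0, while the union version gives about 23.85, because "me" and "you" each appear in only one article and contribute their full counts. Which of the two is appropriate depends on whether a word that is absent from an article should count as a difference.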
