Reputation: 5152
I use a dictionary to represent the word counts in an article.
For example {"name": 2, "your": 10, "me": 20}
to represent that "name" appears twice, "your" appears 10 times and "me" appears 20 times.
So, is there a good way to calculate the Euclidean distance between these vectors? The difficulty is that the vectors have different lengths, and some contain words that others do not.
I know I could write a long function to do this; I'm just looking for a simpler and cleverer way. Thanks.
Edit: The objective is to measure the similarity between two articles and group them.
Upvotes: 4
Views: 9973
Reputation: 146
You can also use cosine similarity between two vectors as in this link: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos.html
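As a rough sketch of how that could look for word-count dictionaries (the function name cosine_similarity and the choice to return 0.0 for an empty article are my own assumptions, not taken from the linked page):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two word-count dicts; a missing word counts as 0."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an article with no words is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the result is normalized by the vector lengths, a long article and a short one with the same word proportions come out as similar, which may suit grouping better than a raw distance.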
Upvotes: 0
Reputation: 2254
Something like this (after import math):
math.sqrt(sum((a[k] - b[k])**2 for k in a.keys()))
Where a and b are dictionaries with the same keys (a key that is missing from b raises a KeyError). If you are going to compare these values between different pairs of vectors, make sure that each vector contains exactly the same words; otherwise your distance measure is going to mean nothing at all.
You could calculate the distance based on the intersection alone:
math.sqrt(sum((a[k] - b[k])**2 for k in set(a.keys()).intersection(set(b.keys()))))
Another option is to use the union and treat missing words as 0:
math.sqrt(sum((a.get(k, 0) - b.get(k, 0))**2 for k in set(a.keys()).union(set(b.keys()))))
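But think carefully about what you are actually calculating: the intersection ignores every word that appears in only one article, while the union counts each such word as a full difference from 0.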
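For reuse, the union variant could be wrapped in a small function. This is only a sketch: the name euclidean_distance and the sample dictionaries doc1 and doc2 are made up for illustration, and it assumes Python 3, where dict key views support the | operator.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two word-count dicts; a missing word counts as 0."""
    # Union of the vocabularies of both articles.
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

# Hypothetical sample data, based on the example in the question.
doc1 = {"name": 2, "your": 10, "me": 20}
doc2 = {"name": 3, "you": 4}
print(euclidean_distance(doc1, doc2))
```

Note that raw counts make long articles look far from short ones even when they are about the same topic, so you may want to normalize the counts first.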
Upvotes: 9