Reputation: 1
I have word vectors from a word2vec model in 500 and 1000 dimensions, and I am computing the Euclidean distance between some example vectors in both spaces. My problem is that I have read papers about the curse of dimensionality which say that Euclidean distance does not work well in high-dimensional space. But here the results are quite similar for both dimensionalities. I computed the Euclidean distance between the 1000-dimensional vectors:
distance between girl and boy
18.1915241847
cosine between girl and boy
0.785652955784
l1 distance between girl and boy
18.1915241847
distance between girl and neither
35.549272401
cosine between girl and neither
-0.0117403359958
distance between boy and neither
34.5523976193
cosine between boy and neither
-0.0129663966118
distance between girl and charger
28.65625576
cosine between girl and charger
0.119322070804
distance between either and neither
25.1379275604
cosine between either and neither
0.357230346462
In 500 dimensions it is:
distance between girl and boy
13.9897543378
cosine between girl and boy
0.864196148736
l1 distance between girl and boy
13.9897543378
distance between girl and neither
35.1385895164
cosine between girl and neither
-0.000815672156041
distance between boy and neither
34.1677078497
cosine between boy and neither
0.00703764567668
distance between girl and charger
27.689731876
cosine between girl and charger
0.113056294897
distance between either and neither
0.0
cosine between either and neither
1.0
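For reference, I compute these numbers roughly along the following lines (a minimal sketch with NumPy; the random placeholder vectors just stand in for the vectors loaded from my 500- and 1000-dimensional models):

import numpy as np

# Placeholder embeddings: in the real setup these come from the trained
# word2vec models (500-dim and 1000-dim); random values keep the sketch runnable.
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=500) for w in ["girl", "boy", "neither", "either", "charger"]}

def euclidean(a, b):
    return np.linalg.norm(a - b)

def l1(a, b):
    return np.abs(a - b).sum()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("distance between girl and boy", euclidean(wv["girl"], wv["boy"]))
print("cosine between girl and boy", cosine(wv["girl"], wv["boy"]))
print("l1 distance between girl and boy", l1(wv["girl"], wv["boy"]))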
Can someone explain why this is so? Is it related to sparsity?
Upvotes: 0
Views: 1140
Reputation: 1024
Based on your original question, I believe you are comparing distances between word vectors. The curse of dimensionality simply states that as the dimension increases, we also need more data to compensate for the growing volume of the space. If you trained word2vec on a sufficiently large corpus, the semantic relationships between words should still hold.
But your results do not look good. I would expect the cosine similarity between 'neither' and 'either' to be close to 0.0, since these two words are quite opposite. Can you compute the Euclidean distance on more obvious word pairs to check correctness? The examples on the original word2vec website demonstrate the similarity between the words "Paris" and "France".
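For example, with a gensim model you could sanity-check a few obvious pairs like this (a sketch; the file name is a placeholder for wherever your trained vectors are stored, and the casing of the words depends on your preprocessing):

from gensim.models import KeyedVectors

# Placeholder path: point this at your own trained vectors.
wv = KeyedVectors.load("word2vec_500d.kv")

# Obvious pairs should be clearly more similar than unrelated ones.
print(wv.similarity("Paris", "France"))    # expected: relatively high
print(wv.similarity("Paris", "charger"))   # expected: much lower
print(wv.most_similar("girl", topn=5))     # nearest neighbours by cosine similarity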
Good luck!
Upvotes: 0
Reputation: 2054
There is indeed the effect that the difference between the minimum and maximum distance of points distributed in high-dimensional space vanishes as the dimensionality goes to infinity. However, this effect assumes that the vector dimensions are independently and identically distributed. In your case you are still far from infinity, and word embedding vectors are very likely not independently and identically distributed, so the effect is not that strong.
What you may still notice is that the contrast between the distances in your example (girl to boy vs. girl/boy to neither) is smaller for the 1000-dimensional vectors (18 vs. 35) than for the 500-dimensional vectors (13 vs. 35). Leaving the distribution assumption aside, this should get worse as you increase the dimensionality further.
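The i.i.d. version of the effect is easy to see in a small simulation (a sketch with random points, not word vectors): the relative contrast between the nearest and the farthest point shrinks as the dimensionality grows.

import numpy as np

rng = np.random.default_rng(42)

# For i.i.d. uniform points, the relative contrast (max - min) / min between
# the farthest and nearest neighbour of a query shrinks as the dimension grows.
for dim in (10, 100, 500, 1000, 10000):
    points = rng.uniform(size=(200, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:6d}  relative contrast={contrast:.3f}")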
Upvotes: 0
Reputation: 26
It does not seem to be related to sparsity; it looks more like an attribute or text-representation problem. Notice that with 500 dimensions you get almost 100% similarity when computing the cosine between the 'neither' and 'either' vectors, but only about 35% similarity with 1000 dimensions. While the other comparisons come out roughly the same, this one simple example is different and suggests that something in your calculation or representation is wrong. Did you implement the Euclidean distance method yourself, or did you take it from somewhere? Did you implement your word2vec model yourself, or did you take it from somewhere?
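One quick way to rule out a bug in your own distance code is to compare it against a library implementation (a sketch; the random vectors here just stand in for the 'either' and 'neither' vectors from your model):

import numpy as np
from scipy.spatial import distance

# Stand-ins for the 'either' and 'neither' vectors from the model.
rng = np.random.default_rng(1)
v1, v2 = rng.normal(size=500), rng.normal(size=500)

print("euclidean:", distance.euclidean(v1, v2))      # should match your own result
print("l1:", distance.cityblock(v1, v2))             # L1 / Manhattan distance
print("cosine sim:", 1.0 - distance.cosine(v1, v2))  # SciPy returns cosine *distance*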
Upvotes: 1