code_lover

Reputation: 1

Machine learning: curse of dimensionality

I have word vectors from a word2vec model in 500 and 1000 dimensions, and I am computing the Euclidean distance between some example vectors in both dimensionalities. My problem is that I have read papers about the curse of dimensionality saying that Euclidean distance does not work well in high-dimensional space, yet here the results are quite similar for both dimensionalities. These are the results for the 1000-dim vectors:

distance between girl and boy: 18.1915241847
cosine between girl and boy: 0.785652955784
l1 distance between girl and boy: 18.1915241847
distance between girl and neither: 35.549272401
cosine between girl and neither: -0.0117403359958
distance between boy and neither: 34.5523976193
cosine between boy and neither: -0.0129663966118
distance between girl and charger: 28.65625576
cosine between girl and charger: 0.119322070804
distance between either and neither: 25.1379275604
cosine between either and neither: 0.357230346462

In 500 dim it is:

distance between girl and boy: 13.9897543378
cosine between girl and boy: 0.864196148736
l1 distance between girl and boy: 13.9897543378
distance between girl and neither: 35.1385895164
cosine between girl and neither: -0.000815672156041
distance between boy and neither: 34.1677078497
cosine between boy and neither: 0.00703764567668
distance between girl and charger: 27.689731876
cosine between girl and charger: 0.113056294897
distance between either and neither: 0.0
cosine between either and neither: 1.0

Can someone explain why this is so? Is it related to sparsity?
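For reference, distances and cosines like these can be computed along the following lines (a minimal sketch with numpy and gensim; the model file name is a placeholder, not my actual setup):

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path: load the 500-dim or 1000-dim model here.
    wv = KeyedVectors.load_word2vec_format("vectors_1000d.bin", binary=True)

    def euclidean(a, b):
        return np.linalg.norm(a - b)

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def l1(a, b):
        return np.sum(np.abs(a - b))

    for w1, w2 in [("girl", "boy"), ("girl", "neither"), ("either", "neither")]:
        v1, v2 = wv[w1], wv[w2]
        print("distance between", w1, "and", w2, ":", euclidean(v1, v2))
        print("cosine between", w1, "and", w2, ":", cosine(v1, v2))
        print("l1 distance between", w1, "and", w2, ":", l1(v1, v2))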

Upvotes: 0

Views: 1140

Answers (3)

unki

Reputation: 1024

Based on your original question, I believe you are comparing the distances between word vectors. The curse of dimensionality simply says that as the dimensionality increases, we also need more data to compensate for the increasing volume of the space. If you trained word2vec on a sufficiently large amount of data, the semantic relationships between words should still hold.

But your result does not look good. I would expect the cosine similarity between 'neither' and 'either' to be close to 0.0, since these two words are quite opposite. Can you compute the Euclidean distance on more obvious word pairs to check the correctness? The examples on the original word2vec website demonstrate the similarity between the words "Paris" and "France".
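For example, something along these lines with gensim (a sketch; the model path is a placeholder, and the exact casing of the words depends on your preprocessing):

    from gensim.models import KeyedVectors

    # Placeholder path: point this at your trained model.
    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # Word pairs with an obvious relationship should score clearly higher
    # than unrelated pairs.
    print(wv.similarity("paris", "france"))
    print(wv.similarity("paris", "charger"))

    # The nearest neighbours of a word are another quick diagnostic.
    print(wv.most_similar("paris", topn=5))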

Good luck!

Upvotes: 0

Alexander Bauer

Reputation: 2054

There is an effect whereby the difference between the minimum and maximum distances among points distributed in high-dimensional space vanishes, relative to the distances themselves, as the dimensionality goes to infinity. However, this effect assumes that the vector dimensions are independently and identically distributed. In your case you are still far from infinity, and word embedding vectors are very likely not independently and identically distributed, so the effect is not that strong.

What you may still notice is that the contrast of the distances in your example of boy vs. girl/neither is smaller for the 1000-dimensional vectors (18 vs. 35) than for the 500-dimensional vectors (13 vs. 35). Leaving the distribution assumption aside, this should get worse as you increase the dimensionality further.
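The effect itself is easy to see on synthetic i.i.d. data (a small simulation sketch, not your word vectors): the relative gap between the nearest and farthest neighbour of a query point shrinks as the dimensionality grows.

    import numpy as np

    rng = np.random.default_rng(0)

    # Contrast (max - min) / min of the distances from one query point to
    # 1000 random i.i.d. points; it shrinks as the dimension grows.
    for dim in [10, 100, 500, 1000, 10000]:
        points = rng.normal(size=(1000, dim))
        query = rng.normal(size=dim)
        dists = np.linalg.norm(points - query, axis=1)
        print(dim, (dists.max() - dists.min()) / dists.min())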

Upvotes: 0

Rodrigo Martínez

Reputation: 26

It does not seem to be about sparsity. It looks more like an attribute or text-representation problem. Notice that with 500 dimensions you are getting almost 100% similarity when computing the cosine between the 'neither' and 'either' vectors, but only about 35% similarity with 1000 dimensions. While the other comparisons come out roughly the same, this one simple example is different, and it says something in your calculation or representation is wrong. Did you implement the Euclidean distance yourself, or did you take it from somewhere? Did you implement your word2vec model, or did you take it from somewhere?
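A quick way to narrow it down (a sketch; adapt it to however you load your vectors) is to compare the raw vectors the two lookups return. If they come back identical in the 500-dim model, the bug is in the lookup or preprocessing rather than in the distance computation:

    import numpy as np

    def check_lookup(vec_a, vec_b):
        # Compare the raw vectors returned for two words.
        vec_a, vec_b = np.asarray(vec_a), np.asarray(vec_b)
        print("identical vectors:", np.allclose(vec_a, vec_b))
        print("euclidean distance:", np.linalg.norm(vec_a - vec_b))

    # e.g. check_lookup(model['either'], model['neither']) for the 500-dim model.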

Upvotes: 1
