Reputation: 20029
I'm trying out the Word2Vec tutorial at tensorflow (see here: https://www.tensorflow.org/tutorials/text/word2vec)
While all seems to work fine, the output is somewhat unexpected to me, especially the small cluster in the PCA. The 'closet' words in the embedding dimension also don't make much sense, especially compared to other examples.
Am I doing something (trivially) wrong? Or is this expected?
For completeness, I run this in the nvidia-docker image, but also found similar results running cpu only.
Here is the projected embedding showing the cluster.
Upvotes: 0
Views: 50
Reputation: 11240
There can be various reasons.
One reason is that this is due to the so-called hubness problem of embedding spaces, which is an artifact of the high-dimensional space. Some words end up close to a large part of the space and act as sort of hubs in the nearest neighbor search, so through these words, you can get quickly from everywhere to everywhere.
Another reason might be that the model is just undertrained for this particular word. Word embeddings are typically trained on very large datasets, such that every word appears in sufficiently many contexts. If a word does not appear frequently enough or in too ambiguous contexts, then it also ends up to be similar to basically everything.
Upvotes: 1