Rob Audenaerde

Reputation: 20029

Word2Vec Tensorflow tutorial weird output

I'm trying out the Word2Vec tutorial from TensorFlow (see here: https://www.tensorflow.org/tutorials/text/word2vec)

While everything seems to work fine, the output is somewhat unexpected to me, especially the small cluster in the PCA projection. The 'closest' words in the embedding space also don't make much sense, especially compared to other examples.

Am I doing something (trivially) wrong? Or is this expected?

For completeness, I ran this in the nvidia-docker image, but I also found similar results running CPU-only.

Here is the projected embedding showing the cluster: [image: PCA projection of the trained embeddings]
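For reference, this is roughly how I look up the closest words; a minimal sketch, assuming the `vectors.tsv` / `metadata.tsv` files exported at the end of the tutorial:

```python
import numpy as np

# Load the embedding matrix and vocabulary written by the tutorial's
# export step (one tab-separated vector per line, one word per line).
vectors = np.loadtxt("vectors.tsv", delimiter="\t")
with open("metadata.tsv", encoding="utf-8") as f:
    words = [line.strip() for line in f]

def closest(query, k=5):
    # Cosine similarity between the query word and every other word.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    best = np.argsort(-sims)[1:k + 1]  # skip the query word itself
    return [(words[i], float(sims[i])) for i in best]

print(closest("computer"))  # example query word
```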

Upvotes: 0

Views: 50

Answers (1)

Jindřich

Reputation: 11240

There can be various reasons.

One reason is the so-called hubness problem of embedding spaces, which is an artifact of high-dimensional geometry. Some words end up close to a large part of the space and act as hubs in nearest-neighbor search, so through these words you can get quickly from everywhere to everywhere.
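One way to check for this, sketched below under the assumption that the trained embeddings are available as a NumPy matrix, is to compute each word's k-occurrence (how often it appears in other words' nearest-neighbor lists); a heavy right tail in that distribution signals hubs:

```python
import numpy as np

# Placeholder matrix standing in for the real (vocab_size, dim) embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))

# Cosine similarity between all pairs of vectors.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, -np.inf)  # exclude self-matches

k = 10
# Indices of each point's k nearest neighbors.
neighbors = np.argpartition(-sim, k, axis=1)[:, :k]

# k-occurrence: how many neighbor lists each point appears in.
# In a hub-free space this is roughly k for every point.
k_occurrence = np.bincount(neighbors.ravel(), minlength=len(embeddings))
print("mean:", k_occurrence.mean(), "max:", k_occurrence.max())
```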

Another reason might be that the model is just undertrained for these particular words. Word embeddings are typically trained on very large datasets, so that every word appears in sufficiently many contexts. If a word does not appear frequently enough, or appears only in ambiguous contexts, it also ends up being similar to basically everything.
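A quick sanity check, a minimal sketch assuming the tokenized corpus from the tutorial's preprocessing is available as a list of token lists, is to count word frequencies and distrust the neighbors of rare words:

```python
from collections import Counter

# Placeholder standing in for the real tokenized corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]

freq = Counter(token for sentence in corpus for token in sentence)

# Threshold is an assumption; gensim's Word2Vec uses min_count=5 by default.
min_count = 5
rare = [w for w, c in freq.items() if c < min_count]
print(f"{len(rare)} words appear fewer than {min_count} times; "
      "their embeddings are likely undertrained")
```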

Upvotes: 1
