pandagrammer

Reputation: 871

Getting different results from deeplearning4j and word2vec

I trained a word embedding model using Google's word2vec. The output is a text file with one word and its vector per line.

I loaded this trained model in deeplearning4j:

    import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
    import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

    WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt"));
    Collection<String> lst = vec.wordsNearest("someWord", 10);

But the two lists of similar words, one from deeplearning4j's wordsNearest and one from word2vec's distance tool, are totally different, even though I used the same vector file.

Does anyone have a good understanding of how things work in deeplearning4j, and of where these differences might be coming from?

Upvotes: 0

Views: 1231

Answers (2)

gojomo

Reputation: 54183

Are the lists similar at all? Does either set seem more reasonable as similar words?

By my understanding, the lists should match almost exactly: both should be performing the same ranked cosine-similarity calculation on the same input vectors. If they don't, and especially if the original word2vec.c similar-words list looks more reasonable, then I would suspect a bug in DL4J.

Looking at the method that does the calculation, https://github.com/deeplearning4j/deeplearning4j/blob/f943ea879ab362f66b57b00754b71fb2ff3677a1/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/wordvectors/WordVectorsImpl.java#L385:

  • the code for the if (lookupTable() instanceof InMemoryLookupTable) {...} branch may be correct (I'm not familiar with the nd4j API), but it almost seems too ornate for the calculation of ranked cosine-similarity values;
  • the fallback case that follows does not appear to use unit-vector-normalized vector values (as would be usual): it uses getWordVectorMatrix() instead of getWordVectorMatrixNormalized(), which can change the ranking whenever the vectors' magnitudes differ (see the sketch after this list)

Upvotes: 1

kampta

Reputation: 4898

There can be multiple reasons why you are getting different vectors from different implementations (and hence different similar-word lists). To mention a few:

  • random initialisation of the vectors
  • negative sampling (the negative examples are drawn at random, so they differ from run to run)
  • threading (asynchronous updates from multiple threads make training non-deterministic)

If your number of documents (training data) is much larger than the number of unique words (vocabulary size), the word vectors will stabilise after a few iterations, and you should find that the two implementations agree on at least some of the most similar words. One quick way to check is sketched below.

Upvotes: 0
