Reputation: 871
I trained a word embedding model using Google's word2vec. The output is a file that contains each word and its vector.
I loaded this trained model in deeplearning4j:
WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt"));
Collection<String> lst = vec.wordsNearest("someWord", 10);
But the two lists of similar words obtained from deeplearning4j's wordsNearest and from word2vec's distance function are totally different, although I used the same vector file.
Does anyone have a good understanding of how things work in deeplearning4j and where these differences come from?
Upvotes: 0
Views: 1231
Reputation: 54183
Are the lists similar at all? Does either set seem more reasonable as similar words?
By my understanding, the lists should match almost exactly - they should be implementing the same calculation on the same input vectors. If they don't, and especially if the original word2vec.c similar-list looks more reasonable, then I would suspect a bug in DL4J.
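One quick check, mirroring the code in the question: print DL4J's own similarity score next to each of its nearest words, so the numbers can be compared line by line with what word2vec.c's distance tool prints for the same query. A minimal sketch (the file name and query word are the placeholders from the question; the class name is made up):

import java.io.File;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

public class CompareNeighbours {
    public static void main(String[] args) throws Exception {
        // Load the word2vec-format text vectors, as in the question.
        WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt"));

        // Print DL4J's 10 nearest words together with its similarity score for each,
        // so they can be compared line by line with word2vec.c's distance output.
        for (String w : vec.wordsNearest("someWord", 10)) {
            System.out.printf("%s\t%.6f%n", w, vec.similarity("someWord", w));
        }
    }
}

If the scores themselves disagree for the same word pairs, that points at the similarity calculation rather than at the loaded vectors.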
Looking at the method doing the calculation – https://github.com/deeplearning4j/deeplearning4j/blob/f943ea879ab362f66b57b00754b71fb2ff3677a1/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/wordvectors/WordVectorsImpl.java#L385 :
The if (lookupTable() instanceof InMemoryLookupTable) {...} branch may be correct – I'm not familiar with the nd4j API – but it almost seems too ornate for the calculation of ranked cosine-similarity values; it also appears to use getWordVectorMatrix() instead of getWordVectorMatrixNormalized(), i.e. the raw rather than the unit-normalized word vectors, which could account for the different rankings.
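To illustrate why that would matter: ranking by plain dot product is only equivalent to ranking by cosine similarity when the vectors are unit-normalized, so a routine that assumes normalized vectors but is fed raw ones can reorder the list. A small self-contained sketch, not DL4J's actual code, with made-up two-dimensional vectors:

import java.util.*;
import java.util.stream.Collectors;

public class NormalizationDemo {
    // Dot product of two raw vectors; only equals cosine similarity if both have unit length.
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Cosine similarity = dot product of the unit-normalized vectors.
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Rank the vocabulary words against a query vector using the given scoring function.
    static List<String> rank(Map<String, double[]> vocab, double[] query,
                             java.util.function.BiFunction<double[], double[], Double> score) {
        return vocab.entrySet().stream()
                .sorted((x, y) -> Double.compare(score.apply(query, y.getValue()),
                                                 score.apply(query, x.getValue())))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        Map<String, double[]> vocab = new LinkedHashMap<>();
        vocab.put("a", new double[]{0.9, 0.1});   // short vector, almost the same direction as the query
        vocab.put("b", new double[]{5.0, 4.0});   // long vector, noticeably different direction

        System.out.println("dot-product ranking: " + rank(vocab, query, NormalizationDemo::dot));
        System.out.println("cosine ranking:      " + rank(vocab, query, NormalizationDemo::cosine));
    }
}

Here the dot-product ranking puts "b" first while the cosine ranking puts "a" first, purely because "b" has the larger norm.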
Upvotes: 1
Reputation: 4898
There can be multiple reasons why you are getting different vectors from different implementations (and hence differences in the similar words). I can mention a few:
If your number of documents (training data) >> number of unique words (vocabulary size), the word vectors will stabilise after a few iterations, and you should find that some of the most similar words from the two implementations match.
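One way to quantify how similar the two lists actually are is to measure their overlap; a small sketch, with placeholder lists standing in for the DL4J and word2vec.c outputs:

import java.util.*;

public class ListOverlap {
    // Fraction of words shared between two top-N lists (intersection size / N).
    static double overlap(Collection<String> a, Collection<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / a.size();
    }

    public static void main(String[] args) {
        List<String> fromDl4j = Arrays.asList("king", "prince", "monarch");      // placeholder
        List<String> fromWord2vecC = Arrays.asList("queen", "king", "monarch");  // placeholder
        System.out.println("overlap: " + overlap(fromDl4j, fromWord2vecC));      // 2/3 here
    }
}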
Upvotes: 0