Ion C

Reputation: 323

How to Export Gensim Word2Vec Model with Ngram Weights for DL4J?

I'm quite new to NLP. I'm trying to use a model trained with Gensim in DL4J. I'm saving the model with

w2v_model.wv.save_word2vec_format("path/to/w2v_model.bin", binary=True)

and afterwards I'm loading it with

Word2Vec w2vModel = WordVectorSerializer.readWord2VecModel("path/to/w2v_model.bin"); 

The model works well except for the handling of out-of-vocabulary (OOV) words. In Gensim, the model seems to calculate vectors for OOV words from their character n-grams, but in DL4J it returns an empty vector for them.
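For reference, this is roughly the behavior I'm seeing on the Gensim side (the token is just a made-up example, and I'm on the Gensim 4.x API):

oov_word = "definitelynotinvocabulary"
print(oov_word in w2v_model.wv.key_to_index)  # False -- the word is not in the vocabulary
print(w2v_model.wv[oov_word][:5])             # ...yet a vector is still returned for it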

My questions are:

  1. Is there a way to export the n-gram weights along with the model from Gensim so that DL4J can use them?
  2. If exporting the n-gram weights is not possible, is there a method to reconstruct them on the DL4J side to achieve similar results for OOV words as in Gensim?

Any guidance or suggestions would be greatly appreciated.

Upvotes: 0

Views: 100

Answers (1)

gojomo

Reputation: 54208

The core original word2vec algorithm – and the Word2Vec model class in Gensim – has no ability to synthesize vectors for OOV words using character n-grams.

That's only a feature of FastText models (and the FastText model class in Gensim) – so if you're seeing that working in Gensim, your w2v_model variable may actually hold a Gensim FastText object.
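A quick way to confirm that is to check the type of the object you trained; a minimal sketch, assuming Gensim 4.x:

from gensim.models import FastText

print(type(w2v_model).__name__)         # 'FastText' here would explain the OOV behavior
print(isinstance(w2v_model, FastText))  # True -> subword n-grams are in play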

Further, the plain {word, vector}-per-line format saved by Gensim's .save_word2vec_format() (whether binary=False or binary=True) doesn't save any subword n-grams, even if used on a FastText object. (It just saves the full-word vectors for in-vocabulary words.)

Gensim's FastText can save models in the full raw model format also used by Facebook's original FastText implementation – see FastText.save_facebook_model().
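For example (a sketch, assuming your model really is a Gensim FastText instance and a Gensim 4.x install, where this helper is also exposed as a module-level function):

from gensim.models.fasttext import save_facebook_model

# Writes the full Facebook-FastText .bin format, including the subword
# n-gram buckets needed to synthesize vectors for OOV words.
save_facebook_model(w2v_model, "path/to/ft_model.bin")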

But to bring that to a Java environment, you'd need to find a true FastText implementation that also reads that format. I don't see any evidence that the Word2Vec class in DL4J supports FastText features or can load FastText models.

There is an org.deeplearning4j.models.fasttext.FastText class – which seems to wrap Facebook's native C++ FastText implementation via another library, com.github.jfasttext.JFastText. That is, it's not a true Java implementation, but it makes the model accessible to Java code.

I have no idea of the completeness/reliability of this approach; it's a little fishy to me that a class (JFastText) not from a GitHub engineer is named via a com.github path. But presumably the deeplearning4j maintainers know what they're doing, and this may be your best option for loading a fully-capable (character-n-gram features) FastText model for use in DL4J.

Upvotes: 0
