Reputation: 323
I'm quite new to NLP. I'm trying to use a model trained with Gensim in DL4J. I'm saving the model with
w2v_model.wv.save_word2vec_format("path/to/w2v_model.bin", binary=True)
and afterwards I'm loading it with
Word2Vec w2vModel = WordVectorSerializer.readWord2VecModel("path/to/w2v_model.bin");
The model works well except for the handling of out-of-vocabulary (OOV) words. In Gensim, it seems to calculate vectors for OOV words based on the word's n-grams, but in DL4J, it provides an empty vector for them.
My questions are:
Any guidance or suggestions would be greatly appreciated
Reputation: 54208
Neither the original word2vec algorithm nor the Word2Vec model class in Gensim can synthesize vectors for OOV words from character n-grams. That's only a feature of FastText models (and the FastText model class in Gensim), so if you're seeing that working in Gensim, your w2v_model variable may actually hold a Gensim FastText object.
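To make the difference concrete, here is a rough pure-Python sketch of how a FastText-style model can build a vector for an unseen word from character n-grams. It is a simplification, not Gensim's or Facebook's actual code: real implementations hash n-grams into a fixed number of buckets, and the names char_ngrams, synthesize_vector and the toy n-gram table are made up for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def synthesize_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that appear in the toy table.

    Simplification: a plain dict lookup stands in for the hashed buckets
    that real FastText implementations use.
    """
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        # No shared n-grams at all: nothing to build the vector from.
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical 2-dimensional n-gram vectors, as if learned during training.
toy_ngram_vectors = {"<ca": [1.0, 0.0], "at>": [0.0, 1.0]}

print(char_ngrams("cat", 3, 4))
print(synthesize_vector("cat", toy_ngram_vectors, dim=2))
```

A plain word2vec model has no such n-gram table, only one vector per full word, which is why it has nothing to fall back on for an OOV word.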
Further, the plain {word, vector}-per-line format saved by Gensim's .save_word2vec_format() (whether binary=False or binary=True) doesn't save any subword n-grams, even if used on a FastText object. (It just saves the full-word vectors for in-vocabulary words.)
Gensim's FastText can save models in the full raw model format also used by Facebook's original FastText implementation; see FastText.save_facebook_model().
But to bring that to a Java environment, you'd need to find a true FastText implementation that also reads that format. I don't see any evidence that the Word2Vec class in DL4J supports FastText features or can load FastText models.
There is an org.deeplearning4j.models.fasttext.FastText class, which seems to wrap Facebook's native C++ FastText implementation via another class, com.github.jfasttext.JFastText. That is, it's not a true Java implementation, but it makes the model accessible to Java code.
I have no idea of the completeness or reliability of this approach. It's a little fishy to me that a class (JFastText) not written by a GitHub engineer sits under a com.github package path, but presumably the deeplearning4j maintainers know what they're doing, and this may be your best option for loading a fully capable FastText model (with character n-gram features) for use in DL4J.