Reputation: 71
I have an issue similar to the one discussed here: gensim word2vec - updating word embeddings with newcoming data
I have the following code that saves a model as text8_gensim.bin:

from gensim.models import word2vec

sentences = word2vec.Text8Corpus('train/text8')
model = word2vec.Word2Vec(sentences, size=200, workers=12, min_count=5, sg=0, window=8, iter=15, sample=1e-4, alpha=0.05, cbow_mean=1, negative=25)
model.save("./savedModel/text8_gensim.bin")
Here is the code that adds more data to the saved model (after loading it):

fname = "savedModel/text8_gensim.bin"
model = word2vec.Word2Vec.load(fname)
model.epochs = 15

# Custom words
docs = ["start date", "end date", "eft date", "termination date"]
model.build_vocab(docs, update=True)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

model.wv.similarity('start', 'eft')
The model loads fine; however, when I try to call the model.wv.similarity function, I get the following error:
KeyError: "word 'eft' not in vocabulary"
Am I missing something here?
Upvotes: 1
Views: 1266
Reputation: 54153
Those docs aren't in the right format: each text should be a list-of-string-tokens, not a string.

And the same min_count threshold applies to incremental updates: words less frequent than that threshold will be ignored. (Since a min_count higher than 1 is almost always a good idea, a word that appears only once in any update will never be added to the model.)
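For example, a minimal sketch of an update in the required format, reusing the filename from the question (repeating the tiny texts is only to illustrate clearing the min_count threshold, not a recommendation):

from gensim.models import word2vec

# Each text is a list of string tokens, not a raw string
docs = [["start", "date"], ["end", "date"], ["eft", "date"], ["termination", "date"]]

# With the original min_count=5, a word must appear at least 5 times in
# the update texts to enter the vocabulary; repeating these tiny texts
# is only to get each word past that threshold for the sake of the demo
docs = docs * 5

model = word2vec.Word2Vec.load("savedModel/text8_gensim.bin")
model.build_vocab(docs, update=True)
model.train(docs, total_examples=len(docs), epochs=model.epochs)
print(model.wv.similarity('start', 'eft'))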
Incrementally adding words also introduces lots of murky issues with no clear right answers regarding model quality: balancing the effects of early-vs-late training, managing the alpha learning-rate, and so forth. It won't necessarily improve your model; with the wrong choices it could make it worse, by adjusting some words with your new texts in ways that move them out of compatible alignment with words that only appeared in earlier batches.
So be careful, and always verify with a repeatable, automated, quantitative quality check that your changes are actually helping. (The safest approach is to retrain with the old and new texts in one combined corpus, so that all words are trained against one another equally on all the data.)
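A sketch of that combined retraining, reusing names from the question (the combined output filename is just illustrative, and the whole corpus is loaded into memory for simplicity):

import itertools
from gensim.models import word2vec

old_sentences = word2vec.Text8Corpus('train/text8')
new_sentences = [["start", "date"], ["end", "date"], ["eft", "date"], ["termination", "date"]]

# One vocabulary-build and training pass over everything, so all words
# are trained against one another on all the data; the new words still
# need at least min_count occurrences in the combined corpus
combined = list(itertools.chain(old_sentences, new_sentences))
model = word2vec.Word2Vec(combined, size=200, min_count=5, window=8, iter=15)
model.save("./savedModel/text8_gensim_combined.bin")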
Upvotes: 2