florient

Reputation: 1

word2vec gensim update learning rate

I trained a w2v model on a big corpus, and I want to update it with a smaller one with new sentences (and new words).

In the first big training, I took the default parameters for alpha (0.025, with linear decay to 0.0001). Now I want to use model.train to update the model, but from the docs I don't understand which initial and final learning rates will be used during this update training.

On the one hand, if the update also runs from 0.025 with linear decay to 0.0001, it will be too strong for words that already appeared a lot in the first big corpus, whose vectors will be heavily changed; on the other hand, for new words (added with model.build_vocab(sentences, update=True)), a low learning rate of 0.0001 is too small.

So my questions are:

  1. What is the default behaviour of model.train on new sentences regarding the learning rate?
  2. How should I choose the learning rate to take this old/new-word issue into account?

  3. [aside question] Why, when I call model.train twice on the same sentences, does the second call not update the vectors?

Upvotes: 0

Views: 2615

Answers (1)

gojomo

Reputation: 54173

While you can keep training a Word2Vec model with newer examples, unless the old examples are also re-presented in an interleaved fashion, those new examples may not make the model any better – no matter how well you adjust the alpha.

That's because while training on the new examples, the model is only being nudged to get better at predicting their words, in those new contexts. If there are words missing from the new texts, their word-vectors remain unadjusted as the rest of the model drifts. Even to the extent the same words repeat, their new contexts will presumably be different in some important ways – or else why keep training with new data? – which incrementally dilutes or obsoletes all influence of the older training.

There's even a word for the tendency (but far from certainty) of neural-networks to worsen when presented new data: catastrophic forgetting.

So, the most supportable policy is to re-train with all relevant data mixed together, to be sure it all has equal influence. If you're improvising some other shortcut, you're in experimental territory, and there's little reliable documentation or published work that can make strong suggestions about the relative balance of learning-rates/epoch-counts/etc. Any possible answer would also depend heavily on the relative sizes of the corpora and vocabularies, both at first and on any subsequent updates, and also on how important factors like vector-stability-over-time, or relative-quality-of-different-vectors, are to your specific project. So there'd be no one answer – just what tends to work in your particular setup.

(There's an experimental feature in gensim Word2Vec – some internal model properties whose names end in _lockf. That stands for 'lock-factor'. They match 1-for-1 with the word-vectors, and for any slot where the lock-factor is set to 0.0, the corresponding word-vector ignores training updates. In this way, you can essentially 'freeze' some words – such as those you're confident won't be improved by more training – while letting others still update. This might help with drift/forgetting issues during updates, but the questions of relative quality and correct alpha/epochs remain murky, requiring project-by-project experimentation.)

Specifically with regard to your numbered questions:

(1) Each call to train() will do the specified epochs number of passes over the data, and smoothly manage the learning-rate from the model's configured starting alpha down to min_alpha (unless you override those with the start_alpha and end_alpha parameters to train()).

(2) As above, there's no established rule-of-thumb because the issue is complicated, incremental training in this style isn't guaranteed to help, and even where it might help it'd depend highly on non-generalizable project specifics.

(3) If a second call to train() causes no changes to vectors, there may be something wrong with your corpus-iterator. Enable logging to at least the INFO level and make sure train() is taking the time, and showing the incremental progress, that indicates real model updates are happening.
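Enabling that logging is one line:

```python
import logging

# INFO-level logging makes gensim report vocabulary scans, per-epoch
# progress and effective-alpha values, so a train() call that silently
# does nothing (e.g. one fed an already-exhausted generator) is easy to spot.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # replace any handlers configured earlier in the session
)
```

A common cause of a no-op second train() is a one-shot generator that was exhausted by the first call; materializing the corpus as a list, or using a restartable iterable, avoids this.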

Upvotes: 1
