Aviade

Reputation: 2097

Gensim Word2vec model is not converging

I'm training a Word2vec model with Gensim on the well-known Wikipedia dump provided by Tobias Schnabel at the following link: http://www.cs.cornell.edu/~schnabts/eval/index.html (about 4 GB).

I would like to understand how many training epochs I should run until the model converges.

I added the following code:

    from gensim.models import Word2Vec

    model = Word2Vec(size=self._number_of_dimensions_in_hidden_layer,
                     window=self._window_size,
                     min_count=3,
                     max_vocab_size=self._max_vocabulary_size,
                     sg=self._use_cbow,
                     seed=model_seed,
                     compute_loss=True,
                     iter=self._epochs)
    model.build_vocab(sentences)

    learning_rate = 0.025
    step_size = (learning_rate - 0.001) / self._epochs

    for i in range(self._epochs):
        end_lr = learning_rate - step_size
        trained_word_count, raw_word_count = model.train(sentences, compute_loss=True,
                                                         start_alpha=learning_rate,
                                                         end_alpha=learning_rate,
                                                         total_examples=model.corpus_count,
                                                         epochs=1)
        loss = model.get_latest_training_loss()
        print("iter={0}, loss={1}, learning_rate={2}".format(i, loss, learning_rate))
        learning_rate *= 0.6

    model.save(model_name_path)

However, I cannot see that the model is converging:

iter=0, loss=76893000.0, learning_rate=0.025
iter=1, loss=74870528.0, learning_rate=0.015
iter=2, loss=73959232.0, learning_rate=0.009
iter=3, loss=73605400.0, learning_rate=0.005399999999999999
iter=4, loss=73224288.0, learning_rate=0.0032399999999999994
iter=5, loss=73008048.0, learning_rate=0.0019439999999999995
iter=6, loss=72935888.0, learning_rate=0.0011663999999999997
iter=7, loss=72774304.0, learning_rate=0.0006998399999999999
iter=8, loss=72642072.0, learning_rate=0.0004199039999999999
iter=9, loss=72624384.0, learning_rate=0.00025194239999999993
iter=10, loss=72700064.0, learning_rate=0.00015116543999999996
iter=11, loss=72478656.0, learning_rate=9.069926399999997e-05
iter=12, loss=72486744.0, learning_rate=5.441955839999998e-05
iter=13, loss=72282776.0, learning_rate=3.2651735039999986e-05
iter=14, loss=71841968.0, learning_rate=1.9591041023999992e-05
iter=15, loss=72119848.0, learning_rate=1.1754624614399995e-05
iter=16, loss=72054544.0, learning_rate=7.0527747686399965e-06
iter=17, loss=71958888.0, learning_rate=4.2316648611839976e-06
iter=18, loss=71933808.0, learning_rate=2.5389989167103985e-06
iter=19, loss=71739256.0, learning_rate=1.523399350026239e-06
iter=20, loss=71660288.0, learning_rate=9.140396100157433e-07

I don't understand why the loss is not decreasing further and stays roughly constant at around 71M.

Upvotes: 1

Views: 1324

Answers (1)

gojomo

Reputation: 54173

The model is converged when the loss over a full epoch stops improving. There's no guarantee the loss will get arbitrarily small: the model simply reaches a point where it can't improve on one (context)->(word) prediction without worsening another. So there's not necessarily anything wrong here; that may be the best loss possible, with a model of this complexity, on this data.
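
For example, here is a minimal sketch (not from the original code) of watching the loss reported at the end of each epoch with gensim's CallbackAny2Vec hook, assuming a gensim 3.x version that provides it; the usage line is purely illustrative:

    from gensim.models.callbacks import CallbackAny2Vec

    class EpochLossLogger(CallbackAny2Vec):
        """Prints the loss reported by gensim at the end of every epoch."""

        def __init__(self):
            self.epoch = 0

        def on_epoch_end(self, model):
            # In some gensim versions this value accumulates across epochs
            # rather than resetting, so compare successive readings with care.
            print("epoch {0}: reported loss {1}".format(
                self.epoch, model.get_latest_training_loss()))
            self.epoch += 1

    # Usage (illustrative): Word2Vec(sentences, iter=5, compute_loss=True,
    #                                callbacks=[EpochLossLogger()])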

Note that the loss computation is a somewhat new and experimental option in gensim, and even as of 3.5.0 there may be issues. (See, for example, this PR.) It could be better to optimize your meta-parameters, like the number of training epochs, based on some other measure of word-vector quality.
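
For instance, a hedged sketch of such quality checks, assuming a trained model and the evaluation files bundled with gensim 3.x (the probe word is just an illustration, and evaluate_word_analogies requires gensim 3.4+):

    from gensim.test.utils import datapath

    # Spot-check nearest neighbours for probe words you know well.
    print(model.wv.most_similar('king', topn=5))

    # Correlation with human similarity judgements on the bundled WordSim-353
    # set; returns Pearson and Spearman statistics plus the OOV ratio.
    pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(
        datapath('wordsim353.tsv'))
    print(spearman)

    # Accuracy on the analogy questions shipped with gensim.
    analogy_score, sections = model.wv.evaluate_word_analogies(
        datapath('questions-words.txt'))
    print(analogy_score)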

Note that a typical default for the number of training iterations, for a large diverse corpus where words appear evenly throughout, is 5. (This was the value used in Google's original word2vec.c.)

Separately, it's usually a bad, error-prone idea to call train() more than once and self-manage the alpha learning-rate, rather than just calling it once with the desired number of epochs and letting it smoothly decay the effective learning-rate via its own gradual linear schedule.
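
A minimal sketch of that single-call pattern, assuming the same sentences iterable as in the question; the concrete hyperparameter values are placeholders:

    from gensim.models import Word2Vec

    model = Word2Vec(sentences,
                     size=300,           # placeholder dimensionality
                     window=5,
                     min_count=3,
                     sg=0,               # 0 = CBOW, 1 = skip-gram
                     iter=5,             # all epochs in a single run
                     alpha=0.025,        # gensim decays the effective rate
                     min_alpha=0.0001,   # linearly from alpha to min_alpha
                     compute_loss=True)
    model.save(model_name_path)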

(I see you're using a geometric decay, which isn't typical. And you're doing extra step_size/end_lr calculations that aren't being used. Improvising non-standard learning-rate handling is unlikely to help unless that's the focus of your work, with a setup that's already working well as a baseline.)

Other notes:

  • you seem to be enabling skip-gram (not CBOW) mode if your _use_cbow variable is True-ish, which is confusing
  • note that max_vocab_size will cause an extreme trimming of words during the initial corpus scan if the running size hits that threshold, and thus may result in a vocabulary size smaller than your configured value. Ideally you'd set it as high as your memory allows, for the most accurate possible survey counts, then use min_count as the main mechanism to trim the final vocabulary to the desired size (see the sketch after this list).
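
A hedged sketch of that vocabulary setup, with purely illustrative values:

    from gensim.models import Word2Vec

    model = Word2Vec(size=300,
                     window=5,
                     max_vocab_size=None,  # no mid-scan pruning; cap only if RAM requires it
                     min_count=10,         # illustrative threshold doing the final trimming
                     sg=1)                 # sg=1 selects skip-gram; sg=0 selects CBOW
    model.build_vocab(sentences)
    print(len(model.wv.vocab))             # vocabulary size left after min_count trimming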

Upvotes: 4
