OneAndOnly

Reputation: 1056

Gensim's word2vec has a loss of 0 from epoch 1?

I am using the Word2Vec module of the Gensim library to train a word embedding. The dataset is 400k sentences with 100k unique words (it is not English).

I'm using this code to monitor and calculate the loss:

class MonitorCallback(CallbackAny2Vec):
    def __init__(self, test_words):
        self._test_words = test_words

    def on_epoch_end(self, model):
        print("Model loss:", model.get_latest_training_loss())  # print loss
        for word in self._test_words:  # show wv logic changes
            print(model.wv.most_similar(word))


monitor = MonitorCallback(["MyWord"])  # monitor with demo words

w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, callbacks=[monitor])

w2v_model.build_vocab(tokenized_corpus)

words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

print("[*] Training...")

# Train Word Embeddings
w2v_model.train(tokenized_corpus, total_examples=len(tokenized_corpus), epochs=W2V_EPOCH)

The problem is that from epoch 1 the loss is 0, and the vectors of the monitored words don't change at all!

[*] Training...
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0
Model loss: 0.0

So what is the problem here? Is this normal? The tokenized corpus is a list of lists, something like tokenized_corpus[0] = ["word1", "word2", ...].

I googled, and it seems some old versions of Gensim had a problem with calculating the loss, but those reports are from almost a year ago, so it should be fixed by now, right?

I tried the code provided in the answer to this question as well, but the loss is still 0:

Loss does not decrease during training (Word2Vec, Gensim)

EDIT 1: After adding compute_loss=True, the loss shows up, but it keeps going higher and higher, and the top similar words and their similarities don't change at all:

Model loss: 2187903.5
Model loss: 3245492.0
Model loss: 4103624.5
Model loss: 4798541.0
Model loss: 5413940.0
Model loss: 5993822.5
Model loss: 6532631.0
Model loss: 7048384.5
Model loss: 7547147.0

Upvotes: 2

Views: 1739

Answers (1)

gojomo

Reputation: 54153

The top issue with your code is that you haven't used the Word2Vec initialization parameter that toggles loss-tracking on: compute_loss=True

(See 'parameters' section of https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec )

Even with that fix, the loss-reporting is still quite buggy (as of gensim-3.8.3 and this writing in August 2020):

  • it's not the per-epoch total or per-example average one might expect. (If you need that, as a workaround your callback should remember the last value and compute the delta, or reset the model's internal loss counter to 0.0, at each epoch's end.)
  • it definitely loses precision in larger training runs, eventually becoming useless. (This may not be an issue for you.)
  • it might lose some tallies due to multithreaded value-overwriting. (This may not be a practical issue for you, depending on why you're consulting the loss value.)
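The first workaround above can be sketched as a callback that remembers the previous cumulative tally and prints the per-epoch difference. This is only a sketch of the idea, assuming the model was constructed with compute_loss=True; the class name DeltaLossLogger and the fallback import are mine, not from the original post:

```python
try:
    from gensim.models.callbacks import CallbackAny2Vec
except ImportError:
    # Stub so the sketch can be read/run without gensim installed.
    class CallbackAny2Vec:
        pass


class DeltaLossLogger(CallbackAny2Vec):
    """Print per-epoch loss by differencing gensim's cumulative loss tally."""

    def __init__(self):
        self.epoch = 0
        self.last_cumulative_loss = 0.0

    def on_epoch_end(self, model):
        # get_latest_training_loss() returns a running total, not a per-epoch value.
        cumulative = model.get_latest_training_loss()
        print(f"Epoch {self.epoch}: loss {cumulative - self.last_cumulative_loss}")
        self.last_cumulative_loss = cumulative
        self.epoch += 1
```

Pass an instance via callbacks=[DeltaLossLogger()] when constructing Word2Vec, as in the question's code.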

Upvotes: 2
