Using weight from a Gensim Word2Vec model as a starting point of another model

Question

I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.

I wonder how should I initialize the second model with the first model's embeddings to avoid as much as possible the effect of variance in co-occurrence estimates.

gojomo · Accepted Answer

At a naive & easy level, you can just load one existing model, and .train() on new data. But note if doing that:

Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only be from the initial survey
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.

Another approach that might be woth trying could be to train single model over the combined corpus - but transform/repeat the era-specific texts/words in certain ways to be able to distinguish earlier-usages from later-usages. There are more details about this suggestion in the context of word-vectors varying over usage-eras in a couple previous answers:

https://stackoverflow.com/a/57400356/130288

https://stackoverflow.com/a/59095246/130288

Using weight from a Gensim Word2Vec model as a starting point of another model

Answers (1)

Related Questions