Reputation: 645
I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how should I initialize the second model with the first model's embeddings to avoid as much as possible the effect of variance in co-occurrence estimates.
Upvotes: 0
Views: 385
Reputation: 54153
At a naive & easy level, you can just load one existing model, and .train()
on new data. But note if doing that:
epochs
setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.Another approach that might be woth trying could be to train single model over the combined corpus - but transform/repeat the era-specific texts/words in certain ways to be able to distinguish earlier-usages from later-usages. There are more details about this suggestion in the context of word-vectors varying over usage-eras in a couple previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
Upvotes: 1