How to assign weights to articles in the corpus for generating word embeddings (e.g. word2vec)?

Question

There are certain articles in the corpus that I find much more important than the others (for instance, I like their wording more). As a result, I would like to increase their "weights" in the corpus when generating word vectors. Is there a way to implement this? The only solution I can think of is to copy the more important articles multiple times and add the copies to the corpus. Will this work for the word embedding process? And is there a better way to achieve this? Many thanks!

gojomo · Accepted Answer

The word2vec library with which I am most familiar, in gensim for Python, doesn't have a feature to overweight certain texts. However, your idea of simply repeating the more important texts should work.

Note though that:

- It'd probably work better if the repeated texts don't appear consecutively in your corpus: spread out the duplicated contexts so that they're encountered interleaved with other, diverse usage examples.
- The algorithm really benefits from diverse usage examples. Repeating the same rare examples 10 times is nowhere near as good as 10 naturally, subtly contrasting usages for inducing the kinds of continuous gradations-of-meaning that people want from word2vec.
- You should be sure to test your overweighting strategy, with a quantitative quality score related to your end purpose, to be sure it's helping as you hope. It might be extra code/training effort for negligible benefit, or even harm some word vectors' quality.
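
As a rough illustration of the repeat-and-interleave idea, here is a minimal sketch. The helper names `build_weighted_corpus` and `train_word2vec`, the sample texts, and the integer weights are all hypothetical, made up for this example. The shuffle only makes consecutive copies less likely; it does not guarantee they are separated. The training call assumes gensim 4.x, where the dimensionality parameter is named `vector_size`.

```python
import random


def build_weighted_corpus(texts, weights, seed=0):
    """Repeat each tokenized text weights[i] times, then shuffle the result.

    texts   -- list of token lists (one per article or sentence)
    weights -- list of positive integers, one repeat count per text
    """
    if len(texts) != len(weights):
        raise ValueError("texts and weights must have the same length")
    corpus = []
    for tokens, weight in zip(texts, weights):
        corpus.extend([tokens] * weight)
    # Shuffle so copies of the same text are mixed in with other texts
    # instead of sitting next to each other (not a strict guarantee).
    random.Random(seed).shuffle(corpus)
    return corpus


def train_word2vec(corpus):
    """Train a model on the weighted corpus (assumes gensim 4.x is installed)."""
    from gensim.models import Word2Vec  # imported lazily; optional dependency
    return Word2Vec(sentences=corpus, vector_size=100, window=5,
                    min_count=1, workers=1, seed=0)


# Hypothetical toy data: the first text is treated as "more important".
texts = [
    ["word", "vectors", "capture", "usage"],
    ["the", "corpus", "contains", "many", "articles"],
    ["diverse", "examples", "help", "training"],
]
weights = [3, 1, 1]

corpus = build_weighted_corpus(texts, weights)
```

Passing `corpus` to `train_word2vec` would then train on the overweighted data. Training a second model on the unweighted `texts` and comparing both with a score tied to your end task is one way to check whether the repetition actually helps.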
