John Stone

Reputation: 53

How to assign weights to articles in the corpus when generating word embeddings (e.g. word2vec)?

There are certain articles in the corpus that I find much more important than the others (for instance, I like their wording more). As a result, I would like to increase their "weights" within the corpus when generating word vectors. Is there a way to implement this? The only solution I can think of is to copy the more important articles multiple times and add the copies to the corpus. Will this work for the word embedding process? And is there a better way to achieve this? Many thanks!

Upvotes: 1

Views: 494

Answers (1)

gojomo

Reputation: 54173

The word2vec implementation with which I am most familiar, the one in gensim for Python, doesn't have a feature to overweight certain texts. However, your idea of simply repeating the more important texts should work.

Note though that:

  • it'd probably work better if the repeated texts don't appear consecutively in your corpus: spread out the duplicates so that they're encountered interleaved with other diverse usage examples

  • the algorithm really benefits from diverse usage examples – repeating the same rare examples 10 times is nowhere near as good as 10 naturally-subtly-contrasting usages, to induce the kinds of continuous gradations-of-meaning that people want from word2vec

  • you should be sure to test your overweighting strategy, with a quantitative quality score related to your end purpose, to be sure it's helping as you hope. It might be extra code/training-effort for negligible benefit, or even harm some word vectors' quality.
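
To illustrate the repeat-and-interleave idea from the notes above, here is a minimal sketch in plain Python. The helper name `build_weighted_corpus`, the integer `weights` list, and the `seed` parameter are all hypothetical choices made for this example, not part of gensim. It assumes each article is already tokenized into a list of words.

```python
import random


def build_weighted_corpus(docs, weights, seed=0):
    """Repeat docs[i] weights[i] times, then shuffle the result.

    docs    -- list of tokenized articles (each a list of str)
    weights -- list of non-negative ints, one per article (hypothetical scheme)
    seed    -- fixed seed so the shuffled order is reproducible
    """
    if len(docs) != len(weights):
        raise ValueError("docs and weights must have the same length")

    corpus = []
    for tokens, weight in zip(docs, weights):
        # Each copy refers to the same token list; nothing is mutated later.
        corpus.extend([tokens] * weight)

    # Shuffling spreads the copies out among the other texts. It makes
    # back-to-back duplicates unlikely in a large corpus, but does not
    # strictly guarantee that no two copies end up adjacent.
    random.Random(seed).shuffle(corpus)
    return corpus


if __name__ == "__main__":
    articles = [
        ["the", "quick", "brown", "fox"],   # article to overweight
        ["a", "lazy", "dog", "sleeps"],
        ["foxes", "and", "dogs", "differ"],
    ]
    weighted = build_weighted_corpus(articles, weights=[3, 1, 1])
    print(len(weighted), "texts in the weighted corpus")
```

The resulting list of token lists can then be handed to whatever trainer you use; with gensim 4.x that would be along the lines of `Word2Vec(sentences=weighted, vector_size=100, min_count=1)`. Whether a weight of 3 (or any other value) actually helps is something only your own evaluation can tell you.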

Upvotes: 1
