Reputation: 767
I have a database containing about 2.8 million texts (more precisely tweets, so they are short texts). After cleaning them (removing hashtags, mentions, stop words, ...), I put the tweets into a list of lists of tokens called sentences, so it contains one list of tokens per tweet.
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
This was just a first attempt; I would need some help setting the parameters (size, window, min_count, workers, sg) of Word2Vec in the most appropriate and consistent way.
Consider that my goal is to use model.most_similar(terms) (where terms is a list of words) to find, within the list of lists of tokens sentences, the words most similar to those contained in terms. The words in terms belong to the same topic, and I would like to see whether there are other words in the texts that could be related to that topic. Roughly, this is the kind of pipeline I have in mind, shown in the sketch below.
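The parameter values and the example topic words in this sketch are just placeholders (I am using the gensim 4 parameter names and the model.wv accessor):

from gensim.models import Word2Vec

# sentences: list of lists of tokens, one per tweet (as described above)
model = Word2Vec(
    sentences,
    vector_size=100,   # called 'size' in gensim < 4.0
    window=5,
    min_count=5,
    workers=4,
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

terms = ["price", "market", "stock"]         # placeholder topic words
terms = [t for t in terms if t in model.wv]  # keep only in-vocabulary words
print(model.wv.most_similar(positive=terms, topn=20))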
Upvotes: 1
Views: 1215
Reputation: 54173
Generally, the usual approach is:
Separately: the quality of word2vec results is almost always improved by discarding the very rarest words, such as those appearing only once. (The default value of min_count is 5 for good reason.)
The algorithm can't make good word-vectors from words that only appear once, or a few times. It needs multiple, contrasting examples of its usage. But, given the typical Zipfian distribution of word usages in a corpus, there are a lot of such rare words. Discarding them speeds training, shrinks the model, & eliminates what's essentially 'noise' from the training of other words - leaving those remaining word-vectors much better. (If you really need vectors for such words – gather more data.)
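As a rough sketch of how to pick a threshold: gensim's build_vocab() only scans and counts the corpus (no vectors are trained), so it is far cheaper than a full training run and lets you see how many words survive different min_count values. The threshold values here are just examples:

from gensim.models import Word2Vec

# Compare surviving vocabulary sizes at several min_count thresholds.
# build_vocab() only builds the word counts; no vectors are trained yet.
for mc in (1, 5, 10, 25):
    m = Word2Vec(min_count=mc)   # configure only, no corpus passed yet
    m.build_vocab(sentences)     # 'sentences' from the question
    print(f"min_count={mc}: {len(m.wv)} words retained")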
Upvotes: 2