Reputation: 767
I have a database containing about 2.8 million texts (more precisely tweets, so they are short texts). After cleaning them (removing hashtags, mentions, stop words, ...), I put the tweets into a list of lists of tokens called sentences, so it contains one list of tokens per tweet.
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
This was just a first attempt; I would need some help setting the parameters (size, window, min_count, workers, sg) of Word2Vec in the most appropriate and consistent way.
Consider that my goal is to use model.most_similar(terms) (where terms is a list of words) to find, within the list of lists of tokens sentences, the words most similar to those contained in terms. The words in terms belong to the same topic, and I would like to see whether there are other words in the texts that could be related to that topic. Roughly, this is the kind of pipeline I have in mind, shown in the sketch below.
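The parameter values and the example topic words in this sketch are just placeholders (I am using the gensim 4 parameter names and the model.wv accessor):

from gensim.models import Word2Vec

# sentences: list of lists of tokens, one per tweet (as described above)
model = Word2Vec(
    sentences,
    vector_size=100,   # called 'size' in gensim < 4.0
    window=5,
    min_count=5,
    workers=4,
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

terms = ["price", "market", "stock"]         # placeholder topic words
terms = [t for t in terms if t in model.wv]  # keep only in-vocabulary words
print(model.wv.most_similar(positive=terms, topn=20))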
Upvotes: 1
Views: 1215
Reputation: 54173
Generally, the usual approach is:
Separately: the quality of word2vec results is almost always improved by discarding the very rarest words, such as those appearing only once. (The default value of min_count is 5 for good reason.)
The algorithm can't make good word-vectors from words that only appear once, or a few times. It needs multiple, contrasting examples of its usage. But, given the typical Zipfian distribution of word usages in a corpus, there are a lot of such rare words. Discarding them speeds training, shrinks the model, & eliminates what's essentially 'noise' from the training of other words - leaving those remaining word-vectors much better. (If you really need vectors for such words – gather more data.)
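As a rough sketch of how to pick a threshold: gensim's build_vocab() only scans and counts the corpus (no vectors are trained), so it is far cheaper than a full training run and lets you see how many words survive different min_count values. The threshold values here are just examples:

from gensim.models import Word2Vec

# Compare surviving vocabulary sizes at several min_count thresholds.
# build_vocab() only builds the word counts; no vectors are trained yet.
for mc in (1, 5, 10, 25):
    m = Word2Vec(min_count=mc)   # configure only, no corpus passed yet
    m.build_vocab(sentences)     # 'sentences' from the question
    print(f"min_count={mc}: {len(m.wv)} words retained")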
Upvotes: 2