Reputation: 81
I ran a word2vec algorithm on a text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all extremely close to 1. The tenth-closest score is still around .998, so I feel like I'm not getting any meaningful differences in similarity between words, which leads to meaningless "most similar" results.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.
Upvotes: 0
Views: 76
Reputation: 12992
Based on my knowledge, I recommend the following:

- sg=0, to use the continuous bag-of-words model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model in the official paper was trained on about 1 billion words.
- min_count=5, which is the value used in the paper, and they had 1 billion words. I think 30 is way too high for your data.
- iter=10, for example.
- gensim.utils.simple_preprocess instead of word_tokenize, as the punctuation isn't helpful in this case.

When following these steps, your code should be:
>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess
>>>
>>> # split the raw text into sentences, then into lowercase tokens without punctuation
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define and train the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)
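After retraining, it's worth re-checking the similarity scores to confirm they now spread out below 1.0; 'apple' below is just a placeholder for any word in your vocabulary:

>>> # sanity check: the top-10 scores should no longer all cluster near 1.0
>>> model.wv.most_similar('apple', topn=10)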
Upvotes: 1
Reputation: 54213
Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
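For example, a minimal fix (assuming you do want to keep stripping NLTK's English stop words at all) would apply the filter to every tokenized sentence, not just the first:

import nltk
from nltk.corpus import stopwords

all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# filter stop words out of every sentence, not only all_words[0]
stop_set = set(stopwords.words('english'))
all_words = [[word for word in sent if word.lower() not in stop_set]
             for sent in all_words]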
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check whether those values are as expected?
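For example, the usual Python logging setup (the format string here is just one common choice) looks like:

import logging

# surface gensim's INFO-level reports: effective vocabulary size, training progress, etc.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)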
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)
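Putting those two notes together, one possible sketch (illustrative values only, keeping the older-gensim parameter names from your original call) would skip the stop-word stripping entirely and relax min_count:

from gensim.models import Word2Vec

# keep the stop words; the default sample=0.001 downsampling already discards
# most occurrences of very-frequent words. Use a gentler min_count for ~750k words.
model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=1,
                 sample=0.001, iter=10)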
Upvotes: 2