JTalbott

Reputation: 81

Word2Vec Vocab Similarities

I trained a word2vec model on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (from the model.wv.most_similar method) are all extremely close to 1. Even the tenth-closest score is still around .998, so I feel like I'm not getting any significant differences between the similarities of words, which leads to meaningless "similar" words.
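For reference, the check I'm running looks roughly like this (the query word here is just a placeholder):

print(model.wv.most_similar('market', topn=10))  # every score comes back around .998 or higher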

My constructor for the model is

model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)

I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:

all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.corpus.stopwords.words('english')]]

...where v is the result of calling read() on a txt file.
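Roughly, that step is just the following, with the filename being a placeholder:

with open('my_text.txt', encoding='utf-8') as f:
    v = f.read()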

Upvotes: 0

Views: 76

Answers (2)

Anwarvic

Reputation: 12992

Based on my knowledge, I recommend the following:

  • Use sg=0 to use the continuous bag-of-words (CBOW) model instead of the skip-gram model. CBOW works better on smaller datasets; the skip-gram model in the original paper was trained on about 1 billion words.
  • Use min_count=5, which is the value used in the paper (again, on roughly 1 billion words). I think 30 is way too high for your data.
  • Don't remove the stop words, as doing so changes the neighboring words inside the moving window.
  • Use more iterations, e.g. iter=10.
  • Use gensim.utils.simple_preprocess instead of word_tokenize, since the punctuation isn't helpful in this case.
  • Also, I recommend splitting your dataset into paragraphs instead of sentences, but I don't know whether that is applicable to your dataset or not.

Following these steps, your code should look something like this:

>>> import nltk
>>> from gensim.models import Word2Vec
>>> from gensim.utils import simple_preprocess

>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define and train the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

Upvotes: 1

gojomo

Reputation: 54213

Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
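For example, a quick sanity check right before training could look something like this (all_words being exactly what you pass to Word2Vec):

print(len(all_words))                          # number of tokenized sentences
print(sum(len(sent) for sent in all_words))    # total token count
print(all_words[:2])                           # spot-check the first couple of entries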

Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
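For example, the usual way to get that output is to run something like this before training:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

With that in place, gensim will report the surviving vocabulary size and per-epoch training progress.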

Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling (controlled by the sample parameter) already causes very frequent words like stopwords to be skipped much of the time.
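As a rough sketch, that just means leaving the stopwords in your sentences and letting downsampling handle them; only the added sample value is new here (0.001 is the default, and the smaller value below is purely illustrative):

model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1, sample=1e-4)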

(Also, min_count=30 is fairly aggressive for a smallish corpus.)

Upvotes: 2
