Reputation: 944
Hello Community Members,
At present, I am implementing the Word2Vec algorithm.
First, I extracted the data (sentences), split the sentences into tokens (words), removed the punctuation marks, and stored the tokens in a single list, so the list basically contains the words. Then I calculated the frequency of the words and computed their occurrences, which also results in a list.
Next, I am trying to train and load the model using gensim. However, I am facing a problem: the word is not in the vocabulary.
The code snippet I have tried is as follows.
import nltk, re, gensim
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import gutenberg, stopwords
def preprocessing():
    raw_data = (gutenberg.raw('shakespeare-hamlet.txt'))
    tokens = word_tokenize(raw_data)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    global words
    words = [word for word in stripped if word.isalpha()]
    sw = (stopwords.words('english'))
    sw1 = (['.', ',', '"', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
    sw2 = (['for', 'on', 'ed', 'es', 'ing', 'of', 'd', 'is', 'has', 'have', 'been', 'had', 'was', 'are', 'were', 'a', 'an', 'the', 't', 's', 'than', 'that', 'it', '&', 'and', 'where', 'there', 'he', 'she', 'i', 'and', 'with', 'it', 'to', 'shall', 'why', 'ham'])
    stop = sw + sw1 + sw2
    words = [w for w in words if not w in stop]

preprocessing()

def freq_count():
    fd = nltk.FreqDist(words)
    print(fd.most_common())

freq_count()

def word_embedding():
    for i in range(len(words)):
        model = Word2Vec(words, size = 100, sg = 1, window = 3, min_count = 1, iter = 10, workers = 4)
        model.init_sims(replace = True)
        model.save('word2vec_model')
        model = Word2Vec.load('word2vec_model')
        similarities = model.wv.most_similar('hamlet')
        for word, score in similarities:
            print(word, score)

word_embedding()
Note: I am using Python 3.7 on Windows. The gensim documentation suggests using sentences that are split into tokens to build and train the model. My question is: how do I apply the same approach to a corpus that is a single list containing only words? I have also tried passing the words wrapped in a list, i.e. [words], while training the model.
Upvotes: 1
Views: 1342
Reputation: 54243
Madhan Varadhodiyil's answer has identified your main problem: passing a list-of-words where Word2Vec expects a sequence-of-sentences (such as a list-of-list-of-words). As a result, each word is seen as a sentence, and then each letter is seen as one word of a sentence – so your resulting model probably has just a few dozen single-character 'words'.
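To make the shape difference concrete, here is a tiny illustrative example (toy data, not your corpus):

wrong = ['hamlet', 'horatio', 'denmark']          # list-of-words: each string is read as a "sentence" of characters
right = [['hamlet', 'prince', 'of', 'denmark'],   # list-of-list-of-words: what Word2Vec expects
         ['horatio', 'is', 'his', 'friend']]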
If you enabled logging at the INFO level, and watched the output – always good ideas when trying to understand a process or debug a problem – you may have noticed the reported counts of sentences/words as being off.
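For example, gensim's progress reports become visible with standard Python logging, along these lines:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The INFO-level messages emitted while the vocabulary is built report the collected sentence and word counts, which would have shown the mismatch.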
Additionally:
'Hamlet' has about 30,000 words – but gensim Word2Vec's optimized code has an implementation limit of 10,000 words per text example (sentence) – so passing the full text in as if it were a single text will cause about 2/3 of it to be silently ignored. Pass it as a series of shorter texts (such as sentences, paragraphs, or even scenes/acts) instead.
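A rough sketch of that idea, reusing the names from your snippet (and its older-gensim parameter names size/iter):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import gutenberg
from gensim.models import Word2Vec

raw_data = gutenberg.raw('shakespeare-hamlet.txt')
# one tokenized sentence per entry: a list-of-list-of-words, each far below the 10,000-token cap
sentences = [word_tokenize(sent.lower()) for sent in sent_tokenize(raw_data)]
model = Word2Vec(sentences, size=100, sg=1, window=3, min_count=1, iter=10, workers=4)

(NLTK's gutenberg.sents('shakespeare-hamlet.txt') would also give you a ready-made list of tokenized sentences.)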
30,000 words is very, very small for good word-vectors, which are typically based on millions to billions of words' worth of usage examples. When working with a small corpus, sometimes more training passes than the default epochs=5 can help, and sometimes shrinking the dimensionality of the vectors below the default vector_size=100 can help, but you won't be getting the full value of the algorithm, which really depends on large, diverse text examples to achieve meaningful arrangements of words.
Usually words with just one or a few usage examples can't get good vectors from those few (not-necessarily-representative) examples, and furthermore the large number of such words acts as noise/interference in the training of other words (that could get good word-vectors). So setting min_count=1 usually results in worse word-vectors, for both rare and frequent words, on task-specific measures of quality, than the default of discarding rare words entirely.
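Putting the last two points together, a hypothetical configuration for a corpus this small might look like the following (parameter names as in gensim 4.x; in the 3.x release your snippet uses they are size and iter):

from gensim.models import Word2Vec
from nltk.corpus import gutenberg

sentences = gutenberg.sents('shakespeare-hamlet.txt')  # already a list of tokenized sentences
model = Word2Vec(
    sentences,
    vector_size=50,   # smaller than the default 100, often better for tiny corpora
    epochs=20,        # more passes than the default 5
    min_count=5,      # keep the default filtering of rare words rather than min_count=1
    sg=1,
    window=3,
    workers=4,
)

These are starting points to experiment with, not guaranteed improvements.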
Upvotes: 2
Reputation: 2116
The first parameter passed to Word2Vec is expected to be a list of sentences; you're passing a list of words.
import nltk
import re
import gensim
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import gutenberg, stopwords

def preprocessing():
    raw_data = (gutenberg.raw('shakespeare-hamlet.txt'))
    tokens = word_tokenize(raw_data)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    global words
    words = [word for word in stripped if word.isalpha()]
    sw = (stopwords.words('english'))
    sw1 = (['.', ',', '"', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
    sw2 = (['for', 'on', 'ed', 'es', 'ing', 'of', 'd', 'is', 'has', 'have', 'been', 'had', 'was', 'are', 'were', 'a', 'an', 'the', 't',
            's', 'than', 'that', 'it', '&', 'and', 'where', 'there', 'he', 'she', 'i', 'and', 'with', 'it', 'to', 'shall', 'why', 'ham'])
    stop = sw + sw1 + sw2
    words = [w for w in words if not w in stop]

preprocessing()

def freq_count():
    fd = nltk.FreqDist(words)
    print(fd.most_common())

freq_count()

def word_embedding():
    # train once on the whole corpus; a retraining loop is not needed
    print(type(words))
    # pass words wrapped in a list, i.e. as a list of sentences
    model = Word2Vec([words], size=100, sg=1, window=3,
                     min_count=1, iter=10, workers=4)
    model.init_sims(replace=True)
    model.save('word2vec_model')
    model = Word2Vec.load('word2vec_model')
    similarities = model.wv.most_similar('hamlet')
    for word, score in similarities:
        print(word, score)

word_embedding()
Hope this helps :)
Upvotes: 2