Reputation: 1690
I am trying to run word2vec (skip-gram model implemented in gensim with a default window size of 5) on a corpus of .txt files. The iterator that I use looks something like this:
import os
import nltk
from nltk.tokenize import TreebankWordTokenizer

class Corpus(object):
    """Iterator for feeding sentences to word2vec"""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        word_tokenizer = TreebankWordTokenizer()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        # walk the directory tree and yield one tokenized sentence at a time
        for root, dirs, files in os.walk(self.dirname):
            for file in files:
                if file.endswith(".txt"):
                    file_path = os.path.join(root, file)
                    with open(file_path, 'r') as f:
                        text = f.read().decode('utf-8')
                        sentences = sent_tokenizer.tokenize(text)
                        for sent in sentences:
                            yield word_tokenizer.tokenize(sent)
Here I use the punkt tokenizer (which uses an unsupervised algorithm to detect sentence boundaries) from the nltk package to split the text into sentences. However, when I replace it with a simple line.split(), i.e. just treating each line as one sentence and splitting it into words, the iteration runs about 1.5 times faster than with the nltk tokenizer. The code inside the 'with open' then looks something like this:
with open(file_path, 'r') as f:
    for line in f:
        line = line.decode('utf-8')  # decode first, then split on whitespace
        yield line.split()
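For reference, either iterator gets passed straight to gensim; the training call looks roughly like this (sg=1 selects skip-gram, window=5 is the default, and the corpus path is just a placeholder):

from gensim.models import Word2Vec

# Rough sketch of the training call: sg=1 selects the skip-gram model,
# window=5 is gensim's default context window. Path is a placeholder.
sentences = Corpus('/path/to/txt/files')
model = Word2Vec(sentences, sg=1, window=5, min_count=5, workers=4)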
My question is: how important is it for the word2vec algorithm to be fed actual sentences (which is what I attempt with the punkt tokenizer)? Is it sufficient for each word to receive as context the surrounding words on its line (which may not form a complete sentence when a sentence spans several lines), as opposed to the context it would have within the full sentence? Also, what part does the window size play in this? When the window size is set to 5, for example, does the length of the sentences yielded by the Corpus iterator cease to matter? Will the window size alone decide the context words then? In that case, should I just use line.split() instead of trying to detect actual sentence boundaries with the punkt tokenizer?
I hope I have described the issue sufficiently; I would really appreciate any opinions, pointers, or help regarding this.
Upvotes: 3
Views: 3723
Reputation: 420
window is just the size of the context window. If window is set to 5, then for the current word w up to 10 surrounding words (5 on each side) are taken as context words. According to the original word2vec code, a word is only trained on the context that is present within its sentence: if the window extends past the sentence boundary, the remaining context positions are simply ignored (an approximation).
For example, consider the sentence: I am a boy. If the current word is boy and window is 2, then there is no right context. In this case the code takes the average of the vectors for am and a and treats that as the context of boy (speaking in terms of the CBOW model of word2vec).
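Here is a toy sketch of that clipping behaviour (this is not gensim's actual implementation; context_words is a made-up helper for illustration):

# Toy illustration of how the context window is clipped at
# sentence boundaries; not gensim's actual code.
def context_words(sentence, window):
    for i, word in enumerate(sentence):
        left = sentence[max(0, i - window):i]   # up to `window` words on the left
        right = sentence[i + 1:i + 1 + window]  # up to `window` words on the right
        yield word, left + right

for word, context in context_words(['I', 'am', 'a', 'boy'], 2):
    print(word, context)
# For 'boy' the context is just ['am', 'a']: there is no right
# context, so only the two available words are used.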
As for your second doubt: I have used a text corpus without sentence boundaries and word2vec still does fine (I tested this on a Wikipedia corpus).
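If you do go the one-sentence-per-line route, note that gensim ships a LineSentence helper that does essentially the line.split() approach for you (the file path below is a placeholder):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence yields each line of the file as a whitespace-split
# list of tokens, i.e. the line.split() approach from the question.
sentences = LineSentence('/path/to/corpus.txt')  # placeholder path
model = Word2Vec(sentences, sg=1, window=5)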
Hope this resolves your queries.
Upvotes: 4