Reputation: 1690
I am trying to run word2vec (skip-gram model implemented in gensim with a default window size of 5) on a corpus of .txt files. The iterator that I use looks something like this:
import os
import nltk
from nltk.tokenize import TreebankWordTokenizer

class Corpus(object):
    """Iterator for feeding sentences to word2vec"""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        word_tokenizer = TreebankWordTokenizer()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        # walk the directory tree and yield one tokenized sentence at a time
        for root, dirs, files in os.walk(self.dirname):
            for file in files:
                if file.endswith(".txt"):
                    file_path = os.path.join(root, file)
                    with open(file_path, 'r') as f:
                        text = f.read().decode('utf-8')
                        sentences = sent_tokenizer.tokenize(text)
                        for sent in sentences:
                            yield word_tokenizer.tokenize(sent)
Here I use the punkt tokenizer (which uses an unsupervised algorithm to detect sentence boundaries) from the nltk package to split the text into sentences. However, when I replace it with a simple line.split(), i.e. just treating each line as one sentence and splitting it into words, the iteration runs about 1.5 times faster than with the nltk tokenizer. The code inside the 'with open' then looks something like this:
with open(file_path, 'r') as f:
    for line in f:
        line = line.decode('utf-8')  # decode first, then split on whitespace
        yield line.split()
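For reference, either iterator gets passed straight to gensim; the training call looks roughly like this (sg=1 selects skip-gram, window=5 is the default, and the corpus path is just a placeholder):

from gensim.models import Word2Vec

# Rough sketch of the training call: sg=1 selects the skip-gram model,
# window=5 is gensim's default context window. Path is a placeholder.
sentences = Corpus('/path/to/txt/files')
model = Word2Vec(sentences, sg=1, window=5, min_count=5, workers=4)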
My question is: how important is it for the word2vec algorithm to be fed actual sentences (which is what I attempt with the punkt tokenizer)? Is it sufficient for each word to receive as context the surrounding words on its line (which may not form a complete sentence when a sentence spans several lines), as opposed to the context it would have within the full sentence? Also, what part does the window size play in this? When the window size is set to 5, for example, does the length of the sentences yielded by the Corpus iterator cease to matter? Will the window size alone decide the context words then? In that case, should I just use line.split() instead of trying to detect actual sentence boundaries with the punkt tokenizer?
I hope I have described the issue sufficiently; I would really appreciate any opinions, pointers, or help regarding this.
Upvotes: 3
Views: 3723
Reputation: 420
window is just the size of the context window. If window is set to 5, then for the current word w up to 10 surrounding words (5 on each side) are taken as context words. According to the original word2vec code, a word is only trained on the context that is present within its sentence: if the window extends past the sentence boundary, the remaining context positions are simply ignored (an approximation).
For example, consider the sentence: I am a boy. If the current word is boy and window is 2, then there is no right context. In this case the code takes the average of the vectors for am and a and treats that as the context of boy (speaking in terms of the CBOW model of word2vec).
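Here is a toy sketch of that clipping behaviour (this is not gensim's actual implementation; context_words is a made-up helper for illustration):

# Toy illustration of how the context window is clipped at
# sentence boundaries; not gensim's actual code.
def context_words(sentence, window):
    for i, word in enumerate(sentence):
        left = sentence[max(0, i - window):i]   # up to `window` words on the left
        right = sentence[i + 1:i + 1 + window]  # up to `window` words on the right
        yield word, left + right

for word, context in context_words(['I', 'am', 'a', 'boy'], 2):
    print(word, context)
# For 'boy' the context is just ['am', 'a']: there is no right
# context, so only the two available words are used.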
As for your second doubt: I have used a text corpus without sentence boundaries and word2vec still does fine (I tested this on a Wikipedia corpus).
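If you do go the one-sentence-per-line route, note that gensim ships a LineSentence helper that does essentially the line.split() approach for you (the file path below is a placeholder):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence yields each line of the file as a whitespace-split
# list of tokens, i.e. the line.split() approach from the question.
sentences = LineSentence('/path/to/corpus.txt')  # placeholder path
model = Word2Vec(sentences, sg=1, window=5)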
Hope this resolves your queries.
Upvotes: 4