G. Ramistella

Reputation: 1407

Clarification on how word2vec `generate_batch()` works?

I have been trying to understand how it works so I can apply it to my own tests and dataset (I find the TensorFlow code on GitHub too complex and not very straightforward).

I will be using a skip-gram model. This is the code that I wrote. I'd like a non-cryptic explanation of what's going on and what I need to do to make this work.

def generate_batch(self):
    inputs = []
    labels = []
    # self.training_phrases looks like: ['I like that cat', '...', ...]
    for i, phrase in enumerate(self.training_phrases):
        # skip_gram_tokenize turns a sentence into a list of number pairs,
        # e.g. [[181, 152], [152, 165], [165, 208], [208, 41]]
        array_list = utils.skip_gram_tokenize(phrase)
        for array in array_list:
            inputs.append(array)  # I noticed this loop is redundant; I could just do inputs.extend(array_list)

    return inputs, labels

This is where I am right now. From the generate_batch() that TensorFlow provides on GitHub, I can see that it returns inputs, labels.

I assume that inputs is the array of skip grams, but what is labels? How do I generate them?

Also, I saw that it implements batch_size. How can I do that? (I assume I have to split the data into smaller pieces, but how does that work? Do I put the data into an array?)

Regarding batch_size, what happens if the batch size is 16, but the data offers only 130 inputs? Do I do 8 regular batches and then a minibatch of 2 inputs?

Upvotes: 2

Views: 498

Answers (1)

Vijay Mariappan

Reputation: 17191

For skip-gram you need to feed input-label pairs where the input is the current word and the label is one of its context words. The context words for each input word are the words that fall within a window around it in the text phrase.

Consider the following text phrase: "Here's looking at you kid". For a window of 3 and the current word at, you have two context words, looking and you. So the input-label pairs are {at, looking} and {at, you}, which you then convert into their number representations.

In the above code, the array_list example is given as [[181, 152], [152, 165], [165, 208], [208, 41]], which means the context of each current word is defined only by the next word, not by the previous one.
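
As a rough sketch (not the exact TensorFlow example code), generating pairs with a symmetric window could look like the following; word_to_id here is a hypothetical dict mapping each word to its integer id:

def skip_gram_pairs(phrase, word_to_id, window=1):
    # window=1 means one context word on each side of the current word,
    # i.e. the 3-word window from the example above.
    words = phrase.split()
    pairs = []  # each entry is [input_id, label_id]
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append([word_to_id[word], word_to_id[words[j]]])
    return pairs

# For "Here's looking at you kid", the current word "at" yields the pairs
# {at, looking} and {at, you} (as ids), alongside the pairs for the other words.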

The architecture looks something like below:

[Image: skip-gram network architecture]

Now that you have these pairs generated, feed them in batches and train on them. It's OK to have unevenly sized batches, but make sure that your loss is an average loss and not a summed loss.
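
For the batching question, here is a minimal sketch, assuming pairs is the list of [input_id, label_id] entries generated above:

import numpy as np

def get_batches(pairs, batch_size=16):
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]  # the last batch may be smaller
        inputs = np.array([p[0] for p in batch])
        labels = np.array([[p[1]] for p in batch])  # shape (batch, 1), as tf.nn.nce_loss expects
        yield inputs, labels

# With 130 pairs and batch_size=16 you get 8 batches of 16 and a final batch of 2.
# Because the loss is averaged over the batch (e.g. tf.reduce_mean of the per-example
# loss), the smaller final batch does not skew the updates.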

Upvotes: 1
