blue-sky

Reputation: 53806

What impact does vocabulary_size have on word2vec tensorflow implementation?

I've followed the steps in this guide to generate a vector representation of words.

Now I'm running word2vec on a custom dataset of 45'000 words.

To do this, I modified word2vec_basic.py to use my own dataset by changing https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L57 to words = read_data('mytextfile.zip')

I encountered an issue similar to https://github.com/tensorflow/tensorflow/issues/2777, so I reduced vocabulary_size to 200. It now runs, but the results do not appear to capture the context. For example, here is a sample output:

Nearest to Leave: Employee, it, •, due, You, appeal, Employees, which,

What can I infer from this output? Will increasing or decreasing vocabulary_size improve the results?

I'm using Python 3, so I run the script with python3 word2vec_basic2.py

Upvotes: 1

Views: 794

Answers (1)

Lerner Zhang

Reputation: 7130

If vocabulary_size is too small, most of the words will be marked as UNK:

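  # Excerpt from build_dataset() in word2vec_basic.py
  data = []
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1  # word is not among the most frequent words kept in the dictionary
    data.append(index)
  count[0][1] = unk_count  # record how many tokens were mapped to UNK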

Consequently, every incoming word that is not in the dictionary is treated identically (indexed as 0). Increasing the vocabulary size should therefore improve the results. It does not matter whether you use Python 2 or Python 3.
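To see how quickly a small vocabulary_size turns tokens into UNK, here is a minimal sketch modelled on the dictionary-building step of the tutorial. The toy corpus and the helper name unk_fraction are assumptions made up for illustration; they are not part of word2vec_basic.py.

```python
import collections


def build_dataset(words, vocabulary_size):
    """Map words to integer ids; anything outside the most frequent
    (vocabulary_size - 1) words gets id 0, which stands for UNK."""
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)  # number of tokens mapped to UNK
    return data, count, dictionary


def unk_fraction(words, vocabulary_size):
    """Hypothetical helper: share of tokens that end up as UNK."""
    data, _, _ = build_dataset(words, vocabulary_size)
    return data.count(0) / len(data)


# Toy corpus (an assumption, standing in for the real text file).
words = "the employee may appeal the decision and the employee may take leave".split()

for size in (3, 6, 20):
    print(size, round(unk_fraction(words, size), 2))
```

Running the same check on your own corpus with vocabulary_size = 200 should show what share of the tokens is being collapsed into index 0.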

For illustration, suppose the first batch contains 128 input words, but 120 of them are marked as unknown (all with the same index 0), and the targets also contain far fewer than 128 distinct words. What happens? The model is trained to predict pairs such as UNK from UNK and "you" from UNK, which would have been "you" from "are" and "you" from "?" with a larger vocabulary size. Most of the information in the input distribution is lost.
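The collapse of training pairs can be sketched as follows. This is a simplified illustration, not the tutorial's generate_batch: it enumerates every (input, target) pair within a fixed window instead of sampling, and the function names encode and skipgram_pairs and the toy sentence are assumptions.

```python
import collections


def encode(words, vocabulary_size):
    """Replace every word outside the most frequent (vocabulary_size - 1)
    words with the placeholder token "UNK"."""
    kept = {w for w, _ in collections.Counter(words).most_common(vocabulary_size - 1)}
    return [w if w in kept else "UNK" for w in words]


def skipgram_pairs(tokens, window=1):
    """Enumerate all (input, target) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence (an assumption chosen to match the "you"/"are"/"?" example).
words = "how are you ? you are fine".split()

small = skipgram_pairs(encode(words, 2))   # only one real word survives
large = skipgram_pairs(encode(words, 10))  # every word survives

print("distinct pairs, small vocabulary:", sorted(set(small)))
print("distinct pairs, large vocabulary:", sorted(set(large)))
```

With the small vocabulary, nearly all pairs involve UNK on one or both sides, so the model has almost nothing distinctive to learn from; with the larger vocabulary, each pair carries real context.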

Upvotes: 1
