Reputation: 53806
I've performed the steps in this guide to generate a vector representation of words.
Now I'm running word2vec on a custom dataset of 45,000 words.
To use my own dataset I modified word2vec_basic.py, changing https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L57 to words = read_data('mytextfile.zip')
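For reference, this is roughly what that part of my script looks like now (a minimal sketch; I'm assuming the zip holds a single whitespace-separated UTF-8 text file, and the tutorial's version decodes with tf.compat.as_str rather than decode):

    import zipfile

    def read_data(filename):
        """Return the first file in the zip archive as a list of tokens."""
        with zipfile.ZipFile(filename) as f:
            data = f.read(f.namelist()[0]).decode('utf-8').split()
        return data

    # the line I changed at word2vec_basic.py#L57
    words = read_data('mytextfile.zip')
    print('Data size', len(words))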
I encountered an issue similar to https://github.com/tensorflow/tensorflow/issues/2777 and so reduced the vocabulary_size
to 200. It now runs, but the results do not appear to be capturing the context. For example, here is a sample output:
Nearest to Leave: Employee, it, •, due, You, appeal, Employees, which,
What can I infer from this output? Will increasing/decreasing vocabulary_size
improve results?
I'm using Python 3, so to run the script I use python3 word2vec_basic2.py
Upvotes: 1
Views: 794
Reputation: 7130
If the vocabulary_size is too small, most of the words will be marked as UNK:
unk_count = 0
for word in words:
    if word in dictionary:
        index = dictionary[word]
    else:
        index = 0  # dictionary['UNK']
        unk_count += 1
    data.append(index)
count[0][1] = unk_count
and hence every word that is not contained in the dictionary will be treated the same (indexed as 0). Increasing the vocabulary size will definitely improve the results. It matters little whether you use Python 2 or Python 3.
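As a rough sanity check (this is not part of word2vec_basic.py; unk_fraction is just a helper name I made up), you can estimate how much of your corpus collapses into UNK for a given vocabulary_size:

    import collections

    def unk_fraction(words, vocabulary_size):
        """Fraction of corpus tokens that fall outside the kept vocabulary."""
        counts = collections.Counter(words)
        # the tutorial keeps the (vocabulary_size - 1) most frequent words,
        # reserving one slot for UNK
        vocab = {w for w, _ in counts.most_common(vocabulary_size - 1)}
        unk = sum(1 for w in words if w not in vocab)
        return unk / len(words)

    # e.g. compare the 200-word vocabulary from the question with a larger one:
    # words = read_data('mytextfile.zip')
    # print(unk_fraction(words, 200), unk_fraction(words, 20000))

If that fraction is close to 1 for vocabulary_size = 200, the model is mostly being trained on UNK tokens.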
For illustration, suppose there are 128 input words in the first batch but 120 of them are marked as unknown (all sharing index 0), and the targets likewise contain far fewer than 128 distinct words. What happens? We end up predicting pairs such as UNK from UNK and "you" from UNK, which would have been "you" from "are" and "you" from "?" had you increased the vocabulary size. Most of the information in the sample of the input distribution is lost.
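Here is a toy sketch of that effect (a context window of 1, not the tutorial's generate_batch): with a 2-word vocabulary, most skip-gram pairs involve index 0 and carry no usable context.

    # map a tiny sentence to indices with vocabulary_size = 2 ('UNK' plus 'you')
    sentence = 'how are you today ?'.split()
    dictionary = {'UNK': 0, 'you': 1}
    data = [dictionary.get(w, 0) for w in sentence]   # -> [0, 0, 1, 0, 0]

    # all (input, target) pairs within a window of 1
    pairs = [(data[i], data[j])
             for i in range(len(data))
             for j in (i - 1, i + 1)
             if 0 <= j < len(data)]
    print(pairs)
    # half the pairs are (0, 0): "predict UNK from UNK" says nothing, and the
    # remaining pairs pair "you" with UNK instead of its real context words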
Upvotes: 1