Reputation: 85
model1 = Word2Vec(words_list_no_dupes, min_count=0, size=20, workers=3, window=3, sg=1)
print(model1)
print(len(model1.wv.vocab))
print(model.wv.vectors.shape)
output:
Word2Vec(vocab=58, size=20, alpha=0.025)
58
(31752, 20)
However, when I check the length of the list the model was trained on:
print(len(words_list_no_dupes))
output:
1906
What's causing this? The full code I used to remove duplicates from the list is here:
import nltk

# Tokenize each car name into a list of lowercase words
words = []
for r in range(0, len(df)):
    temp = []
    for word in nltk.tokenize.WhitespaceTokenizer().tokenize(df["CAR NAME"][r]):
        temp.append(word.lower())
    words.append(temp)

# Flatten the list of lists into a single list of words
words_flat_list = [item for sublist in words for item in sublist]

# Remove duplicates while preserving order
def remove_duplicates(x):
    return list(dict.fromkeys(x))

words_list_no_dupes = remove_duplicates(words_flat_list)
Upvotes: 1
Views: 2141
Reputation: 54163
The vocabulary size will be the number of unique tokens seen in the training corpus.
It won't have any necessary relationship with the length of the corpus in number of texts (len(words_list_no_dupes)) – because each text should itself have many words, including many words repeated from other texts.
If your corpus is not like that – if each text is just one or two words, like a car name, and no words repeat from text to text – then your corpus is not good for word2vec training. Word2vec requires many examples of each word's usage, in contexts with varying mixtures of surrounding words.
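For illustration, here's a minimal sketch of a corpus shape that does work – the toy texts are made up, and it assumes the same gensim 3.x API (size=, .wv.vocab) as your snippet:

from gensim.models import Word2Vec

# A corpus in the shape Word2Vec expects: a sequence of texts,
# each text a list of string tokens, with words repeating across texts
toy_corpus = [
    ["fast", "red", "car"],
    ["slow", "red", "truck"],
    ["fast", "blue", "car"],
]

toy_model = Word2Vec(toy_corpus, min_count=0, size=20, window=3, sg=1)
print(len(toy_model.wv.vocab))  # 6 unique tokens, even though there are only 3 texts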
That said, your shown output is a little odd: len(model1.wv.vocab) should be the same size as model1.wv.vectors.shape[0] – but your output shows 58 then 31,752. Are you sure those are the values from your run? Note that your third print uses model.wv.vectors.shape, not model1.wv.vectors.shape – so the (31752, 20) may come from a different model object in your session.
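A quick consistency check, using the same gensim 3.x attributes as your snippet:

print(len(model1.wv.vocab))        # number of vocabulary entries
print(model1.wv.vectors.shape[0])  # number of vector rows – must match the line above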
Also, your 'full code I used to remove duplicates from the list' is a bit confusing in intent and effect. You could show, in your question, some examples of what's in the list at the beginning, and end, to perhaps reveal why it's not proper input for Word2Vec. For example, what are the first few items in words_flat_list?
print(words_flat_list[0:3])
Then, what are the first few items in words_list_no_dupes?
print(words_list_no_dupes[0:3])
Is that what you were expecting?
Is that words_list_no_dupes, which you're passing into Word2Vec, what it expects – a Python sequence where each item is a list of string tokens? (If it's anything else, you should expect weird results.)
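To make that concrete, here's a hedged sketch of the difference, with made-up car names and the same gensim 3.x API as above. A flat list of plain strings makes gensim iterate each string character by character, which would neatly explain a tiny vocabulary like 58 no matter how long the list is:

from gensim.models import Word2Vec

# What Word2Vec expects: a sequence of texts, each a list of string tokens
good_input = [["ford", "focus"], ["honda", "civic"]]

# A flat list of plain strings, like words_list_no_dupes appears to be:
# each string is itself iterable, so gensim treats every character as a
# "word" and the vocabulary collapses to the set of unique characters
bad_input = ["ford", "focus", "honda", "civic"]

print(len(Word2Vec(good_input, min_count=0, size=20).wv.vocab))  # 4 word tokens
print(len(Word2Vec(bad_input, min_count=0, size=20).wv.vocab))   # 12 unique characters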
Upvotes: 1