Reputation: 85
model1 = Word2Vec(words_list_no_dupes, min_count=0, size=20, workers=3, window=3, sg=1)
print(model1)
print(len(model1.wv.vocab))
print(model.wv.vectors.shape)
output:
Word2Vec(vocab=58, size=20, alpha=0.025)
58
(31752, 20)
However, when I check the length of the list the model was trained on:
print(len(words_list_no_dupes))
output:
1906
What's causing this? The full code I used to remove duplicates from the list is here:
import nltk

# Tokenize each car name into a list of lowercase words
words = []
for r in range(0, len(df)):
    temp = []
    for word in nltk.tokenize.WhitespaceTokenizer().tokenize(df["CAR NAME"][r]):
        temp.append(word.lower())
    words.append(temp)

# Flatten the list of lists into a single list of words
words_flat_list = [item for sublist in words for item in sublist]

# Remove duplicates while preserving order
def remove_duplicates(x):
    return list(dict.fromkeys(x))

words_list_no_dupes = remove_duplicates(words_flat_list)
Upvotes: 1
Views: 2141
Reputation: 54163
The vocabulary size will be the number of unique tokens seen in the training corpus.
It won't have any necessary relationship with the length of the corpus in number of texts (len(words_list_no_dupes)) – because each text should itself have many words, including many words repeated from other texts.
If your corpus is not like that – if each text is just one or two words, like a car name, and no words repeat from text to text – then your corpus is not good for word2vec training. Word2vec requires many examples of each word's usage, in contexts with varying mixtures of surrounding words.
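For illustration, here's a minimal sketch of a corpus shape that does work – the toy texts are made up, and it assumes the same gensim 3.x API (size=, .wv.vocab) as your snippet:

from gensim.models import Word2Vec

# A corpus in the shape Word2Vec expects: a sequence of texts,
# each text a list of string tokens, with words repeating across texts
toy_corpus = [
    ["fast", "red", "car"],
    ["slow", "red", "truck"],
    ["fast", "blue", "car"],
]

toy_model = Word2Vec(toy_corpus, min_count=0, size=20, window=3, sg=1)
print(len(toy_model.wv.vocab))  # 6 unique tokens, even though there are only 3 texts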
That said, your shown output is a little odd: len(model1.wv.vocab) should be the same size as model1.wv.vectors.shape[0] – but your output shows 58 then 31,752. Are you sure those are the values from your run? Note that your third print uses model.wv.vectors.shape, not model1.wv.vectors.shape – so the (31752, 20) may come from a different model object in your session.
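A quick consistency check, using the same gensim 3.x attributes as your snippet:

print(len(model1.wv.vocab))        # number of vocabulary entries
print(model1.wv.vectors.shape[0])  # number of vector rows – must match the line above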
Also, your 'full code I used to remove duplicates from the list' is a bit confusing in intent and effect. You could show, in your question, some examples of what's in the list at the beginning, and end, to perhaps reveal why it's not proper input for Word2Vec. For example, what are the first few items in words_flat_list?
print(words_flat_list[0:3])
Then, what are the first few items in words_list_no_dupes?
print(words_list_no_dupes[0:3])
Is that what you were expecting?
Is that words_list_no_dupes, which you're passing into Word2Vec, what it expects – a Python sequence where each item is a list of string tokens? (If it's anything else, you should expect weird results.)
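To make that concrete, here's a hedged sketch of the difference, with made-up car names and the same gensim 3.x API as above. A flat list of plain strings makes gensim iterate each string character by character, which would neatly explain a tiny vocabulary like 58 no matter how long the list is:

from gensim.models import Word2Vec

# What Word2Vec expects: a sequence of texts, each a list of string tokens
good_input = [["ford", "focus"], ["honda", "civic"]]

# A flat list of plain strings, like words_list_no_dupes appears to be:
# each string is itself iterable, so gensim treats every character as a
# "word" and the vocabulary collapses to the set of unique characters
bad_input = ["ford", "focus", "honda", "civic"]

print(len(Word2Vec(good_input, min_count=0, size=20).wv.vocab))  # 4 word tokens
print(len(Word2Vec(bad_input, min_count=0, size=20).wv.vocab))   # 12 unique characters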
Upvotes: 1