Reputation: 43
I'm using the CoNLL-2003 dataset to generate word embeddings with Word2Vec and GloVe. The number of words returned by word2vecmodel.wv.vocab is much smaller than the number of entries in glove.dictionary. Here is the code:
Word2Vec:
word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output: len(w2vwords) = 4653
Glove:
from glove import Corpus, Glove

corpus = Corpus()
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example: result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German',
  'advice', 'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine',
  'whether', 'mad', 'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary',
  'committee', 'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy',
  'sheepmeat', 'countries', 'Britain', 'scientific', 'advice', 'clearer', '.']]
There are 13,517 sentences in total in the result list. Can someone please explain why the sets of words for which embeddings are created are so drastically different in size?
Upvotes: 0
Views: 820
Reputation: 54173
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim's Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
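You can verify this by counting token frequencies in your result list yourself. Here's a rough sketch using Python's standard-library Counter (the exact figures depend on your tokenization, but the number of tokens appearing at least 5 times should land near your 4653, while the total number of unique tokens should land near your 22833):

from collections import Counter

# Tally how often each token appears across all tokenized sentences in `result`
freq = Counter(token for sentence in result for token in sentence)

unique_tokens = len(freq)                                  # roughly what GloVe's dictionary keeps
frequent_tokens = sum(1 for c in freq.values() if c >= 5)  # roughly what Word2Vec keeps by default

print(unique_tokens, frequent_tokens)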
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples show only a few uses that may be idiosyncratic compared to what a larger sampling would reveal, and they can't be subtly balanced against the many other word representations in the way that works best.
Further, since typical word distributions contain many such low-frequency words, altogether they also tend to make the word-vectors for the other, more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other, more-important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, the best approach is to obtain more data so that those words are no longer rare.
You can also use a lower min_count, even as low as min_count=1, to retain all words. But discarding such rare words is often better for whatever end-purpose the word-vectors will be used for.
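For example, assuming you're on Gensim 3.x (which your use of word2vecmodel.wv.vocab suggests; in Gensim 4.x the size parameter is vector_size and the vocabulary lives in model.wv.key_to_index), something like this sketch keeps every token and should yield a vocabulary size much closer to your GloVe dictionary:

from gensim.models import Word2Vec

# min_count=1 keeps every token, however rare, instead of the default cutoff of 5
word2vecmodel_all = Word2Vec(result, size=100, window=5, sg=1, min_count=1)

print(len(word2vecmodel_all.wv.vocab))  # should be close to len(glove.dictionary)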
Upvotes: 1