vid1505

Reputation: 43

Word vocabularies generated by Word2Vec and GloVe models differ for the same corpus

I'm using the CONLL2003 dataset to generate word embeddings with Word2Vec and GloVe. The number of words returned by word2vecmodel.wv.vocab is much smaller than the number of words in glove.dictionary. Here is the code:

Word2Vec:

from gensim.models import Word2Vec

word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]   # matrix with one vector per vocabulary word
w2vwords = list(word2vecmodel.wv.vocab)     # the words that actually received vectors

Output: len(w2vwords) = 4653

GloVe:

from glove import Corpus
from glove import Glove

# Build the word dictionary and co-occurrence matrix from the tokenised sentences.
corpus = Corpus()
corpus.fit(result, window=5)

# Train GloVe on the co-occurrence matrix and attach the dictionary to the model.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Output: len(glove.dictionary) = 22833

The input is a list of tokenised sentences. For example, result[1:5] =

[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The',
  'European',
  'Commission',
  'said',
  'Thursday',
  'disagreed',
  'German',
  'advice',
  'consumers',
  'shun',
  'British',
  'lamb',
  'scientists',
  'determine',
  'whether',
  'mad',
  'cow',
  'disease',
  'transmitted',
  'sheep',
  '.'],
 ['Germany',
  "'s",
  'representative',
  'European',
  'Union',
  "'s",
  'veterinary',
  'committee',
  'Werner',
  'Zwingmann',
  'said',
  'Wednesday',
  'consumers',
  'buy',
  'sheepmeat',
  'countries',
  'Britain',
  'scientific',
  'advice',
  'clearer',
  '.']]

There are 13517 sentences in the result list in total. Can someone please explain why the lists of words for which embeddings are created differ so drastically in size?

Upvotes: 0

Views: 820

Answers (1)

gojomo

Reputation: 54173

You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.

Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
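As a rough check, here is a sketch that assumes result is the tokenised sentence list from the question: counting token frequencies directly should show the same split, with the total number of distinct tokens near len(glove.dictionary) and the number of tokens seen at least 5 times near len(word2vecmodel.wv.vocab).

from collections import Counter

# Count raw token frequencies in the corpus (assumes `result` from the question).
freqs = Counter(token for sentence in result for token in sentence)

print("distinct tokens:", len(freqs))                     # comparable to len(glove.dictionary)
print("tokens seen >= 5 times:",
      sum(1 for count in freqs.values() if count >= 5))   # comparable to len(word2vecmodel.wv.vocab)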

The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples show only a few uses that may be idiosyncratic compared to what a larger sampling would show, and they can't be subtly balanced against many other word representations in the manner that's best.

But further, given that typical word distributions contain many such low-frequency words, altogether they also tend to make the word-vectors for the other, more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other, more-important words. (At best, you can offset this effect a bit by using more training epochs.)

So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, the best approach is to obtain more data so that those words are no longer rare.

You can also use a lower min_count, as low as min_count=1 to retain all words. But discarding such rare words is often better for whatever end purpose the word-vectors will be used for.
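For example, a minimal sketch reusing the result corpus and the same gensim 3.x parameter names as in the question (the model_all_words name is just illustrative):

from gensim.models import Word2Vec

# Keep every token, however rare (min_count=1); the vocabulary size should now
# be much closer to len(glove.dictionary).
model_all_words = Word2Vec(result, size=100, window=5, sg=1, min_count=1)
print(len(model_all_words.wv.vocab))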

Upvotes: 1
