Word2Vec returning vectors for individual characters and not words

Question

For the following list:

words= ['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA','unimodal','7','regarding','random','59','intimating','COMPETITION','prospects','2K15','gather','Mega','SENSOR','NCTT','NETWORKING','orgainsed','acts']

I try to:

from gensim.models import Word2Vec
vec_model= Word2Vec(words, min_count=1, size=30)
vec_model['gather']

Which returns:

KeyError: "word 'gather' not in vocabulary"

But

vec_model['g']

Does return a vector, so I believe I'm getting vectors for the individual characters found in the list instead of vectors for the words in the list.

KRKirov · Accepted Answer

Word2Vec expects a list of lists as input: the outer list is the corpus, and each inner list is one document made up of individual words (tokens). Word2Vec iterates over every document and then over every token in it. In your example you passed a single flat list of strings, so Word2Vec treats each string as a document and iterates over its characters, treating each character as a token. As a result you have built a vocabulary of characters, not words. To build a vocabulary of words, pass a nested list to Word2Vec, as in the example below.

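from gensim.models import Word2Vec

# Each inner list is one document; its elements are the word tokens.
words = [['S.B.MILS', 'Siblings', 'DEVASTHALY', 'KOTRESHA'],
         ['unimodal', '7', 'regarding', 'random', '59', 'intimating'],
         ['COMPETITION', 'prospects', '2K15', 'gather', 'Mega'],
         ['SENSOR', 'NCTT', 'NETWORKING', 'orgainsed', 'acts']]

# min_count=1 keeps words that appear only once; size=30 gives 30-dimensional vectors.
vec_model = Word2Vec(words, min_count=1, size=30)
vec_model['gather']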

Output:

array([ 0.01106581,  0.00968017, -0.00090574,  0.01115612, -0.00766465,
       -0.01648632, -0.01455364,  0.01107104,  0.00769841,  0.01037362,
        0.01551551, -0.01188449,  0.01262331,  0.01608987,  0.01484082,
        0.00528397,  0.01613582,  0.00437328,  0.00372362,  0.00480989,
       -0.00299072, -0.00261444,  0.00282137, -0.01168992, -0.01402746,
       -0.01165612,  0.00088562,  0.01581018, -0.00671618, -0.00698833],
      dtype=float32)
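
To see why the flat list produces a character vocabulary, here is a minimal pure-Python sketch of the two nested loops described above. It is not gensim's actual implementation; build_vocab is a hypothetical helper written only for illustration, and the three-word list is a shortened stand-in for the list in the question.

```python
from collections import Counter

def build_vocab(corpus):
    """Hypothetical helper: count tokens with an outer loop over
    documents and an inner loop over the tokens of each document."""
    vocab = Counter()
    for document in corpus:
        for token in document:  # iterating over a str yields its characters
            vocab[token] += 1
    return vocab

flat = ['gather', 'Mega', 'acts']  # flat list: each string is taken as a document
nested = [flat]                    # nested list: one document holding three word tokens

flat_vocab = build_vocab(flat)
nested_vocab = build_vocab(nested)

print('gather' in flat_vocab)    # False: only single characters were counted
print('g' in flat_vocab)         # True
print('gather' in nested_vocab)  # True
```

If all you have is a flat list of words, wrapping it as [words] is probably the smallest fix, since it makes the whole list a single document. Note also that in gensim 4.0 and later the size argument is named vector_size, and vectors are looked up through vec_model.wv['gather'] rather than vec_model['gather']; the nesting requirement is the same.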
