Aswiderski

Reputation: 166

Extract word embeddings from word2vec

Good evening, I have a relatively simple question that comes mainly from my inexperience with Python. I would like to extract word embeddings for a list of words. Here I have created a simple list:

list_word = [['Word'],
 ['ant'],
 ['bear'],
 ['beaver'],
 ['bee'],
 ['bird']]

Then load gensim and other required libraries:

#import tweepy           # Obtain Tweets via API
import re               # Regular expressions
from gensim.models import Word2Vec    # Import gensim Word2Vec

Now when I use the Word2Vec function I run the following:

#extract embedding length 12
model = Word2Vec(list_word, min_count = 3, size = 12)
print(model)

When the model is run I then see that the vocab size is 1, when it should not be. The output is the following: Word2Vec(vocab=1, size=12, alpha=0.025)

I imagine that the imported data is not in the correct format, and I could use some advice or even example code on how to transform it into the correct format. Thank you for your help.

Upvotes: 0

Views: 1172

Answers (2)

Nimisha C P

Reputation: 1

In Gensim 4.0 and later, the size parameter was renamed to vector_size, so the dimensionality is set like this:

model = Word2Vec(list_word, min_count = 3, vector_size = 12)

Upvotes: 0

gojomo

Reputation: 54208

Your list_word, 6 sentences each with a single word, is insufficient to train Word2Vec, which requires a lot of varied, realistic text data. Among other problems:

  • words that only appear once will be ignored due to the min_count=3 setting (& it's not a good idea to lower that parameter)
  • single-word sentences have none of the nearby-words contexts the algorithm uses
  • getting good 'dense' vectors requires a vocabulary far larger than the vector-dimensionality, and many varied examples of each word's use with other words
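The first two points can be checked by hand. The sketch below is plain Python, not Gensim's actual implementation, and the helper names surviving_vocab and context_pairs are hypothetical. It counts which words would pass a min_count threshold and which (center, context) pairs a sliding window would produce for the list as posted in the question.

```python
from collections import Counter

list_word = [['Word'], ['ant'], ['bear'], ['beaver'], ['bee'], ['bird']]

def surviving_vocab(sentences, min_count):
    """Words whose total frequency reaches min_count (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentences, window):
    """(center, context) pairs within `window` positions (hypothetical helper)."""
    pairs = []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

print(surviving_vocab(list_word, min_count=3))  # set(): every word appears once
print(context_pairs(list_word, window=5))       # []: no sentence has a neighbor
```

Even with min_count lowered to 1, the second result stays empty, so there would be nothing for the algorithm to learn from.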

Try using a larger dataset, and you'll see more realistic results. Also, enabling Python logging at the INFO level will show a lot of progress as the code runs - and perhaps hint at issues, as you notice steps happening with or without reasonable counts & delays.
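Gensim reports its progress through Python's standard logging module, so INFO-level logging can be switched on before the model is built. A minimal sketch follows. The format string is just one reasonable choice, and force=True assumes Python 3.8 or later.

```python
import logging

# Send INFO-level messages (including Gensim's progress reports) to the console.
# force=True replaces any handlers that were configured earlier.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)

logging.getLogger(__name__).info("INFO logging is enabled")
```

Run this before calling Word2Vec(...), and the vocabulary-building and training steps should then print their word counts and timings.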

Upvotes: 2
