Reputation: 166
Good evening, I have a relatively simple question that primarily comes from my inexperience with Python. I would like to extract word embeddings for a list of words. Here I have created a simple list:
list_word = [['Word'],
['ant'],
['bear'],
['beaver'],
['bee'],
['bird']]
Then load gensim and other required libraries:
#import tweepy  # obtain Tweets via API
import re  # regular expressions
from gensim.models import Word2Vec  # import gensim's Word2Vec
Now when I use the Word2Vec function I run the following:
#extract embedding length 12
model = Word2Vec(list_word, min_count = 3, size = 12)
print(model)
When the model is trained I see that the vocab size is 1, when it should not be. The output is the following: Word2Vec(vocab=1, size=12, alpha=0.025)
I imagine that the imported data is not in the correct format, and I could use some advice or even example code on how to transform it into the correct format. Thank you for your help.
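For reference, gensim's Word2Vec expects an iterable of sentences, where each sentence is a list of string tokens. A minimal sketch of preparing such input (the sentences and the whitespace tokenizer here are made-up illustrations, not the asker's data):

```python
# Word2Vec expects an iterable of tokenized sentences:
# each sentence is a list of string tokens.
raw_sentences = [
    "the ant carried a leaf",
    "the bear caught a fish",
    "the beaver built a dam",
]

# A minimal whitespace tokenizer; real pipelines usually
# lowercase and strip punctuation as well.
tokenized = [s.lower().split() for s in raw_sentences]

print(tokenized[0])  # ['the', 'ant', 'carried', 'a', 'leaf']
```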
Upvotes: 0
Views: 1172
Reputation: 1
In model = Word2Vec(list_word, min_count = 3, size = 12): in gensim 4.0+ the size parameter was renamed, so you should use vector_size to set the dimensionality.
Upvotes: 0
Reputation: 54208
Your list_word, 6 sentences each with a single word, is insufficient to train Word2Vec, which requires a lot of varied, realistic text data. Among other problems: every word appears only once, so all of them are discarded by your min_count=3 setting (and it's not a good idea to lower that parameter).
Try using a larger dataset, and you'll see more realistic results. Also, enabling Python logging at the INFO level will show a lot of progress as the code runs - and perhaps hint at issues, as you notice steps happening with or without reasonable counts & delays.
Upvotes: 2