jonbon

Reputation: 1200

How to fetch vectors for a word list with Word2Vec?

I want to create a text file that is essentially a dictionary, with each word being paired with its vector representation through word2vec. I'm assuming the process would be to first train word2vec and then look up each word from my list and find its representation (and then save it in a new text file)?

I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.

Upvotes: 23

Views: 47481

Answers (9)

combokang

Reputation: 11

First train your Word2Vec model like you said.

To get the key-vector pairs for a list of words, you can use the .vectors_for_all method that Gensim now provides on KeyedVectors objects.

example:

words = ["apple", "machine", "learning"]
word_vectors = model.wv.vectors_for_all(words)

The result is also a KeyedVectors object. After getting the vectors you can do whatever you want.
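
For example, to write those pairs out as a plain text file (a minimal sketch; "my_words.txt" is just a placeholder name):

# save the selected key-vector pairs in the standard word2vec text format
word_vectors.save_word2vec_format("my_words.txt", binary=False)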

Upvotes: 1

Aminur Rahman Ashik

Reputation: 90

I would suggest this; you may find everything you need there, including Word2Vec, FastText, Doc2Vec, KeyedVectors, and so on.

Upvotes: 0

Homa

Reputation: 11

Gensim 4.0 update: the vocab attribute is deprecated, and the way to retrieve a word's vector has changed.

Get the ordered list of words in the vocabulary

words = list(model.wv.index_to_key)

Get the vector for 'also'

print(model.wv['also'])
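
Putting the two together, a minimal sketch of the requested word-to-vector dictionary under the Gensim 4.x API (assuming a trained model named model):

# Gensim 4.x: pair every word in the vocabulary with its vector
word_vectors = {word: model.wv[word] for word in model.wv.index_to_key}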

Upvotes: 1

Sunanda

Reputation: 474

If you are willing to use Python with the gensim package, then building upon this answer and the Gensim Word2Vec documentation you could do something like this:

from gensim.models import Word2Vec

# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]

# Initialise the model; for more information, please check the Gensim Word2Vec documentation
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)

# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()

# Make a dictionary
we_dict = {word:model.wv[word] for word in words}
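
To then dump we_dict into the text file the question describes (a rough sketch; "vectors.txt" is just a hypothetical filename):

with open("vectors.txt", "w") as f:
    for word, vec in we_dict.items():
        f.write(word + " " + " ".join(str(x) for x in vec) + "\n")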

Upvotes: 2

keramat

Reputation: 4543

For Gensim 4.0:

my_dict = {}
for word in word_list:
    my_dict[word] = model.wv.get_vector(word, norm=True)  # normalized vector for each word

Upvotes: 0

TrickOrTreat

Reputation: 911

Using basic Python:

all_vectors = []
for index, vector in enumerate(model.wv.vectors):
    vector_object = {}
    # wv.index2word is ordered to match the rows of wv.vectors (pre-4.0 API)
    vector_object[model.wv.index2word[index]] = vector
    all_vectors.append(vector_object)

Upvotes: 0

Wickkiey

Reputation: 4632

You can directly get the vectors through

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.wv.vectors

and words through

model.wv.vocab.keys()
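
To pair them up, the two can be zipped into a dictionary (a small sketch; wv.index2word, unlike wv.vocab.keys(), is guaranteed to be in the same order as the rows of wv.vectors):

we_dict = dict(zip(model.wv.index2word, model.wv.vectors))  # word -> vector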

Hope it helps!

Upvotes: 4

Moobie

Reputation: 1654

The direct access model[word] is deprecated and will be removed in Gensim 4.0.0 in order to separate the training and the embedding. The command should be replaced with, simply, model.wv[word].

Using Gensim in Python, once the vocabulary is built and the model is trained, you can find the word count and sampling information already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object.

Thus, to create a dictionary object, you may:

my_dict = dict({})
for idx, key in enumerate(model.wv.vocab):
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)

Now that you have your dictionary, you can write it to a file by whatever means you like. For example, you can use the pickle library. Alternatively, if you are using a Jupyter Notebook, there is a convenient 'magic command' %store my_dict > filename.txt. Your filename.txt will look like:

{'one': array([-0.06590105,  0.01573388,  0.00682817,  0.53970253, -0.20303348,
   -0.24792041,  0.08682659, -0.45504045,  0.89248925,  0.0655603 ,
   ......
   -0.8175681 ,  0.27659689,  0.22305458,  0.39095637,  0.43375066,
    0.36215973,  0.4040089 , -0.72396156,  0.3385369 , -0.600869  ],
  dtype=float32),
 'two': array([ 0.04694849,  0.13303463, -0.12208422,  0.02010536,  0.05969441,
   -0.04734801, -0.08465996,  0.10344813,  0.03990637,  0.07126121,
    ......
    0.31673026,  0.22282903, -0.18084198, -0.07555179,  0.22873943,
   -0.72985399, -0.05103955, -0.10911274, -0.27275378,  0.01439812],
  dtype=float32),
 'three': array([-0.21048863,  0.4945509 , -0.15050395, -0.29089224, -0.29454648,
    0.3420335 , -0.3419629 ,  0.87303966,  0.21656844, -0.07530259,
    ......
   -0.80034876,  0.02006451,  0.5299498 , -0.6286509 , -0.6182588 ,
   -1.0569025 ,  0.4557548 ,  0.4697938 ,  0.8928275 , -0.7877308 ],
  dtype=float32),
  'four': ......
}

You may also wish to look into the native save / load methods of Gensim's word2vec.
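
For instance, a brief sketch (the filenames are placeholders, and Word2Vec is assumed to be imported from gensim.models):

# save just the word vectors in the standard word2vec text format
model.wv.save_word2vec_format('vectors.txt', binary=False)

# or save and later reload the full model
model.save('w2v.model')
model = Word2Vec.load('w2v.model')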

Upvotes: 26

Nikita Astrakhantsev

Reputation: 4749

The Gensim tutorial explains it very clearly.

First, you should create a word2vec model, either by training it on text, e.g.

 model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

or by loading a pre-trained model (you can find them here, for example).

Then iterate over all your words and check for their vectors in the model:

for word in words:
  vector = model[word]

Having that, just write each word and its vector formatted however you want.
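
If some of your words may be missing from the model's vocabulary (for example because of min_count), a small sketch like this skips them instead of raising a KeyError ('output.txt' is just a placeholder name):

with open('output.txt', 'w') as f:
    for word in words:
        if word in model.wv:  # skip out-of-vocabulary words
            vector = model.wv[word]
            f.write(word + ' ' + ' '.join(str(x) for x in vector) + '\n')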

Upvotes: 12
