How to extract matrix together with vocab from gensim word2vec model

Question

I've trained a word2vec model like so:

import multiprocessing

from gensim.models import Word2Vec

# number of CPU cores (assumed; the original snippet did not define `cores`)
cores = multiprocessing.cpu_count()

# create the model without training it yet
model = Word2Vec(min_count=20,
                 window=5,
                 sample=6e-5,
                 negative=20,
                 workers=cores-1,
                 vector_size=300)

# build vocabulary
model.build_vocab(sentences, progress_per=10000)

# train model
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)

I'd like to export the model as a dataframe, but I'm not sure how to extract the matrix and vocab together correctly, with each label at the right index position.

Something like this:

label	V1	V2	V...
government	0.560774564	-0.0464625023	...
state	0.0106112240	0.0464625023	...
....	...	..	.

I've tried this:

tmp = pd.DataFrame(model.syn1neg)
tmp.insert(0, 'label', model.wv.index_to_key)

which does not line up when comparing:

>>> model.wv.get_index('government')
10
>>> tmp.loc[[0]]
0 government 0.329972  0.160003 -0.516633  ...  0.460873 -0.170273 -1.621128  1.255289

fiskdill · Accepted Answer

For anyone else looking for a solution to this with gensim 4.x.x, here's what I wound up doing:

import numpy as np
import pandas as pd

vocab, vectors = model.wv.key_to_index, model.wv.vectors

# pair each label with its row index in the vector matrix
# (np.array stores both as strings, hence the astype(int) below)
label_index = np.array([(word, idx) for word, idx in vocab.items()])

# init dataframe from the embedding vectors and set the index to the word labels
tmp = pd.DataFrame(vectors[label_index[:, 1].astype(int)])
tmp.index = label_index[:, 0]
tmp.to_csv("matrix_with_labels.csv")

I'm not sure this is the best or proper way, but it works.
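A possibly simpler variant, offered as a sketch rather than the definitive approach: in gensim 4.x, row i of model.wv.vectors should correspond to model.wv.index_to_key[i], so the labels and the matrix can be combined directly without going through key_to_index. Note also that model.syn1neg, used in the question, holds the output-layer weights for negative sampling, not the word vectors, which may explain part of the mismatch. The helper name embeddings_to_frame, the V1, V2, ... column names, and the toy data below are made up for illustration so that the snippet runs without a trained model.

```python
import numpy as np
import pandas as pd


def embeddings_to_frame(index_to_key, vectors):
    """Build a DataFrame with a 'label' column followed by V1..Vn columns.

    Assumes row i of `vectors` is the embedding of `index_to_key[i]`.
    (Hypothetical helper, not part of gensim.)
    """
    vectors = np.asarray(vectors)
    if len(index_to_key) != vectors.shape[0]:
        raise ValueError("need exactly one label per vector row")
    columns = [f"V{i + 1}" for i in range(vectors.shape[1])]
    frame = pd.DataFrame(vectors, columns=columns)
    frame.insert(0, "label", list(index_to_key))
    return frame


# Toy stand-ins for model.wv.index_to_key and model.wv.vectors (made-up values).
toy_keys = ["government", "state", "law"]
toy_vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

df = embeddings_to_frame(toy_keys, toy_vectors)
print(df)
```

With a trained model, the call would presumably be embeddings_to_frame(model.wv.index_to_key, model.wv.vectors), followed by df.to_csv("matrix_with_labels.csv", index=False) to write the file.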
