fiskdill

Reputation: 83

How to extract matrix together with vocab from gensim word2vec model

I've trained a word2vec model like so

import multiprocessing

from gensim.models import Word2Vec

# assumption: `cores` is the CPU count (it was not defined in the original snippet)
cores = multiprocessing.cpu_count()

# create model without initializing
model = Word2Vec(min_count=20,
                 window=5,
                 sample=6e-5,
                 negative=20,
                 workers=cores-1,
                 vector_size=300)

# build vocabulary
model.build_vocab(sentences, progress_per=10000)

# train model
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)

I'd like to export the model as a dataframe, but I'm not sure how to extract the matrix and the vocab together so that each row of vectors lines up with the right word.

Something like this:

label V1 V2 V...
government 0.560774564 -0.0464625023 ...
state 0.0106112240 0.0464625023 ...
.... ... .. .

I've tried this:

tmp = pd.DataFrame(model.syn1neg)
tmp.insert(0, 'label', model.wv.index_to_key)

but the rows do not line up with the vocabulary indices when I compare them:

>>> model.wv.get_index('government')
10
>>> tmp.loc[[0]]
0 government 0.329972  0.160003 -0.516633  ...  0.460873 -0.170273 -1.621128  1.255289

Upvotes: 0

Views: 782

Answers (1)

fiskdill

Reputation: 83

For anyone else looking for a solution to this with gensim 4.x.x, here's what I wound up doing:

import numpy as np
import pandas as pd

vocab, vectors = model.wv.key_to_index, model.wv.vectors

# pair each label with its row index in the vectors matrix
label_index = np.array([(key, idx) for key, idx in vocab.items()])

# init dataframe using embedding vectors and set index as the word label
tmp = pd.DataFrame(vectors[label_index[:, 1].astype(int)])
tmp.index = label_index[:, 0]
tmp.to_csv("matrix_with_labels.csv")

I'm not sure this is the best or proper way, but it works.
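
A possibly shorter route, offered as a sketch rather than the definitive way: in gensim 4.x, `model.wv.index_to_key` lists the keys in the same order as the rows of `model.wv.vectors`, so the two can be handed to pandas directly. The helper name `keyed_vectors_to_frame`, the `V1, V2, ...` column names, and the `fake_wv` stand-in below are all made up for illustration; the stand-in only mimics the two attributes used, so the snippet runs without a trained model.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def keyed_vectors_to_frame(wv):
    """Return a DataFrame with one row per key, indexed by that key.

    Assumes `wv` exposes `index_to_key` (list of keys) and `vectors`
    (2-D array) in the same row order, as gensim 4.x KeyedVectors does.
    """
    # Column names V1..Vn are an illustrative choice, matching the question's layout.
    columns = [f"V{i + 1}" for i in range(wv.vectors.shape[1])]
    frame = pd.DataFrame(wv.vectors, index=wv.index_to_key, columns=columns)
    frame.index.name = "label"
    return frame


# Hypothetical stand-in for `model.wv`; replace with the real `model.wv`.
fake_wv = SimpleNamespace(
    index_to_key=["government", "state", "law"],
    vectors=np.arange(12, dtype=np.float32).reshape(3, 4),
)

df = keyed_vectors_to_frame(fake_wv)
print(df)
```

With a trained model, the same helper would be called as `keyed_vectors_to_frame(model.wv).to_csv("matrix_with_labels.csv")`. Note that this reads `model.wv.vectors` (the word embeddings), not `model.syn1neg` (the output-layer weights used for negative sampling).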

Upvotes: 1
