Normalize vectors in gensim model

I have a pre-trained word embedding with vectors of different norms, and I want to normalize all vectors in the model. I am doing it with a for loop that iterates over each word and normalizes its vector, but the model is huge and this takes too much time. Does gensim include any way to do this faster? I cannot find it.

Thanks!!

Upvotes: 4

Views: 6003

Answers (1)

gojomo

Reputation: 54173

Gensim instances of KeyedVectors (the common interface of sets of word-vectors) contain a method init_sims(), which internally calculates unit-length normalized vectors using a native vector operation for speed.
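To illustrate why a native vector operation beats a per-word Python loop, here is a minimal NumPy sketch of normalizing every row of a matrix in one step. It is not gensim's actual implementation; the function name unit_normalize and the sample array are assumptions made up for this example.

```python
import numpy as np

def unit_normalize(vectors):
    """Return a copy of `vectors` with every row scaled to unit L2 norm.

    Hypothetical helper for illustration only; not part of gensim.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    # One norm per row, kept as a column so it broadcasts across each row.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid dividing by zero: all-zero rows are left unchanged.
    norms[norms == 0] = 1.0
    return vectors / norms

# Made-up sample data: three 2-dimensional "word vectors".
raw = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 1.0]])
normed = unit_normalize(raw)
row_norms = np.linalg.norm(normed, axis=1)  # each entry should be close to 1.0
```

The whole array is processed by a single compiled operation instead of one Python-level iteration per word, which is where the speedup over a for loop comes from.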

When certain operations that are usually conducted on unit-normalized vectors are attempted for the first time, init_sims() is called automatically, and the model caches the normalized vectors in a model property (vectors_norm), roughly doubling the RAM consumption.

Once it's been called, you can access normed vectors using the .word_vec() method:

normed_wv = kv_model.word_vec(word, use_norm=True)

If you're sure you won't need the raw, un-normed vectors, you can also call init_sims() yourself with its optional replace parameter. Then, the normed vectors will clobber the raw vectors in place, saving the extra RAM. For example:

kv_model.init_sims(replace=True)
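The RAM saving from replacing in place can be pictured with another small NumPy sketch. Again, this only illustrates the idea and is not gensim's code; unit_normalize_inplace and the sample data are hypothetical names invented for the example.

```python
import numpy as np

def unit_normalize_inplace(vectors):
    """Scale each row of `vectors` to unit L2 norm, overwriting the input.

    Hypothetical helper for illustration only; not part of gensim.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid dividing by zero: all-zero rows are left unchanged.
    norms[norms == 0] = 1.0
    # In-place division reuses the existing buffer instead of allocating
    # a second full-size array; the raw values are lost.
    vectors /= norms
    return vectors

# Made-up sample data: two 2-dimensional "word vectors".
raw = np.array([[3.0, 4.0], [6.0, 8.0]], dtype=np.float32)
result = unit_normalize_inplace(raw)
same_buffer = result is raw  # True: no second copy was created
```

After the call, raw itself holds the normalized values, which mirrors the trade-off described above: no extra memory, but the original magnitudes cannot be recovered.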

Note that while things like finding the nearest-neighbors of a word, as in the common most_similar() operation, traditionally use unit-normalized vectors, there are sometimes downstream applications where the raw vectors are useful. (Also, in a full Word2Vec model, if you're going to do additional incremental training, that should happen on raw vectors, not normalized vectors.)

Upvotes: 7
