Reputation: 691
I have a pre-trained word embedding whose vectors have differing norms, and I want to normalize every vector in the model. I am doing it with a for loop that iterates over each word and normalizes its vector, but the model is huge and this takes too much time. Does gensim include any way to do this faster? I cannot find one.
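For reference, this is roughly what my loop looks like (a sketch assuming the gensim 3.x API; 'vectors.bin' is just a placeholder path for my embedding file):
import numpy as np
from gensim.models import KeyedVectors

# 'vectors.bin' is a placeholder path for the pre-trained embedding
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# one Python-level pass per word -- very slow on a huge vocabulary
for i in range(kv.vectors.shape[0]):
    kv.vectors[i] = kv.vectors[i] / np.linalg.norm(kv.vectors[i])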
Thanks!!
Upvotes: 4
Views: 6003
Reputation: 54173
Gensim's KeyedVectors instances (the common interface for sets of word-vectors) include a method init_sims(), which internally calculates unit-length-normalized vectors using a native vector operation for speed.
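That native operation amounts to one bulk numpy computation over the whole vector matrix, rather than a per-word Python loop. A rough sketch of the idea (not gensim's exact code; in gensim 3.x, kv_model.vectors holds the raw word-vector matrix):
import numpy as np

# normalize every row of the stored matrix in one vectorized operation
norms = np.linalg.norm(kv_model.vectors, axis=1, keepdims=True)
vectors_norm = kv_model.vectors / norms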
When certain operations that are usually conducted on unit-normalized vectors are attempted for the first time, this init_sims() is called automatically, and the model caches the normalized vectors in a model property (vectors_norm), roughly doubling the RAM consumption.
Once it's been called, you can access the normed vectors using the .word_vec() method:
normed_wv = kv_model.word_vec(word, use_norm=True)
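For example ('king' here is just an assumed in-vocabulary token for illustration), the returned vector has unit length, while the raw vector stays available:
import numpy as np

normed = kv_model.word_vec('king', use_norm=True)
print(np.linalg.norm(normed))    # ~1.0
raw = kv_model.word_vec('king')  # the raw, un-normed vector is still intact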
If you're sure you won't need the raw, un-normed vectors, you can also call init_sims() yourself with its optional replace parameter. Then the normed vectors will clobber the raw vectors in place, saving the extra RAM. For example:
kv_model.init_sims(replace=True)
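After that call, the stored vectors themselves are unit-length, which you can sanity-check with numpy (a quick sketch):
import numpy as np

norms = np.linalg.norm(kv_model.vectors, axis=1)
print(norms.min(), norms.max())  # both should be ~1.0 after replace=True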
Note that while things like finding the nearest neighbors of a word, as in the common most_similar() operation, traditionally use unit-normalized vectors, there are sometimes downstream applications where the raw vectors are useful. (Also, in a full Word2Vec model, if you're going to do additional incremental training, that should happen on the raw vectors, not the normalized vectors.)
Upvotes: 7