Reputation: 11
I just trained a word2vec model and built a dictionary mapping each word (key) to its vector (value).
dictionary = dict({})
for idx, key in enumerate(model.wv.vocab):
    dictionary[key] = model.wv[key]
Then I tried to get the key for a given value with:
def get_key(val):
    for key, value in dictionary.items():
        if val == value:
            return key
    return "key doesn't exist"
But the result is "key doesn't exist". What is the best way to get the key from an np.array value?
Upvotes: 1
Views: 558
Reputation: 3855
As these values are floating point values, you may have issues trying to directly compare them for equality. You may have better success using the numpy.isclose function. This function takes at least two parameters, array-like a and array-like b, and returns a new array with either True or False in each index corresponding to whether the elements at the same index in a and b are close in value. You can then check if all of these values are close by using the numpy.all function, which takes at least one parameter, array-like a, and in the case of a 1D array, checks if all of the values are True and returns a bool. You could modify your get_key function like this:
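import numpy as np

def get_key(val):
    for key, value in dictionary.items():
        # True only if every element of val is close to the matching element of value
        if np.all(np.isclose(val, value)):
            return key
    return "key doesn't exist"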
This way, if all of the corresponding values in the two arrays are close, it will return the key, otherwise it will go on to the next key-value pair for comparison.
Note: This only works if the elements of the array that you pass in to your function are guaranteed to be in the same order as the array in the dictionary.
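To see the difference between exact and tolerant comparison in isolation, here is a minimal sketch. The arrays stored and typed are made-up values rather than output from a real model, and the float32 dtype is an assumption about how the vectors are stored.

```python
import numpy as np

# Hypothetical stored vector, kept as float32.
stored = np.array([0.1, 0.2, 0.3], dtype=np.float32)
# The same digits typed in by hand, which NumPy stores as float64.
typed = np.array([0.1, 0.2, 0.3])

exact = stored == typed            # element-wise exact comparison
close = np.isclose(stored, typed)  # element-wise comparison within a tolerance

print(exact)          # one bool per element
print(close)          # one bool per element
print(np.all(close))  # a single bool summarizing the whole array
```

The exact comparison fails on every element because float32 and float64 round these decimal values differently, while isclose accepts the tiny difference.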
Upvotes: 0
Reputation: 54173
In general, there's no reason to create your dictionary. It's a redundant, less-RAM-efficient way to store the same info that's already in your model.wv object. (That object will be some class related to the gensim KeyedVectors type.)
From the model.wv object, you already have lookup-by-key. And, you already have an efficient ordered array of all its values, in model.wv.vectors. And, model.wv.index2entity is a list that can map back from ordinal locations in that array to the related words.
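If a lookup by value is still wanted, the array-plus-list layout described above allows a vectorized search instead of a Python loop over a dictionary. The following is only a sketch: vectors and index_to_word are small hypothetical stand-ins for model.wv.vectors and model.wv.index2entity, and find_word is a helper name invented for this example.

```python
import numpy as np

# Hypothetical stand-ins for model.wv.vectors and model.wv.index2entity.
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=np.float32)
index_to_word = ["petani", "sawah", "padi"]

def find_word(target, vectors, index_to_word):
    """Return the word whose row matches target within tolerance, else None."""
    # Broadcast target against every row; a row matches if all its elements are close.
    matches = np.isclose(vectors, target).all(axis=1)
    hits = np.flatnonzero(matches)
    return index_to_word[hits[0]] if hits.size else None

print(find_word(vectors[1], vectors, index_to_word))
print(find_word(np.array([9.0, 9.0]), vectors, index_to_word))
```

The row index of the match doubles as the position in the word list, which is why no separate dictionary is needed.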
But also, it's very rare to need to look up exact vectors by value. The training process involves a lot of randomness, and doesn't necessarily have any 'correct' or 'ideal' ending point - just some relative-arrangement that's as-good-as-others for the target training goal, and about as-good-as-it-can-get.
Thus, the exact locations aren't important compared to the distances and relative directions to other vectors, and the least-significant digits in each dimension are nothing but noise.
So, the only way to expect to find an exact vector, in a set of trained word-vectors, is to have first requested that exact same vector from the set – then looking for exactly it.
I suspect if you did that, rather than typing your desired values into a list literal, your existing function might work. You'd be checking for the exact right coordinates, and array type, which could match – if that exact vector was in the set. For example, your function might work with input:
target_value = dictionary.get('petani')
get_key(target_value)
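A small self-contained sketch of that idea follows. The toy dictionary and its values are hypothetical, and get_key_exact is a name made up here. One caveat worth hedging: a bare == between two multi-element NumPy arrays yields an element-wise array that an if statement cannot reduce to a single truth value, so this sketch uses np.array_equal for the exact comparison.

```python
import numpy as np

# Hypothetical toy dictionary standing in for the word -> vector mapping.
dictionary = {
    "petani": np.array([0.12, -0.53, 0.98], dtype=np.float32),
    "sawah": np.array([0.77, 0.05, -0.31], dtype=np.float32),
}

def get_key_exact(val):
    """Return the key whose stored vector is exactly equal to val, else None."""
    for key, value in dictionary.items():
        # array_equal reduces the element-wise comparison to a single bool.
        if np.array_equal(val, value):
            return key
    return None

target_value = dictionary.get("petani")
print(get_key_exact(target_value))         # vector fetched from the set itself
print(get_key_exact([0.12, -0.53, 0.98]))  # same digits typed as a list literal
```

The fetched vector carries the exact stored float32 values, while the typed list is parsed as float64 and differs in the low-order digits.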
The far more-common operation on vector sets is to find the top-N items, in the set, close to some target location – with that location specified either as a key with a known vector-location, or a raw vector. The gensim KeyedVectors classes offer this, via a most_similar() method. (This is a bit more expensive, as it requires evaluating the cosine-similarity with all the vectors, then sorting and returning the most-similar matches. But that's done by fairly efficient native library array operations.)
For example, take a look at the results of either of these code fragments with your model:
model.wv.most_similar('petani')
...or...
target_vector = model.wv['petani']
model.wv.most_similar(positive=[target_vector,])
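To make the cosine-similarity ranking concrete, here is a rough NumPy sketch of the idea. It is not gensim's implementation: vectors and words are hypothetical stand-ins for the model's array and word list, and top_n_by_cosine is a name invented for this illustration.

```python
import numpy as np

# Hypothetical stand-ins for the model's vector array and its ordered word list.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
words = ["petani", "sawah", "kota", "laut"]

def top_n_by_cosine(target, vectors, words, topn=2):
    """Rank words by cosine similarity to target, highest first."""
    # Normalize rows and target to unit length so a dot product equals the cosine.
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_target = target / np.linalg.norm(target)
    sims = unit_rows @ unit_target
    # Sort by descending similarity and keep the first topn positions.
    order = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in order]

print(top_n_by_cosine(vectors[0], vectors, words))
```

Because the query here is a raw vector taken from the set, the closest result is that vector's own word, followed by its nearest neighbour.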
Upvotes: 1