Brian Ward

Reputation: 63

Combining vectors in Gensim Word2Vec vocabulary

Gensim's Word2Vec model has a useful method that finds the top n most similar words in the model's vocabulary, given a list of positive words and a list of negative words.

wv.most_similar(positive=['word1', 'word2', 'word3'], 
                negative=['word4','word5'], topn=10)

What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors. Something like this:

newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'

I know that vectors can be summed, but I am not sure that is the best option. I would like to find out exactly how the function above (most_similar) combines the positive and negative vectors, and whether Gensim has a function that does this. Thank you in advance.

Upvotes: 1

Views: 603

Answers (2)

Brian Ward

Reputation: 63

Following the advice in the other answer, I looked at the Gensim source code and copied its method for averaging the vectors. Here is the code in case it helps anyone else. Note: this code is copied from Gensim and only reformatted to return the averaged vector.

from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL

KEY_TYPES = (str, int, np.integer)

'''
FUNCTION : meanVector(...)
INPUT :
        keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
        positive : list of words or vectors to be applied positively [default = list()]
        negative : list of words or vectors to be applied negatively [default = list()]
OUTPUT : 
        averaged word vector, [type = numpy.ndarray]
DESCRIPTION :
        allows for simple averaging of positive and negative words and vectors given a gensim model's word vector library.
'''

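def meanVector(keyedVectors, positive=None, negative=None):
    # Avoid mutable default arguments; treat None as an empty list.
    positive = [] if positive is None else positive
    negative = [] if negative is None else negative

    # Give bare keys and raw vectors a default weight:
    # +1.0 for positive items, -1.0 for negative items.
    # Items that are already (item, weight) tuples are kept as-is.
    positive = [
            (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
            else item for item in positive
            ]
    negative = [
            (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
            else item for item in negative
            ]

    # collect the weighted vector for every input
    mean = []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            # raw vectors are used as given
            mean.append(weight * key)
        else:
            # words are looked up as unit-normed vectors
            mean.append(weight * keyedVectors.get_vector(key, norm=True))

    # check once, after the loop, so that empty input is actually caught
    if not mean:
        raise ValueError("cannot compute similarity with no input")

    # average the weighted vectors and rescale the result to unit length
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)

    return mean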

Note: this has not been thoroughly tested.
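
To compare the resulting vector with other vectors, as the question asks, cosine similarity is the usual choice. Below is a minimal sketch in plain Python so that it runs without a trained model; the vectors, the words and the `cosine` helper are made-up illustrations, not part of Gensim.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real word vectors.
candidates = {
    "alpha": [1.0, 0.0],
    "beta": [0.0, 1.0],
    "gamma": [1.0, 1.0],
}

# Stand-in for the vector returned by meanVector(...).
query = [0.6, 0.8]

# Rank the candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda w: cosine(query, candidates[w]), reverse=True)
print(ranked)
```

With a real model, passing the vector to `similar_by_vector` on `model.wv` is probably the simpler way to rank the whole vocabulary against it.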

Upvotes: 0

gojomo

Reputation: 54153

Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.

Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.

But since Gensim is an open-source project, you can look at its exact Python code for that operation and use it as a model for your own calculations.

For the current code defining that function, see:

https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687
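
As a rough illustration of that operation, here is a minimal sketch in plain Python of the steps the linked code appears to take for word inputs: unit-normalize each vector, weight it +1 or -1, average, then unit-normalize the mean. The helper names `unit` and `combine` and the toy vectors are assumptions for illustration, not Gensim functions or real model data.

```python
import math

def unit(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def combine(positive, negative):
    """Unit-norm each input, weight it +1 or -1, average, then unit-norm the mean."""
    weighted = [(unit(v), 1.0) for v in positive]
    weighted += [(unit(v), -1.0) for v in negative]
    if not weighted:
        raise ValueError("cannot combine an empty list of vectors")
    dim = len(weighted[0][0])
    mean = [sum(w * v[i] for v, w in weighted) / len(weighted) for i in range(dim)]
    return unit(mean)

# Hypothetical 2-d vectors standing in for real word vectors.
result = combine(positive=[[3.0, 0.0], [0.0, 2.0]], negative=[[0.0, -5.0]])
print(result)
```

Because the mean is rescaled to unit length at the end, summing instead of averaging would give the same final vector, so the choice between the two does not affect cosine-based comparisons.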

Upvotes: 0
