Gensim: Word Mover's Distance with a string as input instead of a list of strings

Question

I'm trying to measure how similar two sentences are. I'm using gensim's Word Mover's Distance, and since what I want is a similarity, I compute it as follows:

    sim = 1 - wv.wmdistance(sentence_obama, sentence_president)

The inputs I pass are two strings:

    sentence_obama = 'Obama speaks to the media in Illinois'
    sentence_president = 'The president greets the press in Chicago'

The model I'm using is the publicly available word2vec-google-news-300, which I load with this code:

    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")

It gives me reasonable results. Here is where the problem starts. From what I can read in the documentation, it seems wmdistance takes a list of strings as input, not a single string like I pass!

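    # Assumption: stop_words comes from NLTK; the original snippet does not show
    # where it is defined. Requires nltk.download('stopwords') to have been run.
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')

    def preprocess(sentence):
        # Lowercase, split on whitespace, and drop stop words.
        return [w for w in sentence.lower().split() if w not in stop_words]

    sentence_obama = preprocess(sentence_obama)
    sentence_president = preprocess(sentence_president)
    sim = 1 - wv.wmdistance(sentence_obama, sentence_president)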

When I follow the documentation, I get very different results:

    wmd using string as input: 0.5562025871542842
    wmd using list of string as input: -0.0174646259300113
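A side note on the negative value: Word Mover's Distance is not bounded by 1, so `1 - distance` drops below zero whenever the distance exceeds 1 (here the implied distance is about 1.017). A minimal sketch of one alternative mapping, `1 / (1 + distance)`, which stays in (0, 1]. The helper name `distance_to_similarity` is hypothetical and not part of gensim, and whether this mapping suits a given use case is an assumption.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; 0 distance gives 1.0."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# Distance implied by the list-of-strings result above, where 1 - d = -0.0174646...
implied_distance = 1 - (-0.0174646259300113)

print(distance_to_similarity(implied_distance))
```

With this mapping an infinite distance maps to 0.0 rather than to negative infinity.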

I'm really confused. Why does it work with a string as input, and why does it give better results than when I pass what the documentation asks for?
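To make the difference between the two inputs concrete, here is a small sketch of what a function that iterates over its argument would see in each case. It uses only plain Python and does not call gensim. The assumption, not verified here, is that wmdistance iterates over whatever it is given and keeps the items found in the model's vocabulary, in which case a raw string would be compared character by character rather than word by word.

```python
sentence_obama = 'Obama speaks to the media in Illinois'

# Iterating over a raw string yields single characters, spaces included.
char_tokens = list(sentence_obama)

# Splitting first yields word tokens, which is what a list-of-strings API expects.
word_tokens = sentence_obama.lower().split()

print(char_tokens[:6])  # ['O', 'b', 'a', 'm', 'a', ' ']
print(word_tokens)      # ['obama', 'speaks', 'to', 'the', 'media', 'in', 'illinois']
```

If that assumption holds, the score from string input would measure how similar the two sentences' letters are, not their meaning, which would explain why it looks plausible without being comparable to the word-level result.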
