slimhabit

Reputation: 1

Gensim: Word Mover's Distance with a string as input instead of a list of strings

I'm trying to find out how similar two sentences are. To do this I'm using gensim's Word Mover's Distance, and since what I want is a similarity, I compute it as follows:

sim = 1 - wv.wmdistance(sentence_obama, sentence_president)

The inputs I pass are two strings:

    sentence_obama = 'Obama speaks to the media in Illinois'
    sentence_president = 'The president greets the press in Chicago'

The model I'm using is one you can find on the web: word2vec-google-news-300. I load it with this code:

import gensim.downloader as api
wv = api.load("word2vec-google-news-300")

It gives me reasonable results. Here is where the problem starts. From what I can read in the documentation here, it seems wmdistance takes a list of strings as input, not a single string as I am passing!

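from nltk.corpus import stopwords

# stop_words was not defined in the original snippet; NLTK's English list is
# assumed here. It requires a one-time nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    # Lowercase, split on whitespace, and drop stop words.
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sim = 1 - wv.wmdistance(sentence_obama, sentence_president)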

When I follow the documentation, I get very different results:

wmd using string as input: 0.5562025871542842
wmd using list of string as input: -0.0174646259300113

I'm really confused. Why does it work with a string as input, and why does it seem to work better than when I pass what the documentation asks for?

Upvotes: 0

Views: 359

Answers (1)

gojomo

Reputation: 54243

The function needs a list of string tokens to give proper results: if your results from passing full strings look good to you, it's pure luck and/or poor evaluation.

So: why do you consider 0.556 to be a better value than -0.017?

Since passing the texts as plain strings means they are interpreted as lists of single characters, the value there is going to be a function of how different the letters in the two texts are. And because all English sentences of about the same length have very similar letter distributions, most texts will rate as very similar under that error.
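To make the point about characters concrete, here is a minimal sketch in plain Python (no gensim needed) of what any token-based function sees when it iterates over a raw string versus a tokenized one. The variable names are only illustrative:

```python
sentence = "Obama speaks"

# Iterating over a str yields single characters, so a function that
# expects a sequence of tokens treats each letter (and space) as a "word".
as_chars = list(sentence)

# Splitting on whitespace yields the word tokens such a function expects.
as_words = sentence.lower().split()

print(as_chars[:5])  # ['O', 'b', 'a', 'm', 'a']
print(as_words)      # ['obama', 'speaks']
```

Many single letters presumably do have vectors in a large pretrained model, which is why the call returns a number instead of failing.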

Also, similarity or distance values mainly have meaning in comparison to other pairs of sentences, not as two different results from different processes (where one of them is essentially random). You shouldn't treat absolute values that exceed some set threshold, or are close to 1.0, as definitively good. Instead, treat relative differences between two similarity/distance values as meaning that one pair is more similar/distant than another pair.

Finally: converting a distance (which goes from 0.0 for closest to infinity for furthest) to a similarity (which typically goes from 1.0 for most-similar to -1.0 or 0.0 for least-similar) is not usefully done via the formula you're using, similarity = 1.0 - distance. Because a distance can be larger than 2.0, you could get arbitrarily negative similarities with that approach, and be fooled into thinking -0.017 (etc.) is bad because it's negative, even if it's quite good across all the possible return values.
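As a rough illustration of the difference, the sketch below compares 1.0 - distance with one common bounded conversion, 1 / (1 + distance). The function names are hypothetical, and this conversion is only one possible choice, not something gensim provides:

```python
def naive_similarity(distance: float) -> float:
    # The formula from the question: unbounded below as distance grows.
    return 1.0 - distance


def bounded_similarity(distance: float) -> float:
    # One common alternative: maps [0, inf) onto (0, 1], with 1.0 for
    # identical inputs and values approaching 0.0 as distance grows.
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


for d in (0.0, 1.0, 3.0):
    print(d, naive_similarity(d), bounded_similarity(d))
```

Either way, the ordering of pairs is unchanged (a smaller distance always gives a larger similarity), so the conversion only affects how the numbers look, not which pair ranks as closer.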

Some more typical distance-to-similarity conversions are given in another SO question:

How do I convert between a measure of similarity and a measure of difference (distance)?

Upvotes: 1
