Reputation: 11890
According to https://code.google.com/archive/p/word2vec/:
It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.
So we can try it with the supplied demo script:
+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin
Word: paris Position in vocabulary: 198365
Word: france Position in vocabulary: 225534
Word: berlin Position in vocabulary: 380477
Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505
Please note that paris france berlin is the input that the demo's hint suggests. The problem is that I'm unable to reproduce this behavior if I open the same word vectors in Gensim and try to compute the vectors myself. For example:
>>> import numpy as np
>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]
So, what is the word analogy actually doing? How should I reproduce it?
Upvotes: 5
Views: 4716
Reputation: 54213
You should be clear about exactly which word-vector set you're using: different sets will have a different ability to perform well on analogy tasks. (Those trained on the tiny text8 dataset might be pretty weak; the big GoogleNews set Google released would probably do well, at least under certain conditions like discarding low-frequency words.)
You're doing the wrong arithmetic for the analogy you're trying to solve. For an analogy "A is to B as C is to ?" often written as:
A : B :: C : _?_
You begin with 'B', subtract 'A', then add 'C'. So the example:
France : Paris :: Italy : _?_
...gives the formula in your excerpted text:
wv('Paris') - wv('France') + wv('Italy') = target_coordinates  # close to wv('Rome')
And to solve instead:
Paris : France :: Berlin : _?_
You would try:
wv('France') - wv('Paris') + wv('Berlin') = target_coordinates
...then see what's closest to target_coordinates. (Note the difference in operation-ordering from your attempt.)
You can think of it as: start at 'France', take away what 'Paris' contributes, and add in what 'Berlin' contributes.
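A concrete sketch of that manual arithmetic (loading the vectors the same way as in your snippet, and using the lowercase tokens that worked in your attempt; the path below is just a placeholder):
from gensim.models import KeyedVectors

BIGDATA = 'vectors.bin'  # placeholder: path to your word2vec-format vector file
word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)

# Paris : France :: Berlin : ?  ->  start at 'france', subtract 'paris', add 'berlin'
v = word_vectors['france'] - word_vectors['paris'] + word_vectors['berlin']

# nearest neighbors to the target coordinates; the raw query words themselves
# may still rank highly here, since nothing excludes them from the results
print(word_vectors.most_similar(positive=[v], topn=10))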
Note also that gensim's most_similar() takes multiple positive and negative word-examples, to do the arithmetic for you. So you can just do:
sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'])
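That call does the same subtract/add arithmetic, but on unit-length-normalized vectors, and it also excludes the named query words from the returned results, which is largely why your manual attempt showed 'berlin' and 'paris' at the top while the demo tool did not.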
Upvotes: 5
Reputation: 464
It should be just element-wise addition and subtraction of vectors, and cosine similarity to find the most similar ones. However, if you use the original word2vec embeddings, there is a difference between "paris" and "Paris" (the strings were not lowercased or lemmatised).
You may also try:
v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']
or
v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']
because you should keep the concepts parallel (city - country + country -> another city).
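For example, a quick check (a small sketch; the path is a placeholder) of which casings a given vector file actually contains before doing the arithmetic:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # placeholder path

# KeyedVectors supports membership tests, so you can see which surface forms exist
for token in ('Paris', 'paris', 'France', 'france', 'Berlin', 'berlin'):
    print(token, token in word_vectors)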
Upvotes: 4