Reputation: 11890
According to https://code.google.com/archive/p/word2vec/:
It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.
So we can try it with the supplied demo script:
+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin
Word: paris Position in vocabulary: 198365
Word: france Position in vocabulary: 225534
Word: berlin Position in vocabulary: 380477
Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505
Please note that paris france berlin is the input that the demo's hint suggests. The problem is that I'm unable to reproduce this behavior if I open the same word vectors in Gensim and try to compute the vectors myself. For example:
>>> import numpy as np
>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]
So, what is the word analogy actually doing? How should I reproduce it?
Upvotes: 5
Views: 4716
Reputation: 54213
You should be clear about exactly which word-vector set you're using: different sets will have a different ability to perform well on analogy tasks. (Those trained on the tiny text8 dataset might be pretty weak; the big GoogleNews set Google released would probably do well, at least under certain conditions like discarding low-frequency words.)
You're doing the wrong arithmetic for the analogy you're trying to solve. For an analogy "A is to B as C is to ?" often written as:
A : B :: C : _?_
You begin with 'B', subtract 'A', then add 'C'. So the example:
France : Paris :: Italy : _?_
...gives the formula in your excerpted text:
wv('Paris') - wv('France') + wv('Italy') = target_coordinates  # close to wv('Rome')
And to solve instead:
Paris : France :: Berlin : _?_
You would try:
wv('France') - wv('Paris') + wv('Berlin') = target_coordinates
...then see what's closest to target_coordinates. (Note the difference in operation-ordering from your attempt.)
You can think of it as: start at 'France', take away what 'Paris' contributes, and add in what 'Berlin' contributes.
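A concrete sketch of that manual arithmetic (loading the vectors the same way as in your snippet, and using the lowercase tokens that worked in your attempt; the path below is just a placeholder):
from gensim.models import KeyedVectors

BIGDATA = 'vectors.bin'  # placeholder: path to your word2vec-format vector file
word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)

# Paris : France :: Berlin : ?  ->  start at 'france', subtract 'paris', add 'berlin'
v = word_vectors['france'] - word_vectors['paris'] + word_vectors['berlin']

# nearest neighbors to the target coordinates; the raw query words themselves
# may still rank highly here, since nothing excludes them from the results
print(word_vectors.most_similar(positive=[v], topn=10))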
Note also that gensim's most_similar() takes multiple positive and negative word-examples, to do the arithmetic for you. So you can just do:
sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'])
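That call does the same subtract/add arithmetic, but on unit-length-normalized vectors, and it also excludes the named query words from the returned results, which is largely why your manual attempt showed 'berlin' and 'paris' at the top while the demo tool did not.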
Upvotes: 5
Reputation: 464
It should be just element-wise addition and subtraction of vectors, and cosine similarity to find the most similar ones. However, if you use the original word2vec embeddings, there is a difference between "paris" and "Paris" (the strings were not lowercased or lemmatised).
You may also try:
v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']
or
v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']
because you should keep the concepts parallel (city - country + country -> another city).
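For example, a quick check (a small sketch; the path is a placeholder) of which casings a given vector file actually contains before doing the arithmetic:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # placeholder path

# KeyedVectors supports membership tests, so you can see which surface forms exist
for token in ('Paris', 'paris', 'France', 'france', 'Berlin', 'berlin'):
    print(token, token in word_vectors)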
Upvotes: 4