Sarp Kaya

Reputation: 3784

Understanding word2vec text representation

I would like to implement the distance part of word2vec in my program. Unfortunately my program is not in C/C++ or Python, and first of all I did not understand the non-binary representation. This is how I generated the file: ./word2vec -train text8-phrase -output vectorsphrase.txt -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 0

When I check vectorsphrase.txt file for france all I get is:

 france -0.062591 0.264201 0.236335 -0.072601 -0.094313 -0.202659 -0.373314 0.074684 -0.262307 0.139383 -0.053648 -0.154181 0.126962 0.432593 -0.039440 0.108096 0.083703 0.148991 0.062826 0.048151 0.005555 0.066885 0.004729 -0.013939 -0.043947 0.057280 -0.005259 -0.223302 0.065608 -0.013932 -0.199372 -0.054966 -0.026725 0.012510 0.076350 -0.027816 -0.187357 0.248191 -0.085087 0.172979 -0.116789 0.014136 0.131571 0.173892 0.316052 -0.045492 0.057584 0.028944 -0.193623 0.043965 -0.166696 0.111058 0.145268 -0.119645 0.091659 0.056593 0.417965 -0.002927 -0.081579 -0.021356 0.030447 0.052507 -0.109058 -0.011124 -0.136975 0.104396 0.069319 0.030266 -0.193283 -0.024614 -0.025636 -0.100761 0.032366 0.069175 0.200204 -0.042976 -0.045123 -0.090475 0.090071 -0.037075 0.182373 0.151529 0.080198 -0.024067 -0.196623 -0.204863 0.154429 -0.190242 -0.063265 -0.323000 -0.109863 0.102366 -0.085017 0.198042 -0.033342 0.119225 0.176891 0.214628 0.031771 0.168739 0.063246 -0.147353 -0.003526 0.138835 -0.172460 -0.133294 -0.369451 0.063572 0.076098 -0.116277 0.208374 0.015783 0.145032 0.090530 -0.090470 0.109325 0.119184 0.024838 0.101194 -0.184154 -0.161648 -0.039929 0.079321 0.029462 -0.016193 -0.005485 0.197576 -0.118860 0.019042 -0.137174 -0.047933 -0.008472 0.092360 0.165395 0.013253 -0.099013 -0.017355 -0.048332 -0.077228 0.034320 -0.067505 -0.050190 -0.320440 -0.040684 -0.106958 -0.169634 -0.014216 0.225693 0.345064 0.135220 -0.181518 -0.035400 -0.095907 -0.084446 0.025784 0.090736 -0.150824 -0.351817 0.174671 0.091944 -0.112423 -0.140281 0.059532 0.002152 0.127812 0.090834 -0.130366 -0.061899 -0.280557 0.076417 -0.065455 0.205525 0.081539 0.108110 0.013989 0.133481 -0.256035 -0.135460 0.127465 0.113008 0.176893 -0.018049 0.062527 0.093005 -0.078622 -0.109232 0.065856 0.138583 0.097186 -0.124459 0.011706 0.113376 0.024505 -0.147662 -0.118035 0.129616 0.114539 0.165050 -0.134871 -0.036298 -0.103552 -0.108726 0.025950 0.053895 -0.173731 0.201482 -0.198697 -0.339452 0.166154 
-0.014059 0.022529 0.212491 -0.051978 0.057627 0.198319 0.092990 -0.171989 -0.060376 0.084172 -0.034411 -0.065443 0.054581 -0.024187 0.072550 0.113017 0.080476 -0.170499 0.148091 -0.010503 0.158095 0.111080 0.007598 0.042551 -0.161005 -0.078712 0.318305 -0.011473 0.065593 0.121385 0.087249 -0.011178 0.053639 -0.100713 0.168689 0.120121 -0.058025 -0.161788 -0.101135 -0.080533 0.120502 -0.099477 0.187640 -0.054496 0.180532 -0.097961 0.049633 -0.019596 0.145623 0.284261 0.039761 0.053866 0.089779 -0.000676 -0.081653 0.082819 0.263937 -0.141818 0.011605 -0.028248 -0.020573 0.091329 -0.080264 -0.358647 -0.134252 0.115414 -0.066107 0.150770 -0.018897 0.168325 0.111375 -0.091567 -0.152783 -0.034834 -0.418656 -0.091504 -0.134671 0.051754 -0.129495 0.230855 -0.339259 0.208410 0.191621 0.007837 -0.016602 -0.131502 -0.059481 -0.185196 0.303028 0.017646 -0.047340 
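The text (-binary 0) format shown above is simple to read from any language: the file's first line holds the vocabulary size and vector dimensionality, and every following line is a word followed by its components, space-separated (the 'france' entry above is one such line, wrapped for display). A minimal parser sketch in Python, assuming the vectorsphrase.txt file produced by the command above:

```python
def load_vectors(path):
    """Parse word2vec's text output into {word: [float, ...]}."""
    vectors = {}
    with open(path) as f:
        f.readline()  # header line, e.g. "71291 300" (vocab size, dimensions)
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank/trailing lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```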

So apart from these values I don't get anything else, and when I run the distance tool and type france I get:

Word Cosine distance

            spain              0.678515
          belgium              0.665923
      netherlands              0.652428
            italy              0.633130
      switzerland              0.622323
       luxembourg              0.610033
         portugal              0.577154
           russia              0.571507
          germany              0.563291
        catalonia              0.534176

So, from the given probabilities, how do I link one word to other words, and how do I know which value belongs to which word?

Upvotes: 3

Views: 1435

Answers (2)

Mehdi

Reputation: 4318

First of all, you should know that word2vec was made for research purposes. It generates a vector representation for each word; e.g. if you train a model with 50 dimensions, you will get 50 numbers next to each word (like 'france'). For the sake of this example, imagine a 2-dimensional vector space and two words, 'france' and 'spain':

france -0.1 0.2
spain -0.3 0.15

The cosine similarity (they call it distance) is the normalized dot product of two vectors: the sum of the products of the corresponding components, divided by the product of the lengths (norms) of the two vectors. The cosine similarity of these two vectors is:

    ( (-0.1) * (-0.3) + (0.2) * (0.15) )
    / sqrt((-0.1) * (-0.1) + (0.2) * (0.2))
    / sqrt((-0.3) * (-0.3) + (0.15) * (0.15)) = 0.8
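The arithmetic above can be written as a small function; the two example vectors for 'france' and 'spain' are the toy ones from this answer, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b, divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

france = [-0.1, 0.2]
spain = [-0.3, 0.15]
print(round(cosine_similarity(france, spain), 2))  # -> 0.8
```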

So, those numbers in the distance tool's output are not probabilities. (As a reminder, cosine similarity can be negative.) The similarity measure is, intuitively, the only thing you get when you want to connect two words. But Mikolov (the word2vec creator) showed that in word2vec models you can actually find other semantic relations through simple vector operations (the paper). If you are looking for such a tool in word2vec, it's called word-analogy. It returns a list of words for 3 given words (the famous: king - man + woman = queen).
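A sketch of what word-analogy does under the hood: compute vec(king) - vec(man) + vec(woman) and rank every other word by cosine similarity to the result. The tiny 2-d vectors here are made up purely to keep the example runnable, not taken from a trained model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def analogy(vectors, a, b, c, topn=1):
    """Rank words by similarity to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target),
                  reverse=True)[:topn]

toy = {
    "king":  [0.9, 0.8],
    "man":   [0.9, 0.1],
    "woman": [0.1, 0.1],
    "queen": [0.1, 0.8],
    "apple": [0.5, -0.5],
}
print(analogy(toy, "king", "man", "woman"))  # -> ['queen']
```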

I hope I answered your question.

Upvotes: 6

Luca Fiaschi

Reputation: 3205

I believe what you get for france is the vector associated with that word. Cosine similarity can be computed from those vectors as described at http://en.wikipedia.org/wiki/Cosine_similarity. These values have no probabilistic interpretation as far as I know. You can retrieve semantically close words by setting a threshold on the cosine similarity of the vectors.
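The thresholding idea can be sketched as follows, assuming you have already loaded the vectors into a dict; the toy 2-d vectors and the 0.6 threshold are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def neighbors(vectors, query, threshold=0.6):
    """Return (word, similarity) pairs above the threshold, best first."""
    q = vectors[query]
    scored = [(w, cosine(v, q)) for w, v in vectors.items() if w != query]
    return sorted((ws for ws in scored if ws[1] >= threshold),
                  key=lambda ws: ws[1], reverse=True)

toy = {
    "france": [-0.1, 0.2],
    "spain":  [-0.3, 0.15],
    "banana": [0.4, -0.2],
}
print(neighbors(toy, "france"))  # spain passes the threshold, banana does not
```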

Upvotes: 2
