fhe

Reputation: 57

What is the output of Gensim word2vec?

I want to use gensim word2vec as input for a neural network. I have 2 questions:

1) gensim.models.Word2Vec takes a size parameter. How is this parameter used, and what is it the size of?

2) Once the model is trained, what is the output of gensim word2vec? As far as I can see, the values are not probabilities (they are not between 0 and 1). It seems to me that each word vector holds a (cosine) distance between this word and some other words, but which words exactly?

Thanks for your response.

Upvotes: 1

Views: 1752

Answers (2)

Ayman Salama

Reputation: 449

Dimension size in word2vec

  • Word2vec creates a vector space that represents words based on the corpus it was trained on.
  • Each vector is a mathematical representation of a word relative to the other words in that corpus. The dimension size is the length of the vector.
  • Mathematical operations on the vectors express relationships between words.
  • For example, the vectors of "man" and "king" will be close to each other, and the same holds for the vectors of "Paris" and "France".
  • If the size is too small, such as two or three dimensions, the vectors can represent very little information.
  • The dimensions can be thought of as links between words: words can be related to each other along different dimensions, depending on how they are positioned relative to each other in the corpus.

How to use the vectors

  • A single vector means little on its own; its numbers encode the position of the word relative to all other words in the corpus.
  • A vector becomes meaningful when it is compared with another vector.
  • Cosine similarity is one of the common ways to measure the similarity between words.
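
To make the comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up 4-dimensional values chosen for illustration, not output from a trained model, and the helper name cosine_similarity is my own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for this example (assumption: not real embeddings).
king = [0.8, 0.6, 0.1, 0.0]
man = [0.7, 0.7, 0.0, 0.1]
banana = [-0.1, 0.0, 0.9, 0.4]

sim_king_man = cosine_similarity(king, man)
sim_king_banana = cosine_similarity(king, banana)

# Vectors pointing in a similar direction score closer to 1.
print(sim_king_man > sim_king_banana)
```

The result always lies between -1 and 1, which is why the raw numbers inside a vector do not need to look like probabilities.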

Good luck

Upvotes: 0

Ashutosh Baheti

Reputation: 420

Answer to 1: The size parameter is the dimensionality of the word vectors, i.e. each vector will have 100 dimensions if size=100. (In gensim 4.0 and later this parameter is named vector_size.)

Answer to 2: You can save the word vectors with the function save_word2vec_format(fname="vectors.txt", fvocab=None, binary=False). This writes a file "vectors.txt" whose first line is <size of the vocabulary> <dimensions>, and each remaining line has the form <word> <vector of size dimension>.

Sample for "vectors.txt":

297820 100
the -0.18542234751 0.138813291635 0.0392148854213 0.0238721499736 -0.0443151295365 0.03226302388 -0.168626211895 -0.17397777844 -0.0547546409461 0.166621666046 0.0534506882806 0.0774947957067 -0.180520283779 -0.0938140452702 -0.0354599008902 -0.0533488133527 -0.0667684564816 -0.0210904306995 -0.103069115604 -0.138712344952 -0.035142440978 -0.125067138202 0.0514192233164 -0.142052171747 0.0795726729387 0.0310433094806 -0.00666224898992 0.047268806263 0.0339849190176 -0.181107631029 0.0477396587205 0.0483130822899 -0.090229393762 0.0224528628225 0.190814060668 -0.179506639849 0.00034066604609 0.0639057478 0.156444383949 -0.0366888977431 -0.170674385275 -0.053907152935 0.106572313582 0.0724497821903 -0.00848717936216 0.124053494271 -0.0420715605081 0.0460277422205 -0.0514693485657 0.132215091575 -0.0429308836475 -0.111784875385 -0.0543172053216 0.0849476776796 -0.015301892652 0.00992711997251 -0.00566113637219 0.00136359242972 -0.0382116842516 0.0681229985191 0.0685156463052 0.0759072640845 -0.0238136705161 0.168710450161 0.00879930186352 -0.179756801973 -0.210286559709 -0.161832152064 -0.0212640125813 -0.0115905356526 -0.0949562511822 0.126493155131 0.0215821686774 -0.164276918273 -0.0573806470616 -0.0147266125919 0.0566350339785 -0.0276969849679 0.0178970346094 0.0599163813161 0.0919867942845 0.172071394538 0.0714226787026 0.109037733251 0.00403647493576 0.044853743905 -0.0915639785243 -0.0242494817113 0.0705554654776 0.255584701079 0.001309754199 0.0872413719572 -0.0376289782286 0.158184379871 0.109245196088 -0.0727554069742 0.168820215174 0.0454895919746 0.0741726055733 -0.134467710995
...
...
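
If you want to feed these vectors to a neural network, the text format described above can be read back into a word-to-vector mapping. The following is a minimal sketch in plain Python, assuming a space-separated file with the header line described above; the function name load_word2vec_text and the tiny 3-dimensional sample are made up for illustration. With gensim itself, KeyedVectors.load_word2vec_format should do the same job.

```python
import io

def load_word2vec_text(handle):
    """Parse the word2vec text format: a header line, then one word per line."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors

# Hypothetical miniature file: 2 words, 3 dimensions each.
sample = io.StringIO("2 3\nthe -0.185 0.138 0.039\ncat 0.021 -0.044 0.032\n")
vocab_size, dim, vectors = load_word2vec_text(sample)

print(vocab_size, dim, vectors["the"])
```

Each value in the mapping is a list of dim floats, which can be stacked into the embedding matrix of a network.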

Upvotes: 4
