Reputation: 752
I have a question about word2vec and word embeddings. I have downloaded pre-trained GloVe embeddings (100-dimensional vectors) and am using this function to load them:
import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]                                            # first token is the word itself
            embedding = np.array([float(val) for val in splitLine[1:]])    # remaining tokens are its coordinates
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model
Now if I call this function and look up the word 'python', something like:
print(loadGloveModel('glove.6B.100d.txt')['python'])
it gives me a 100-dimensional vector like this:
[ 0.24934 0.68318 -0.044711 -1.3842 -0.0073079 0.651
-0.33958 -0.19785 -0.33925 0.26691 -0.033062 0.15915
0.89547 0.53999 -0.55817 0.46245 0.36722 0.1889
0.83189 0.81421 -0.11835 -0.53463 0.24158 -0.038864
1.1907 0.79353 -0.12308 0.6642 -0.77619 -0.45713
-1.054 -0.20557 -0.13296 0.12239 0.88458 1.024
0.32288 0.82105 -0.069367 0.024211 -0.51418 0.8727
0.25759 0.91526 -0.64221 0.041159 -0.60208 0.54631
0.66076 0.19796 -1.1393 0.79514 0.45966 -0.18463
-0.64131 -0.24929 -0.40194 -0.50786 0.80579 0.53365
0.52732 0.39247 -0.29884 0.009585 0.99953 -0.061279
0.71936 0.32901 -0.052772 0.67135 -0.80251 -0.25789
0.49615 0.48081 -0.68403 -0.012239 0.048201 0.29461
0.20614 0.33556 -0.64167 -0.64708 0.13377 -0.12574
-0.46382 1.3878 0.95636 -0.067869 -0.0017411 0.52965
0.45668 0.61041 -0.11514 0.42627 0.17342 -0.7995
-0.24502 -0.60886 -0.38469 -0.4797 ]
I need help understanding this output. What do these values represent, and what is their significance in generating the embedding for a word?
Upvotes: 3
Views: 4462
Reputation: 2424
In a nutshell, a word vector in a word embedding represents the word's contexts. It "embeds" meaning because similar words appear in similar contexts. You can extend the same idea to an embedding of almost anything: train a neural network on many contexts of some object (sentences, paragraphs, documents, images, and so on), and the resulting d-dimensional vector will contain a valuable representation of that object.
This post gives a good overview of the whole landscape: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
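As a rough sketch of what "similar words have similar contexts" buys you in practice, you can compare two loaded vectors with cosine similarity. This reuses the loadGloveModel function from the question; the particular word pairs are only illustrative, and the exact scores depend on the vector file:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

model = loadGloveModel('glove.6B.100d.txt')  # loading function from the question
print(cosine_similarity(model['python'], model['java']))    # related words -> usually a higher score
print(cosine_similarity(model['python'], model['banana']))  # unrelated words -> usually a lower score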
Upvotes: 1
Reputation: 32051
Here is a nice article explaining the underlying intuition and meaning of word2vec vectors.
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
There's no universal way to know exactly what an embedding means. The results discussed in that article were discovered by looking at many embeddings in which one value is varied and noting the differences. Each word2vec model will come up with its own unique embedding, and the individual values of the embedding have some semantic meaning in the language.
What word2vec gives you is a conversion of the sparse one-hot vector representing each word in a dictionary of potentially millions of words into a small, dense vector in which each value carries some semantic meaning. Large, sparse inputs are usually bad for learning; small, dense, meaningful inputs are usually good.
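A minimal sketch of that size difference, reusing loadGloveModel from the question (the vocabulary size and word index here are purely hypothetical, for illustration):

import numpy as np

vocab_size = 400000    # order of magnitude of a large vocabulary (assumption for illustration)
python_index = 12345   # hypothetical index of 'python' in that vocabulary

# One-hot: a single 1 in a huge, mostly-zero vector; no two words are "close" to each other.
one_hot = np.zeros(vocab_size)
one_hot[python_index] = 1.0

# Dense embedding: a short vector of real values whose relative positions carry meaning.
model = loadGloveModel('glove.6B.100d.txt')   # loading function from the question
dense = model['python']

print(one_hot.shape, int(np.count_nonzero(one_hot)))  # (400000,) 1
print(dense.shape)                                     # (100,)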
Upvotes: 2
Reputation: 54153
In usual word2vec/GloVe models, the individual per-dimension coordinates don't specifically mean anything. The training process instead forces words into valuable/interesting relative positions against each other.
All meaning is in the relative distances and relative directions, not specifically aligned with exact coordinate axes.
Consider a classic illustrative example: the ability of word-vectors to solve an analogy like "man is to king as woman is to ?" – by finding the word queen near some expected point in the coordinate-space.
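A sketch of that analogy with plain vector arithmetic, reusing loadGloveModel from the question (a brute-force nearest-neighbour search, so it is slow but illustrative; the result depends on the vector file used):

import numpy as np

def most_similar(model, target, exclude, topn=3):
    # Brute-force nearest-neighbour search by cosine similarity
    # (fine for a demo, far too slow for real workloads).
    target = target / np.linalg.norm(target)
    scored = []
    for word, vec in model.items():
        if word in exclude:
            continue
        scored.append((np.dot(vec, target) / np.linalg.norm(vec), word))
    return sorted(scored, reverse=True)[:topn]

model = loadGloveModel('glove.6B.100d.txt')   # loading function from the question
analogy = model['king'] - model['man'] + model['woman']
print(most_similar(model, analogy, exclude={'king', 'man', 'woman'}))
# With these vectors the nearest neighbour is typically 'queen'.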
There will be neighborhoods of the word-vector space that include lots of related words of one type (man, men, male, boy, etc. - or king, queen, prince, royal, etc.). And further, there may be some directions that match well with human ideas of categories and magnitude (more woman-like, more-monarchical, higher-ranked, etc.). But these neighborhoods and directions generally are not 1:1 correlated with exact axis-dimensions of the space.
And further, there are many possible near rotations/reflections/transformations of a space full of word-vectors which are just-as-good as each other for typical applications, but totally different in their exact coordinates for each word. That is, all the expected relative distances are similar – words have the 'right' neighbors, in the right ranked order – and there are useful directional patterns. But the individual words in each have no globally 'right' or consistent position – just relatively useful positions.
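A quick way to see this, reusing loadGloveModel from the question: apply the same random orthogonal rotation to every vector and note that similarities are unchanged, even though every individual coordinate is now completely different.

import numpy as np

model = loadGloveModel('glove.6B.100d.txt')   # loading function from the question
dim = len(model['python'])

# A random orthogonal matrix: a rigid rotation/reflection of the whole space.
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = model['king'], model['queen']
print(cos(a, b))                        # similarity in the original coordinates
print(cos(rotation @ a, rotation @ b))  # the same similarity after rotating every vector
# Every individual coordinate changes, but all relative distances/angles
# (and hence the useful behaviour) stay the same.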
Even if in one set of vectors there appears to be some vague correlation – like "high values in dimension 21 correlate with the idea of 'maleness'" – it's likely to be a coincidence of that vector-set, not a reliable relationship.
(There are some alternate techniques which try to force the individual dimensions to be mapped to more-interpretable concepts – see as one example NNSE – but their use seems less common.)
Upvotes: 2