Reputation: 2704
I am sorry for my naivety, but I don't understand why word embeddings that are the result of an NN training process (word2vec) are actually vectors.
Embedding is a process of dimension reduction: during the training process the NN reduces the 1/0 arrays of words into smaller arrays, and the process does nothing that applies vector arithmetic.
So as a result we get just arrays and not vectors. Why should I think of these arrays as vectors?
Even if we do get vectors, why does everyone depict them as vectors coming from the origin (0,0)?
Again, I am sorry if my question looks stupid.
Upvotes: 8
Views: 4591
Reputation: 122052
What are embeddings?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
(Source: https://en.wikipedia.org/wiki/Word_embedding)
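As a rough sketch of that mapping (the 5-word vocabulary and the numbers below are made up for illustration): in the original space each word is a one-hot array with one dimension per word in the vocabulary, while the embedding assigns it a much shorter array of real numbers.
>>> import numpy as np
>>> vocab = ['car', 'vehicle', 'apple', 'orange', 'fruit']    # toy vocabulary
>>> one_hot_apple = np.zeros(len(vocab)); one_hot_apple[vocab.index('apple')] = 1.0
>>> one_hot_apple                        # one dimension per word in the vocabulary
array([0., 0., 1., 0., 0.])
>>> dense_apple = np.array([0.12, -0.40, 0.93])    # a made-up low-dimensional embedding for 'apple'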
What is Word2Vec?
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
(Source: https://en.wikipedia.org/wiki/Word2vec)
What's an array?
In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.
An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.
The simplest type of data structure is a linear array, also called one-dimensional array.
(Source: https://en.wikipedia.org/wiki/Array_data_structure)
What's a vector / vector space?
A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.
Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.
The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms, listed below.
(Source: https://en.wikipedia.org/wiki/Vector_space)
What's the difference between vectors and arrays?
Firstly, the vector in word embeddings is not exactly the programming language data structure (so it's not Arrays vs Vectors: Introductory Similarities and Differences).
Programmatically, a word embedding vector IS some sort of an array (data structure) of real numbers (i.e. scalars).
Mathematically, any element with one or more dimensions populated with real numbers is a tensor, and a vector is a single dimension of scalars.
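A small sketch of that distinction in numpy (the numbers are arbitrary): the 1-D array is the data structure, it supports the vector operations (addition, scaling) that make it behave like a vector, and ndim gives the tensor rank.
>>> import numpy as np
>>> v = np.array([2.0, -1.0, 0.5])    # a 1-D array, i.e. a vector of scalars
>>> w = np.array([1.0, 3.0, 2.0])
>>> v + w                             # vector addition
array([3. , 2. , 2.5])
>>> 2 * v                             # scalar multiplication
array([ 4., -2.,  1.])
>>> np.array(5.0).ndim, v.ndim, np.array([v, w]).ndim    # scalar, vector, matrix = tensors of rank 0, 1, 2
(0, 1, 2)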
To answer the OP question:
Why are word embeddings actually vectors?
By definition, word embeddings are vectors (see above).
Why do we represent words as vectors of real numbers?
To learn the differences between words, we have to quantify the difference in some manner.
Imagine if we assign these "smart" numbers to the words:
>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848
We see that the distance between fruit and apple is small but the distance between samsung and apple isn't. In this case, a single numerical "feature" of a word is capable of capturing some information about the word's meaning, but not fully.
Imagine that we have two real-number values for each word (i.e. a vector):
>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}
To compute the difference, we could have done:
>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 760])
>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848, 8])
That's not very informative: it returns a vector, and we can't get a definitive measure of the distance between the words. So we can try some vectorial tricks and compute the distance between the vectors, e.g. the Euclidean distance:
>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])
>>> np.linalg.norm(apple-orange)
763.03604108849277
>>> np.linalg.norm(apple-samsung)
848.03773500947466
>>> np.linalg.norm(orange-samsung)
1083.4685043876448
Now we can see more "information": apple can be closer to samsung than orange is to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange does.
The big question comes: "How do we get these real numbers to represent the vectors of the words?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio et al., 2003) come in.
Since adding more real numbers to the vector representing a word is more informative, why don't we just add a lot more dimensions (i.e. more columns in each word vector)?
Traditionally, we compute the differences between words by computing word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse, with many zero values, if the words don't co-occur with one another.
Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like taking a top-down view of the global relations between words and then compressing the matrix to get a smaller vector to represent each word.
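A minimal sketch of that traditional route (the tiny co-occurrence counts below are made up; a real word-by-word matrix has vocabulary-sized rows/columns and is mostly zeros), reducing the matrix with a truncated SVD so that each word ends up with a small dense vector:
>>> import numpy as np
>>> words = ['apple', 'orange', 'fruit', 'samsung', 'iphone']
>>> cooc = np.array([[0, 3, 5, 2, 1],      # made-up co-occurrence counts,
...                  [3, 0, 4, 0, 0],      # rows/columns follow `words`
...                  [5, 4, 0, 0, 0],
...                  [2, 0, 0, 0, 6],
...                  [1, 0, 0, 6, 0]], dtype=float)
>>> U, S, Vt = np.linalg.svd(cooc)
>>> k = 2                                  # keep only the top-k dimensions
>>> reduced = U[:, :k] * S[:k]             # each row is now a 2-D vector for one word
>>> reduced.shape
(5, 2)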
So the "deep learning" word embedding creation comes from the another school of thought and starts with a randomly (sometimes not-so random) initialized a layer of vectors for each word and learning the parameters/weights for these vectors and optimizing these parameters/weights by minimizing some loss function based on some defined properties.
It sounds a little vague, but concretely, if we look at the Word2Vec learning technique, it'll be clearer.
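For concreteness, a minimal sketch with the gensim library (assuming gensim 4.x; the toy corpus and parameters are made up, and a real model needs a much larger corpus):
>>> from gensim.models import Word2Vec
>>> sentences = [['apple', 'is', 'a', 'fruit'],
...              ['samsung', 'and', 'iphone', 'are', 'phones'],
...              ['i', 'eat', 'an', 'apple', 'and', 'an', 'orange']]
>>> model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=50)
>>> model.wv['apple']                       # the learned embedding: a plain numpy array of 10 real numbers
>>> model.wv.similarity('apple', 'orange')  # cosine similarity between two learned vectors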
Here are more resources to read up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors
Upvotes: 9
Reputation: 2110
The famous Word2Vec implementation is CBOW + Skip-Gram.
Your input for CBOW is your input word vector (each is a vector of length N; N = size of vocabulary). All these input word vectors together form an array of size M x N (M = number of words).
Now what is interesting in the CBOW graphic (see the paper linked below) is the projection step, where we force the NN to learn a lower-dimensional representation of our input space in order to predict the output correctly. The desired output is our original input.
This lower-dimensional representation P consists of abstract features describing words, e.g. location, adjective, etc. (in reality these learned features are not really clearly interpretable). These features now represent one view of these words.
And as with all features, we can see them as high-dimensional vectors. If you want, you can use dimensionality reduction techniques to display them in 2- or 3-dimensional space.
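For example, a sketch of projecting such vectors down to 2-D with PCA from scikit-learn (the embeddings array here is a random placeholder standing in for the learned features):
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> embeddings = np.random.randn(100, 300)       # placeholder: 100 words, 300 learned features each
>>> coords_2d = PCA(n_components=2).fit_transform(embeddings)
>>> coords_2d.shape                              # one (x, y) point per word, ready to plot
(100, 2)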
More details and source of graphic: https://arxiv.org/pdf/1301.3781.pdf
Upvotes: 1
Reputation: 6562
Each word is mapped to a point in d-dimensional space (d is usually 300 or 600, though not necessarily), thus it's called a vector (each point in d-dimensional space is nothing but a vector in that d-dimensional space).
The points have some nice properties (words with similar meanings tend to occur closer to each other); proximity is measured using the cosine distance between two word vectors.
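A sketch of that cosine measure with numpy (the two word vectors here are made up):
>>> import numpy as np
>>> apple = np.array([232.0, 1010.0])     # made-up word vectors
>>> samsung = np.array([1080.0, 1002.0])
>>> cos_sim = apple @ samsung / (np.linalg.norm(apple) * np.linalg.norm(samsung))
>>> cos_dist = 1 - cos_sim                # cosine distance = 1 - cosine similarity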
Upvotes: 1
Reputation: 53758
the process does nothing that applies vector arithmetic
The training process has nothing to do with vector arithmetic, but when the arrays are produced, it turns out they have pretty nice properties, so that one can think of a "word linear space".
For example, what words have embeddings closest to a given word in this space?
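A sketch of that query (hypothetical; assumes wv is a trained gensim 4.x KeyedVectors object holding the embeddings):
>>> wv.most_similar('apple', topn=5)    # the 5 words whose embedding vectors lie closest to 'apple'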
Put differently, words with similar meaning form a cloud; the post linked below shows a 2-D t-SNE representation of this.
Another example: the distance between "man" and "woman" is very close to the distance between "uncle" and "aunt".
As a result, you have pretty much reasonable arithmetic:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
So it's not far-fetched to call them vectors. All the pictures are from this wonderful post that I very much recommend reading.
Upvotes: 6