Reputation: 68
I've been hearing a lot about the neural network Word2vec, which is able to solve word analogies based on the linguistic contexts words appear in. People often describe the weights as trained bias put in place by previously labeled data, but what is rarely described is what these weights actually calculate. In the case of Word2vec, what do its 300 hidden weights calculate? Contextual position? Connotation frequencies? A mix of numerically encoded grammatical features?
So far, I've been able to visualize neural networks up to the complexity of a network trained to compute boolean XOR. In that case, I understand that the weights push the output toward 0 or 1, for False or True respectively. However, I can't make that connection to Word2vec, which operates in a completely different domain (language). Can someone explain in detail?
Upvotes: 0
Views: 301
Reputation: 54173
The weights aren't really 'measuring' anything.
Within the constraints of the shallow neural-network architecture, the weights are incrementally optimized, from a starting random initialization, to become better and better at predicting a 'target'/'center' word from its neighbors within the configured context window.
More specifically, the network is fed individual examples of (actual-context -> actual-word), and via forward-propagation its current predictions of possible target words (interpreted via specific output nodes) are observed. Then, via back-propagation, small adjustments are made to the weights to make the predictions slightly better.
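As a concrete illustration of that training loop, here's a minimal sketch using the gensim library (one common Word2vec implementation, assumed here only for illustration; the toy corpus and parameter values are arbitrary):

```python
# Minimal sketch using gensim (one common Word2vec implementation).
# The toy corpus and parameter values are illustrative, not prescriptive.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "mouse"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the hidden/projection layer
    window=5,          # how many neighbors count as 'context'
    min_count=1,       # keep every word in this tiny corpus
    sg=1,              # 1 = skip-gram; 0 = CBOW
    epochs=50,         # many passes, since the corpus is tiny
)

# Each pass feeds (context -> target) examples through the shallow network
# and nudges the weights via back-propagation; the learned per-word weights
# are exposed afterwards as model.wv.
print(model.wv["cat"][:5])   # first 5 of the 100 learned dimensions
```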
Of course, since a context doesn't always perfectly predict a word – many identical contexts will have different target words – different examples pull the weights in different ways, in a tug-of-war. And no training corpus will reflect the full variety of possible/likely expression, just a subset. And the model itself is of limited size – far smaller than the training data – so it is in a vague sense 'compressing' the corpus down to its most reliable patterns.
Eventually, the model gets as good as is possible, within its limited size and mechanism-of-operation, at these micro-prediction tasks. This is "convergence" of the optimization process: further training can't find any other weight-nudges that reliably improve overall performance. (If they improve some examples, they hurt others.)
At this stage, it turns out that the process of forcing all the words, and all the usage examples, into the model's limited, shared representations creates the 'word-vectors' people find useful. (The word-vectors can be thought of as one 'projection-layer' inside the model, that turns one-hot word-vectors of dimensionality equal to the count of all known words into dense vectors of far fewer dimensions.) Words humans would perceive as similar tend to be close – as keeping them in similar positions has improved predictions.
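To make that 'projection-layer' point concrete: multiplying a one-hot vector by the input weight matrix is the same as selecting one row of that matrix, so each word's vector is literally a slice of the trained weights. A rough sketch (the matrix here is random and the sizes are illustrative, not from any particular trained model):

```python
import numpy as np

vocab_size, dims = 10_000, 300            # e.g. 10k known words, 300 dimensions
W_in = np.random.rand(vocab_size, dims)   # stands in for the trained input weights

word_index = 42                           # index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# 'Projecting' the one-hot vector through the weight matrix...
projected = one_hot @ W_in

# ...is exactly the same as reading off that word's row of trained weights:
assert np.allclose(projected, W_in[word_index])
```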
And further, vague directions in the vector-space tend to correlate with aspects of human understanding. (These aren't neatly mapped to axes, instead shearing across all of the dimensions at once.) And that gives rise to the impressive ability to mimic analogical reasoning via vector-math.
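For example, the well-known 'king - man + woman ≈ queen' demonstration is plain vector arithmetic followed by a nearest-neighbor search. Assuming a trained gensim model whose vocabulary contains these words (purely illustrative), it can be written either with the built-in helper or by hand:

```python
# Assumes `model` is a trained gensim Word2Vec model whose vocabulary
# actually contains these words; purely illustrative.
import numpy as np

# Built-in helper: add 'king' and 'woman', subtract 'man', return nearest words.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same thing by hand, to show it really is just vector math plus
# a cosine-similarity nearest-neighbor search:
target = model.wv["king"] - model.wv["man"] + model.wv["woman"]
best_word, best_sim = None, -1.0
for word in model.wv.index_to_key:
    if word in ("king", "man", "woman"):
        continue
    vec = model.wv[word]
    sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
    if sim > best_sim:
        best_word, best_sim = word, sim
print(best_word, best_sim)
```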
The end weights have vague relationships to other features of the text – cooccurrences, frequencies, relative positions, human-understandable grammar – but they are really only end-products of the training/optimization process: what series of nudges, from initially-random positions, made predictions better?
The basis for why this works well for certain tasks is more practical/empirical than fully theoretically-grounded.
Upvotes: 1