Roger

Reputation: 1133

Mixing Word Vectors from Different Models

While working with Word2Vec to find ways to disambiguate word senses using word vector representations, one strategy that came to mind was the following:

Train a model on a corpus where the senses of the words of interest are known (in my case, English words that are also gene names). Then, whenever a paragraph of interest appears in an unknown corpus, train a small model on that paragraph containing the word of interest. With the word vectors built from this snippet, compare the representations of the specific word in the known context and in the unknown context to see how close they are in the vector space.

While trying this approach, I noticed that even two models trained on the same corpus produce quite different representations for the same word. In other words, the cosine similarity between these two word vectors is quite low.
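
Here is roughly how I reproduce the mismatch (a minimal sketch, assuming gensim 4.x, with a hypothetical toy corpus and the placeholder target word 'bad'):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical toy corpus; the real corpus is much larger.
sentences = [["the", "bad", "gene", "regulates", "apoptosis"],
             ["expression", "of", "bad", "was", "measured", "in", "cells"]] * 100

# Two runs over the exact same corpus, differing only in the random seed.
m1 = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, workers=1)
m2 = Word2Vec(sentences, vector_size=50, min_count=1, seed=2, workers=1)

v1, v2 = m1.wv["bad"], m2.wv["bad"]
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print("cross-model cosine similarity for 'bad':", cos)  # typically low
```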

So my question is: is this difference due to the model somehow building a different set of basis vectors to represent the space each time? And if so, is there a way to lock those to the standard Euclidean basis during training? Or is the difference due to something else?

Upvotes: 0

Views: 1086

Answers (2)

gojomo

Reputation: 54173

To add to the prior answer & comment:

A technique that might have some chance of working would be to hold all word-vectors constant except the vector for the single word of interest during your training. (That is, initialize the new model with the prior weights, lock all other words against training-changes, then perform the training using the new text, and see how much the word-vector-of-interest moves.)

A single paragraph is still a tiny amount of data for such algorithms, and normal word use (even of a single word-sense) will have wildly-varying contexts. But this approach could help offset the randomization in serial model-trainings, and work better with the limitation of a tiny training-set.
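
A rough sketch of that freeze-everything-but-one idea, assuming gensim 4.x, where the experimental vectors_lockf array on the model's KeyedVectors can (as far as I know) be expanded to one lock factor per word; the corpus, word, and parameter choices below are placeholders only:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical base corpus where the senses of the word of interest are known.
base_sentences = [["the", "bad", "gene", "regulates", "apoptosis"],
                  ["expression", "of", "bad", "was", "measured", "in", "cells"]] * 100

model = Word2Vec(base_sentences, vector_size=50, min_count=1, seed=1, workers=1)

target = "bad"  # hypothetical word of interest
before = model.wv[target].copy()

# Freeze every word (lock factor 0.0) except the target word (1.0).
# vectors_lockf is an experimental knob; by default it is a single shared
# value, but it can be expanded to a per-word array.
model.wv.vectors_lockf = np.zeros(len(model.wv), dtype=np.float32)
model.wv.vectors_lockf[model.wv.key_to_index[target]] = 1.0

# Continue training on the new paragraph only (tokens missing from the
# existing vocabulary are simply skipped).
new_paragraph = [["the", "results", "of", "the", "experiment", "were", "bad"]]
model.train(new_paragraph, total_examples=len(new_paragraph), epochs=20)

after = model.wv[target]
shift = np.dot(before, after) / (np.linalg.norm(before) * np.linalg.norm(after))
print("cosine similarity of the target vector before vs. after:", shift)
```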

Upvotes: 1

dkar

Reputation: 2123

Adding to lejlot's comment: every time you train the model, it starts from a different random initialization and ends up in a different local optimum, so there is no way for two different models to return similar vectors, even if you train them on the same corpus. What you should expect, though (if you train the models on the same corpus), is that the word relationships will be analogous from model to model, e.g. the cosine similarity between 'cat' and 'dog' in Model 1 will be similar to the cosine similarity between the same words in Model 2.
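
As a minimal sketch (assuming gensim 4.x and a hypothetical toy corpus), you can check this with the similarity() method of each model's KeyedVectors:

```python
from gensim.models import Word2Vec

# Toy corpus in which 'cat' and 'dog' appear in similar contexts.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "mat"],
             ["the", "cat", "chased", "the", "mouse"],
             ["the", "dog", "chased", "the", "ball"]] * 50

m1 = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, workers=1)
m2 = Word2Vec(sentences, vector_size=50, min_count=1, seed=2, workers=1)

# The raw vectors differ between runs, but the within-model pairwise
# similarities should be roughly comparable.
print(m1.wv.similarity("cat", "dog"))
print(m2.wv.similarity("cat", "dog"))
```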

Regarding WSI, your method is not going to work anyway, since (again, as noted by lejlot) it is not possible to train a meaningful vector from just a paragraph. A simple way to go (not involving additional neural layers), once you have trained your model, is the following (a rough sketch follows the list):

  1. For each sentence in which your target word occurs, create a vector representing the context (e.g. by adding the vectors of all other words in the same sentence).
  2. Cluster these context vectors with your favourite clustering algorithm and, based on the clusters, create sense vectors (e.g. by taking the centroid of each cluster).
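
A minimal sketch of those two steps, assuming gensim 4.x and scikit-learn, with a hypothetical ambiguous target word and toy corpus, and k-means standing in for whatever clustering algorithm you prefer:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

target = "bad"  # hypothetical ambiguous word (gene name vs. ordinary adjective)

# Toy corpus mixing the two senses; a real corpus would be far larger.
sentences = [["the", "bad", "gene", "regulates", "apoptosis", "in", "cells"],
             ["expression", "of", "bad", "protein", "was", "measured"],
             ["the", "movie", "was", "really", "bad", "and", "boring"],
             ["a", "bad", "idea", "with", "terrible", "results"]] * 50

model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, workers=1)

# Step 1: one context vector per occurrence of the target word,
# built by summing the vectors of the other words in the sentence.
contexts = []
for sent in sentences:
    if target in sent:
        others = [model.wv[w] for w in sent if w != target and w in model.wv]
        if others:
            contexts.append(np.sum(others, axis=0))
contexts = np.vstack(contexts)

# Step 2: cluster the context vectors; each centroid acts as a sense vector.
k = 2  # assumed number of senses
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(contexts)
sense_vectors = kmeans.cluster_centers_

# A new occurrence can then be disambiguated by comparing its context
# vector to the sense vectors, e.g. with cosine similarity.
```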

This method was developed by Hinrich Schutze 20 years ago and is still pretty much the standard approach for WSI with distributional models of meaning.

Upvotes: 1
