Reputation: 1074
Let's say I have trained two separate GloVe vector space models (using text2vec
in R
) based on two different corpora. There could be different reasons for doing so: the two base corpora may come from two different time periods, or two very different genres, for example. I would be interested in comparing the usage/meaning of words between these two corpora. If I simply concatenated the two corpora and their vocabularies, that would not work (the location in the vector space for word pairs with different usages would just be somewhere in the "middle").
My initial idea was to train just one model, but when preparing the texts, append a suffix (_x, _y) to each word (where x and y stand for the usage of word A in corpus x/y), as well as keep a separate copy of each corpus without the suffixes, so that the vocabulary of the final concatenated training corpus would consist of: A, A_x, A_y, B, B_x, B_y ... etc, e.g.:
this is an example of corpus X
this be corpus Y yo
this_x is_x an_x example_x of_x corpus_x X_x
this_y be_y corpus_y Y_y yo_y
I figured the "mean" usages of A and B would serve as sort of "coordinates" of the space, and I could measure the distance between A_x and A_y in the same space. But then I realized since A_x and A_y never occur in the same context (due to the suffixation of all words, including the ones around them), this would probably distort the space and not work. I also know there is something called an orthogonal procrustes problem, which relates to aligning matrices, but I wouldn't know how to implement it for my case.
What would be a reasonable way to fit two GloVe models (preferably in R
and so that they work with text2vec
) into a common vector space, if my final goal is to measure the cosine similarity of word pairs, which are orthographically identical, but occur in two different corpora?
Upvotes: 3
Views: 452
Reputation: 4595
I see 2 possible solutions:
Also check http://nlp.stanford.edu/projects/histwords/, mb it will help with methodology.
Seems this is a good question for https://math.stackexchange.com/
Upvotes: 1