Reputation: 294
In gensim I have a trained doc2vec model, if I have a document and either a single word or two-three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
On first thought I could get the cosine similarity from each word in the 1-3 word string and every word in the document taking the averages, but I dont know how effective this would be.
Upvotes: 1
Views: 491
Reputation: 54243
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec
model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector()
method. However, like Doc2Vec
in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector()
, so if you feed it a 3-word documents for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
Many Doc2Vec
modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1
) does this, or PV-DBOW (dm=0
) if you add the optional interleaved word-vector training (dbow_words=1
). (If you use dm=0, dbow_words=0
, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec
model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.
Upvotes: 3