Can the gensim pretrained models be used for doc2vec models?

Question

I am trying to load a pretrained model listed here to test the similarity of a handful of paragraphs.

Can gensim's pretrained models only be used with word-level vectors, or can the models also be used for document-length vectors?

gojomo · Accepted Answer

Most of the models currently listed there (as of 2020-11-21) are just sets of word-vectors: they allow lookup of vectors by individual word, but they are not the full algorithmic model that would allow for follow-up training. (The only exception I see is the FastText model, which *might* be a full FastText model; I'm not sure. But even there, the model only reports word-vectors for known words, or synthesizes a vector for out-of-vocabulary words, with no native method of creating vectors for larger texts.)

From any set of word-vectors, there are some crude ways to either create a simple vector for larger texts (such as averaging all the word-vectors for the words of the text together), or do other comparisons between sets of words using the word-vectors to influence the similarity (such as the "Word Mover's Distance" algorithm, available on Gensim word-vector sets as wmdistance().)
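To make the averaging approach concrete, here is a minimal sketch in plain Python. The tiny 3-dimensional `word_vectors` dict and the helper names `average_vector` and `cosine_similarity` are invented for illustration. With a real pretrained set you would look each word up in the loaded vectors instead of a hand-written dict.

```python
import math

# Toy 3-dimensional word-vectors, invented purely for illustration.
# A real pretrained set would have hundreds of dimensions per word.
word_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "pet": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}

def average_vector(tokens, vectors):
    """Return the mean of the vectors of all known tokens, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words missing from the vector set ("the", "is", "a") are simply skipped.
text_a = "the cat is a pet".split()
text_b = "a dog is a pet".split()
text_c = "the stock market".split()

vec_a = average_vector(text_a, word_vectors)
vec_b = average_vector(text_b, word_vectors)
vec_c = average_vector(text_c, word_vectors)

sim_ab = cosine_similarity(vec_a, vec_b)  # two texts about pets
sim_ac = cosine_similarity(vec_a, vec_c)  # pets vs. finance

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

With these toy numbers the two pet-related texts score higher than the pet/finance pair. The averaging ignores word order and weights every word equally, which is why I call it crude: it is a reasonable baseline for short texts, not a substitute for a model trained to produce text-level vectors.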

But none of the models available via the gensim.downloader utility are for algorithms that inherently create vectors for larger texts (such as Doc2Vec).

(Separately: I would strongly recommend downloading models explicitly, as data, from their original locations, rather than using the gensim.downloader utility. That utility obscures key aspects of the process, including running extra 'shim' code for each dataset, code that is downloaded outside of normal code-versioning & package-installation processes. I consider that practice recklessly insecure.)
