Vikrant

Reputation: 159

How to effectively tune the hyper-parameters of Gensim Doc2Vec to achieve maximum accuracy in Document Similarity problem?

I have around 20k documents with 60-150 words each. Out of these 20k documents, there are 400 for which similar documents are known. These 400 documents serve as my test data.

At present I am removing those 400 documents and using the remaining 19,600 documents for training doc2vec. Then I extract the vectors of the train and test data. Now, for each test document, I find its cosine distance to all 19,600 train documents and select the top 5 with the least cosine distance. If the marked similar document is present in these top 5, I count the record as accurate. Accuracy % = No. of accurate records / Total no. of records.
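Roughly, the scoring loop looks like this (a simplified sketch, not my actual code; `test_pairs`, `test_vectors` and `train_vectors` are placeholder names for the extracted vectors and the known test/expected pairs):

```python
# Simplified sketch of my accuracy calculation (not my actual code).
# `test_pairs`: placeholder list of (test_doc_id, known_similar_train_id) tuples.
# `test_vectors` / `train_vectors`: placeholder dicts of the extracted vectors.
from scipy.spatial.distance import cosine

correct = 0
for test_id, known_similar_id in test_pairs:
    test_vec = test_vectors[test_id]
    # cosine distance from this test doc to every one of the 19,600 train docs
    distances = {train_id: cosine(test_vec, train_vectors[train_id])
                 for train_id in train_vectors}
    top5 = sorted(distances, key=distances.get)[:5]   # 5 smallest distances
    if known_similar_id in top5:
        correct += 1

accuracy = correct / len(test_pairs)
```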

The other way I find similar documents is by using doc2vec's most_similar method, then calculating accuracy with the above formula.

The two accuracies don't match: with each epoch, one increases while the other decreases.

For training the Doc2Vec model, I am using the code given here: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.

I would like to know how to tune the hyperparameters so that I can get maximum accuracy by the above-mentioned formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?

Upvotes: 1

Views: 5156

Answers (1)

gojomo

Reputation: 54153

The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.

Instead, leave the default min_alpha alone, call train() just once with the desired epochs, and let the method smoothly manage the alpha decay itself.
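As a minimal sketch of that pattern (the specific parameter values are illustrative placeholders, not recommendations, and `texts` is assumed to be a list of token-lists, one per document):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `texts` is assumed to be a list of token-lists, one per document
documents = [TaggedDocument(words=tokens, tags=[str(i)])
             for i, tokens in enumerate(texts)]

# build the model & vocabulary once, then call train() exactly once;
# gensim decays the learning-rate from `alpha` down to `min_alpha` internally
model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=40)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
```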

Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior ideas of what constitutes good results, then try a wide range of model parameters and pick the one that scores best.

When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query point against all known doc-vectors, and returns those with the greatest cosine-similarity. Those should be identical to the ones you've calculated to have the least cosine-distance. If you added to your question your exact code – how you're calculating cosine-distances, and how you're calling most_similar() – it would probably become clear what subtle differences or errors are causing the discrepancy. (There shouldn't be any essential difference, but given that there is, you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array operations that are probably faster than whatever loop you've authored.)
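For illustration, the two paths below should return the same top-5 when computed over the same trained doc-vectors (sketch assumes gensim 4.x, where the doc-vectors live under model.dv; in older releases they're under model.docvecs):

```python
import numpy as np

tag = '0'  # placeholder: tag of one query document

# Path 1: gensim's own ranking by cosine-similarity
top5_builtin = model.dv.most_similar(tag, topn=5)

# Path 2: manual cosine-similarity against every other doc-vector
query = model.dv[tag]
query = query / np.linalg.norm(query)
sims = {}
for other in model.dv.index_to_key:
    if other == tag:
        continue
    vec = model.dv[other]
    sims[other] = float(np.dot(query, vec / np.linalg.norm(vec)))
top5_manual = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:5]
```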

Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and trust that the inclusion of more documents actually helped you find the best parameters.

(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)

(If you don't feed them all during training, then during testing you'll need to use infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error during late re-inferencing is eliminated.)
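A rough sketch of that held-out case, assuming `test_tokens` is the token-list of one unseen test document (again using gensim 4.x attribute names):

```python
# vectors for docs left out of training must come from infer_vector(),
# not from a lookup of a trained doc-vector by tag
inferred = model.infer_vector(test_tokens)

# rank the trained doc-vectors against the freshly-inferred vector
nearest = model.dv.most_similar([inferred], topn=5)
```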

Checking if desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is, for each such pair of known-related documents, to also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than either is to that 3rd randomly-chosen document.
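A sketch of that pair-plus-random-document scoring, where `known_pairs` is a hypothetical list of (tag_a, tag_b) tuples of documents already known to be similar:

```python
import random

def triplet_accuracy(model, known_pairs):
    all_tags = list(model.dv.index_to_key)
    correct = 0
    for tag_a, tag_b in known_pairs:
        # pick a random third document distinct from the known pair
        rand_tag = random.choice(all_tags)
        while rand_tag in (tag_a, tag_b):
            rand_tag = random.choice(all_tags)
        # accurate if the known-similar pair is closer to each other
        # than to the randomly-chosen third document
        if model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, rand_tag):
            correct += 1
    return correct / len(known_pairs)
```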

Such an evaluation isn't perfect - sometimes it will mark the model as failing when the 3rd document, by pure chance, truly is an even-better similarity match. But overall, by leveraging known-existing relationships in a bulk & automatable way, it will generally give a higher score to models that better mimic the known relationships.

In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results page, or same category, were checked to see if they were 'closer' to each other inside a model than to other random docs.

If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.

Upvotes: 5
