minji

Reputation: 1

Identifying Redundancy in Operations within doc2vec Model

I've noticed potential redundancy in the doc2vec model during similarity calculations. It appears that when selecting recommended recipes, all vectors and similarities are recalculated, and the cost seems to grow exponentially as the number of recipes grows. I'd like to address this issue and understand how redundant operations within the doc2vec model might be minimized.

I'm looking for techniques or optimization strategies within the doc2vec model that minimize redundant operations during similarity calculations and vector computations. How can I identify such redundancy within the doc2vec model? Are there specific parts of the code contributing to redundant computations?

If possible, an example from the code, or a pointer to a specific section where redundancy seems to occur, would be greatly appreciated. I'm begging you, please help me.

Thank you.

I want to know where redundant operations occur in the similarity calculation within the doc2vec model's code on GitHub.

GitHub url: https://github.com/piskvorky/gensim/blob/develop/gensim/similarities/docsim.py

Upvotes: 0

Views: 24

Answers (1)

gojomo

Reputation: 54153

When you perform the default .most_similar() operation on a Gensim Doc2Vec model's .dv doc-vectors collection, it does two things:

  • a pairwise similarity-calculation against all known doc-vectors (as learned during the model's training)
  • a sort-and-return of the top-10 results

That 1st step will dominate the time taken; it is linear in the number of doc-vectors the model stores, and consists essentially of a single bulk dot-product, done as one call to the underlying optimized BLAS library.

(The 2nd step can be more-than-linear, but since it involves only simple scalar comparisons, and only resolves the exact order of the top-10 results, it is comparatively quick.)
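
For illustration, here's a minimal sketch of that two-step process, assuming a trained Doc2Vec model saved as "my_doc2vec.model" and a stored document tag "some_doc_tag" (both hypothetical names); the manual version only approximates what the library does internally:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("my_doc2vec.model")   # hypothetical saved model

    # The built-in way: one call, returning the top-10 (tag, similarity) pairs.
    top10 = model.dv.most_similar("some_doc_tag", topn=10)

    # A rough manual equivalent of the two steps described above:
    query = model.dv.get_vector("some_doc_tag", norm=True)   # unit-length query vector
    all_vecs = model.dv.get_normed_vectors()                 # (num_docs, vector_size) matrix
    sims = all_vecs @ query                                   # step 1: one bulk dot-product, linear in num_docs
    best10 = np.argsort(-sims)[:10]                           # step 2: sort & keep the top-10 indices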

If you supply a non-default topn of more than 10, perhaps up to the full count of known doc-vectors, that balance might change a bit - but in no case should it become slower in a manner that is 'exponential' in the size of the model.

(You can also supply topn=None to get back an array of all similarities, in the order the doc-vectors are stored, without any sorting by magnitude – and thus get results in time strictly linear in the number of candidate doc-vectors compared.)
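
As a hedged sketch of that option, continuing with the hypothetical model and tag from above: compute the full similarity array once, then reuse it for whatever cut-offs you need, rather than re-running the bulk dot-product each time:

    import numpy as np

    # Full 1-D similarity array, in doc-vector storage order, unsorted.
    # (Note: unlike the default call, this raw array also includes the query doc itself.)
    all_sims = model.dv.most_similar("some_doc_tag", topn=None)

    # Reuse the same array for different cut-offs without recomputing anything:
    top10_idx = np.argsort(-all_sims)[:10]
    top100_idx = np.argsort(-all_sims)[:100]
    top10 = [(model.dv.index_to_key[i], float(all_sims[i])) for i in top10_idx]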

You point to a different source code URL within the Gensim project – for the docsim.py file – which includes some utilities for other kinds of bulk lookups & sorts, but don't say either what part of that module you're using, or show any code. So, it's hard to guess what you might be running into.

But note: some of docsim.py was mainly developed for other text-vector models (like sparse 'bag of words' doc-representations), including those that'd need to be sharded-to-disk because they overflow main RAM.

If your main approach is the Doc2Vec class – which in normal operation is only practical if your full set of dense doc-vectors can be held in RAM at once – you may not need to use any code from the docsim.py file.

If you do for some reason need to stick with your current approach, everything you need to understand the code should be in the full source you've already linked-to.

Often, finding potential performance optimizations involves profiling the code with extra tools, specifically with your data/use-cases, which can highlight exactly which ranges of code are taking the most time. You can then direct extra attention to just those critical, delay-contributing areas, as places to adjust algorithms, cache reusable results, etc.
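
For example, here's a minimal profiling sketch using the standard-library cProfile/pstats modules; recommend_all, model, and query_tags are hypothetical names standing in for your own code and data:

    import cProfile
    import pstats

    def recommend_all(model, query_tags, topn=10):
        # Hypothetical stand-in for your recommendation loop.
        return {tag: model.dv.most_similar(tag, topn=topn) for tag in query_tags}

    profiler = cProfile.Profile()
    profiler.enable()
    recommend_all(model, query_tags)
    profiler.disable()

    # Show the 20 call sites with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)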

But without even seeing your code, or the test results that implied some concerning performance issue, it's not really possible for someone else to just point at problems or potential improvements.

While it's possible there's some simple inefficiency everyone else overlooked, more often, if it were that simple, the code would already be better.

So if you want to get help to dig deeper:

  • show your evaluation code/data & timing results - to be clearer about what's involved, and what's a concern
  • share your theories, perhaps as aided by profiling output or whatever other experiments you've run that imply to you that it could be better

It may be possible that a bug or inefficient choice in your own code is contributing, in which case the proper fix would be outside the Gensim classes. (For example: are you asking for repeated identical expensive calculations that your code could cache for reuse? See the sketch below.)
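
As a purely hypothetical illustration of that caching point (not a claim about your actual code), memoising a lookup function means each distinct query pays for the bulk similarity computation only once:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def cached_neighbours(tag, topn=10):
        # most_similar() returns a list of (tag, similarity) pairs; a tuple
        # makes the cached value immutable and safe to share.
        return tuple(model.dv.most_similar(tag, topn=topn))

    # Repeated calls with the same tag hit the cache instead of re-running
    # the full dot-product against every stored doc-vector.
    for _ in range(1000):
        cached_neighbours("some_doc_tag")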

If you add more details to this question (you can edit it to expand) or post a new more-detailed question when you have a clearer target, I'll be happy to review/comment. (Alternatively, if at any point you have enough evidence to be sure, & demonstrate, a fixable undesirable performance problem with Gensim code, you could also file it as a bug/feature request in the Gensim issue-tracker at Github.)

Upvotes: 0
