jrichview

Reputation: 390

totalEstimatedMatches behavior with Microsoft (Bing) Cognitive search API (v5)

Recently converted some Bing Search API v2 code to v5 and it works but I am curious about the behavior of "totalEstimatedMatches". Here's an example to illustrate my question:

A user on our site searches for a particular word. The API query returns 10 results (our page size setting) and totalEstimatedMatches set to 21. We therefore indicate 3 pages of results and let the user page through.

When they get to page 3, totalEstimatedMatches returns 22 rather than 21. Seems odd that with such a small result set it shouldn't already know it's 22, but okay I can live with that. All results are displayed correctly.

Now if the user pages back again from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This strikes me as a little surprising because once the result set has been paged through, the API probably ought to know that there are 22 and not 21 results.

I've been a professional software developer since the 80s, so I get that this is one of those devil-in-the-details issues of API design. Apparently it doesn't cache the exact number of results, or whatever. I just don't remember that kind of behavior in the v2 search API (which I realize was 3rd-party code); it was pretty reliable on the number of results.

Does this strike anyone besides me as a little bit unexpected?
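For reference, here's a sketch of the paging math described above. The `page_count` helper is hypothetical (not part of the API); it just shows how a UI might derive the number of pages from `totalEstimatedMatches` and our page size of 10:

```python
import math

def page_count(total_estimated_matches, page_size=10):
    # Number of result pages the UI offers, derived from the estimate
    # returned by the API.
    return math.ceil(total_estimated_matches / page_size)

# With totalEstimatedMatches == 21 the UI shows 3 pages...
print(page_count(21))  # 3
# ...and even if a later request reports 22, the page count is unchanged.
print(page_count(22))  # 3
```

So in this particular case the wobble between 21 and 22 happens to be harmless, but only by luck of the page-size arithmetic.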

Upvotes: 1

Views: 844

Answers (2)

Rob Truxal

Reputation: 6438

Revisiting the API, I've come up with a way to paginate efficiently without having to use the "totalEstimatedMatches" return value:

class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)  # <== abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:  # <== then we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True
        else:
            self.offset += len(resp_urls)

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...<call_logic>...
            self.calc_next_offset(new_resp_urls)
            ...<save logic>...
        print(f'All unique results for q={self.q} have been obtained.')

This^ will stop paginating as soon as a full response of duplicates has been obtained.
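To see the stopping condition in action, here's a self-contained run with a stubbed API call in place of the real Bing request (the `fake_api_call` function and its page data are made up for illustration):

```python
class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:
            self.finished = True  # a full page of duplicates: stop
        else:
            self.offset += len(resp_urls)

# Fabricated pages; the last page is all duplicates.
PAGES = [
    ['http://a', 'http://b'],  # page 1: two new results
    ['http://c', 'http://b'],  # page 2: one new, one duplicate
    ['http://a', 'http://c'],  # page 3: all duplicates -> stop
]

def fake_api_call(offset):
    # Stand-in for the real HTTP request; returns one page per offset.
    return PAGES[min(offset // 2, len(PAGES) - 1)]

worker = ApiWorker('example')
while not worker.finished:
    worker.calc_next_offset(fake_api_call(worker.offset))

print(len(worker.result_hashes))  # 3 unique results collected
```

Note that the offset still advances by the full page size even when a page is only partially new; only a page of *all* duplicates halts the loop.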

Upvotes: 0

Rob Truxal

Reputation: 6438

It turns out there's a reason the response JSON field totalEstimatedMatches includes the word ...Estimated... and isn't just called totalMatches:

"...search engine index does not support an accurate estimation of total match."

Taken from: News Search API V5 paging results with offset and count

As one might expect, the fewer results you get back, the larger the % error you're likely to see in the totalEstimatedMatches value. Similarly, the more complex your query (for example, a compound query such as ../search?q=(foo OR bar OR foobar)&... which is really 3 searches packed into 1), the more this value seems to vary.

That said, I've managed to (at least preliminarily) compensate for this by setting the offset to totalEstimatedMatches and creating a simple equivalency-checking function.

Here's a trivial example in python:

# Assumes both variables were initialized from earlier API calls.
while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        original_totalEstimatedMatches = new_totalEstimatedMatches

        # set_new_offset_and_call_api() is a func that does what it says.
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break
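For a self-contained run of that loop, here's the same logic with a stubbed set_new_offset_and_call_api (the sequence of estimates is fabricated; a real implementation would issue an HTTP request with the offset set to the current estimate and return the totalEstimatedMatches field of the response):

```python
# Fabricated totalEstimatedMatches values returned by successive calls.
ESTIMATES = iter([21, 22, 22])

def set_new_offset_and_call_api():
    # Stand-in for: set offset to the current estimate, call the API,
    # and return totalEstimatedMatches from the response.
    return next(ESTIMATES)

original_totalEstimatedMatches = set_new_offset_and_call_api()
new_totalEstimatedMatches = set_new_offset_and_call_api()

while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        original_totalEstimatedMatches = new_totalEstimatedMatches
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break

print(original_totalEstimatedMatches)  # settles at 22
```

The loop simply keeps probing until two consecutive calls agree, which is the "equivalency check" in plain form.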

Upvotes: 1
