Reputation: 390
Recently converted some Bing Search API v2 code to v5 and it works but I am curious about the behavior of "totalEstimatedMatches". Here's an example to illustrate my question:
A user on our site searches for a particular word. The API query returns 10 results (our page size setting) and totalEstimatedMatches set to 21. We therefore indicate 3 pages of results and let the user page through.
When they get to page 3, totalEstimatedMatches comes back as 22 rather than 21. It seems odd that with such a small result set the API wouldn't already know the count is 22, but okay, I can live with that. All results are displayed correctly.
Now if the user pages back again from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This strikes me as a little surprising because once the result set has been paged through, the API probably ought to know that there are 22 and not 21 results.
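For reference, here's roughly the shape of the calls I'm making (a minimal sketch using requests; the v5 endpoint, subscription-key header, and webPages fields are as I understand the docs, so treat the details as illustrative rather than my exact code):

import requests

ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v5.0/search'
HEADERS = {'Ocp-Apim-Subscription-Key': '<your-key>'}
PAGE_SIZE = 10

def get_page(query, page):
    params = {'q': query, 'count': PAGE_SIZE, 'offset': page * PAGE_SIZE}
    resp = requests.get(ENDPOINT, headers=HEADERS, params=params).json()
    web = resp.get('webPages', {})
    return web.get('value', []), web.get('totalEstimatedMatches')

# Page forward (1 -> 2 -> 3) and then back to 2, watching the estimate change:
for page in (0, 1, 2, 1):
    results, estimate = get_page('someword', page)
    print(f'page {page + 1}: {len(results)} results, totalEstimatedMatches={estimate}')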
I've been a professional software developer since the 80s, so I get that this is one of those devil-in-the-details issues related to the API design. Apparently it is not caching the exact number of results, or whatever. I just don't remember that kind of behavior in the V2 search API (which I realize was 3rd party code). It was pretty reliable on number of results.
Does this strike anyone besides me as a little bit unexpected?
Upvotes: 1
Views: 844
Reputation: 6438
Revisiting the API, I've come up with a way to paginate efficiently without having to use the "totalEstimatedMatches" return value:
class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)  # <== abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:  # <== then we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True
        else:
            self.offset += len(resp_urls)

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...  # <call logic: fetch the next page of result URLs at self.offset>
            self.calc_next_offset(new_resp_urls)
            ...  # <save logic>
        print(f'All unique results for q={self.q} have been obtained.')
This^ will stop paginating as soon as a full response of duplicates has been returned.
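Hypothetical usage (fetch_urls and save_urls below are stand-ins for your own call and save logic, not part of the API):

def fetch_urls(q, offset):
    # Replace with the real API call; should return the list of result URLs
    # for query q starting at the given offset.
    return []

def save_urls(urls):
    # Replace with whatever persistence you need.
    pass

worker = ApiWorker('foo')
while not worker.finished:
    urls = fetch_urls(worker.q, worker.offset)
    worker.calc_next_offset(urls)
    save_urls(urls)
print(f'Collected {len(worker.result_hashes)} unique results for q={worker.q}.')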
Upvotes: 0
Reputation: 6438
Turns out this is the reason why the response JSON field totalEstimatedMatches includes the word ...Estimated... and isn't just called totalMatches:
"...search engine index does not support an accurate estimation of total match."
Taken from: News Search API V5 paging results with offset and count
As one might expect, the fewer results you get back, the larger the % error you're likely to see in the totalEstimatedMatches value. Similarly, the more complex your query is (for example, running a compound query such as ../search?q=(foo OR bar OR foobar)&..., which is actually 3 searches packed into 1), the more variation this value seems to exhibit.
That said, I've managed to (at least preliminarily) compensate for this by setting the offset == totalEstimatedMatches and creating a simple equivalency-checking function.
Here's a trivial example in python:
# original_totalEstimatedMatches comes from the first API response; keep going
# until the estimate stops growing.
while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        original_totalEstimatedMatches = new_totalEstimatedMatches
        # set_new_offset_and_call_api() is a func that does what it says.
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break
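For completeness, a hypothetical set_new_offset_and_call_api() could look something like this (assuming the v5 Web Search endpoint, a requests-based call, and the query / original_totalEstimatedMatches globals from the loop above; a sketch, not a drop-in implementation):

import requests

ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v5.0/search'
HEADERS = {'Ocp-Apim-Subscription-Key': '<your-key>'}

def set_new_offset_and_call_api():
    # Hypothetical helper: jump the offset to the current estimate and re-query,
    # returning the (possibly revised) totalEstimatedMatches.
    params = {'q': query, 'count': 10, 'offset': original_totalEstimatedMatches}
    resp = requests.get(ENDPOINT, headers=HEADERS, params=params).json()
    web = resp.get('webPages', {})
    # Fall back to the previous estimate so the loop above terminates cleanly.
    return web.get('totalEstimatedMatches', original_totalEstimatedMatches)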
Upvotes: 1