Ben Saunders

Reputation: 1069

pymongo.find() timing out but working with limit of same volume as collection

I'm running into a rather bizarre issue that I'd love some help with. The code below runs fine when I add a meaningless limit (equal to the number of documents in the collection), but when I remove the limit the request times out, even though the result set is the same size. Any help is greatly appreciated!

from pymongo import MongoClient
import pandas as pd

mongodb = MongoClient('mongodb://%s:%s@%s:%s' % (username, password, host, port))

numdocs = mongodb[collection].count_documents({})
# 800,000

# Runs in 11.7s
results = pd.DataFrame(list(mongodb[collection].find({}).limit(numdocs)))

# Times out, or runs for over an hour
results = pd.DataFrame(list(mongodb[collection].find({})))

UPDATE 10/22

Thanks to @phalanx's recommendation to run the explain statements, it looks like the root cause is that the winning plan differs between the two queries:

mongodb[collection].find({}).explain()

"""{'queryPlanner': {'plannerVersion': 1,
'namespace': 'mongodb.collection',
'winningPlan': {'stage': 'COLLSCAN'}},
'serverInfo': {'host': 'mongodbhost',
'port': 27017,
'version': '3.6.0'},
'ok': 1.0}"""

mongodb[collection].find({}).limit(numdocs).explain()
"""
{'queryPlanner': {'plannerVersion': 1,
'namespace': 'mongodb.collection',
'winningPlan': {'stage': 'SUBSCAN',
'inputStage': {'stage': 'LIMIT_SKIP',
'inputStage': {'stage': 'COLLSCAN'}}}},
'serverInfo': {'host': 'mongodbhost',
'port': 27017,
'version': '3.6.0'},
'ok': 1.0}"""
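
To make the difference between the two plans easier to see at a glance, here is a minimal sketch that flattens a `winningPlan` dict into its chain of stage names. The helper name `plan_stages` is hypothetical, and it assumes each stage has at most one `inputStage` (as in the output above); plans that branch via `inputStages` are not handled.

```python
def plan_stages(plan):
    """Return the stage names of a winningPlan dict, outermost stage first.

    Assumes a linear plan: each stage has at most one 'inputStage' child.
    """
    stages = []
    while plan is not None:
        stages.append(plan.get("stage"))
        plan = plan.get("inputStage")
    return stages


# The two winning plans copied from the explain() output above.
unlimited_plan = {"stage": "COLLSCAN"}
limited_plan = {
    "stage": "SUBSCAN",
    "inputStage": {
        "stage": "LIMIT_SKIP",
        "inputStage": {"stage": "COLLSCAN"},
    },
}

print(plan_stages(unlimited_plan))  # ['COLLSCAN']
print(plan_stages(limited_plan))    # ['SUBSCAN', 'LIMIT_SKIP', 'COLLSCAN']
```

Both plans bottom out in the same `COLLSCAN`; the only difference is the two stages wrapped around it when a limit is present.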

I'm going to leave this question open for the time being: I now have a better idea of what's going on, but it would still be great if someone could answer:

  1. Why the different winning query plan?
  2. Why is this different plan so much slower?

I'm planning on opening a ticket on the pymongo git; I just want to make sure there aren't any obvious configuration steps I'm missing here.

Upvotes: 1

Views: 173

Answers (1)

prhmma

Reputation: 953

Have you tried running your query with a .explain("executionStats") to try and figure out what's going on?
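
One caveat for pymongo users: as far as I know, `.explain("executionStats")` is the mongo shell form, and pymongo's `Cursor.explain()` does not take a verbosity argument. A possible way to request a specific verbosity from Python is to run the `explain` command directly through `Database.command()`. The sketch below assumes `db` is a pymongo `Database` object; the helper name `explain_find` and its parameters are hypothetical.

```python
def explain_find(db, collection_name, query=None, limit=None,
                 verbosity="executionStats"):
    """Run the server's explain command for a find, with a chosen verbosity.

    `db` is assumed to be a pymongo Database (anything exposing a compatible
    .command() method works). Returns the raw explain document.
    """
    # Build the find command that the server should explain.
    find_cmd = {"find": collection_name, "filter": query or {}}
    if limit is not None:
        find_cmd["limit"] = limit
    # Extra keyword arguments are added to the command document,
    # so this sends {explain: find_cmd, verbosity: verbosity}.
    return db.command("explain", find_cmd, verbosity=verbosity)
```

Calling it once without a limit and once with `limit=numdocs` (for example `explain_find(db, "mycoll")` and `explain_find(db, "mycoll", limit=numdocs)`, where `"mycoll"` is a placeholder collection name) should give execution statistics for both plans so they can be compared side by side.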

Upvotes: 3
