Reputation: 1069
I'm running into a rather bizarre issue that I'd love some help with. The code below runs fine when I add a meaningless limit (equal to the number of documents in the collection), but when I remove the limit the request times out, even though the result set is the same size. Any help is greatly appreciated!
from pymongo import MongoClient
import pandas as pd
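# MongoClient(...) returns a client; index it by database name to get a Database.
# `database` is assumed to hold the database name (it is not shown in the original post).
mongodb = MongoClient('mongodb://%s:%s@%s:%s' % (username, password, host, port))[database]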
numdocs = mongodb[collection].count_documents({})
# 800,000 documents
# Runs in 11.7 s
results = pd.DataFrame(list(mongodb[collection].find({}).limit(numdocs)))
# Times out, or runs for over an hour
results = pd.DataFrame(list(mongodb[collection].find({})))
Thanks to @phalanx's recommendation to run the explain statements, it looks like the root cause is that the winning plan reported by explain() differs between the two queries:
mongodb[collection].find({}).explain()
"""{'queryPlanner': {'plannerVersion': 1,
'namespace': 'mongodb.collection',
'winningPlan': {'stage': 'COLLSCAN'}},
'serverInfo': {'host': 'mongodbhost',
'port': 27017,
'version': '3.6.0'},
'ok': 1.0}"""
mongodb[collection].find({}).limit(numdocs).explain()
"""
{'queryPlanner': {'plannerVersion': 1,
'namespace': 'mongodb.collection',
'winningPlan': {'stage': 'SUBSCAN',
'inputStage': {'stage': 'LIMIT_SKIP',
'inputStage': {'stage': 'COLLSCAN'}}}},
'serverInfo': {'host': 'mongodbhost',
'port': 27017,
'version': '3.6.0'},
'ok': 1.0}"""
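To compare the two plans at a glance, the nested winningPlan can be flattened into a list of stage names. This is only a minimal sketch, assuming the explain output has the shape shown above (one nested 'inputStage' per stage); `plan_stages` is a hypothetical helper, not part of pymongo.

```python
def plan_stages(explain_output):
    """Return the winning plan's stage names, outermost stage first.

    Hypothetical helper (not part of pymongo). Assumes each stage is a dict
    with a 'stage' key and at most one nested 'inputStage', as in the
    explain() output shown above.
    """
    stages = []
    node = explain_output.get('queryPlanner', {}).get('winningPlan')
    while isinstance(node, dict):
        stages.append(node.get('stage'))
        node = node.get('inputStage')
    return stages


# The two winning plans copied from the explain() output above
no_limit = {'queryPlanner': {'winningPlan': {'stage': 'COLLSCAN'}}}
with_limit = {'queryPlanner': {'winningPlan': {
    'stage': 'SUBSCAN',
    'inputStage': {'stage': 'LIMIT_SKIP',
                   'inputStage': {'stage': 'COLLSCAN'}}}}}

print(plan_stages(no_limit))
print(plan_stages(with_limit))
```

Both plans end in the same COLLSCAN; the limited query only adds the LIMIT_SKIP and SUBSCAN stages on top of it.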
I'm going to leave this question open for the time being: while I now have a better idea of what's going on, it would still be great if someone could answer:
I'm planning on opening a ticket on the pymongo git, but I just want to make sure there aren't any obvious configuration steps I'm missing here.
Upvotes: 1
Views: 173
Reputation: 953
Have you tried running your query with a .explain("executionStats") to try and figure out what's going on?
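Note that in pymongo, `Cursor.explain()` takes no verbosity argument (the `.explain("executionStats")` form is mongo shell syntax), so one way to ask for execution statistics is to send the `explain` command through `Database.command()`. The following is a minimal sketch, not a definitive recipe: both helper names are hypothetical, `db` is assumed to be a `pymongo.database.Database`, and the server is assumed to support the requested verbosity.

```python
def build_explain_command(collection_name, query_filter=None, limit=None,
                          verbosity='executionStats'):
    """Build an explain command document that wraps a find command.

    Hypothetical helper, not part of pymongo.
    """
    find_cmd = {'find': collection_name, 'filter': query_filter or {}}
    if limit is not None:
        find_cmd['limit'] = limit
    # The command name ('explain') must be the first key in the document.
    return {'explain': find_cmd, 'verbosity': verbosity}


def explain_find(db, collection_name, **kwargs):
    """Run the explain command; `db` is assumed to be a pymongo Database."""
    return db.command(build_explain_command(collection_name, **kwargs))
```

Assuming `mongodb` is a Database object, `explain_find(mongodb, collection)` and `explain_find(mongodb, collection, limit=numdocs)` could then be compared side by side.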
Upvotes: 3