Reputation: 373
With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)
However, the cursor always returns the full collection immediately, even though I'm using it as described in the docs.
Does anyone have an idea how I would properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still fetch the full collection first and only loop over the documents already pulled into memory. The batch_size parameter is supposed to fetch one batch per round-trip to the server, to save memory.
Upvotes: 21
Views: 22117
Reputation: 36350
This is how I do it; it helps get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks, and it ensures the used chunks are deleted.
import pymongo as pm
CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)
def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this I'll update my answer.
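For comparison, here is a more compact sketch of the same chunking idea (not benchmarked; yield_chunks is a made-up name, and the db/coll setup just mirrors the snippet above). itertools.islice pulls up to chunk_size documents per pass, and a fresh list is yielded each time, so nothing has to be cleared:
import pymongo as pm
from itertools import islice

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

def yield_chunks(cursor, chunk_size):
    # Pull at most chunk_size documents from the cursor per pass;
    # islice returns an empty slice once the cursor is exhausted.
    while True:
        chunk = list(islice(cursor, chunk_size))
        if not chunk:
            break
        yield chunk

for chunk in yield_chunks(coll.find({}, batch_size=CHUNK_SIZE), CHUNK_SIZE):
    # do processing here
    pass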
Upvotes: 13
Reputation: 13795
PyMongo has some quality-of-life helpers for the Cursor class, so it will automatically do the batching for you and return results to you as documents. The batch_size setting does take effect, but the idea is that you only need to set it in the find() method, and not have to make manual low-level calls or iterate through the batches yourself.
For example, if I have 100 documents in my collection:
> db.test.count()
100
I then set the profiling level to log all queries:
> db.setProfilingLevel(0,-1)
{
"was": 0,
"slowms": 100,
"sampleRate": 1,
"ok": 1,
...
I then use PyMongo to specify a batch_size of 10:
import pymongo
import bson
conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))
Running that query, I see in the MongoDB log:
2019-02-22T15:03:54.522+1100 I COMMAND [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, ....
(getMore repeated 9 more times)
So the query was fetched from the server in the specified batches. It's just hidden from you via the Cursor class.
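Note that print(list(cur)) above still materializes every document in memory at once; to actually benefit from the batching memory-wise, you would iterate the cursor instead. A minimal sketch against the same test collection:
import pymongo

conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt': 0}, batch_size=10)

for doc in cur:
    # Only the current batch of 10 documents is buffered client-side;
    # the next getMore is issued transparently when it runs out.
    print(doc['_id'])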
Edit
If you really need to get the documents in batches, there is a find_raw_batches() method on Collection (doc link). It works similarly to find() and accepts the same parameters. However, be advised that it returns raw BSON, which needs to be decoded by the application in a separate step. Notably, this method does not support sessions.
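As a rough sketch of what that looks like (same test collection as above), each raw batch is a block of BSON bytes that can be decoded with bson.decode_all():
import pymongo
import bson

conn = pymongo.MongoClient()
raw_cur = conn.test.test.find_raw_batches({}, batch_size=10)

for raw_batch in raw_cur:
    # raw_batch is the undecoded BSON payload of one batch
    docs = bson.decode_all(raw_batch)
    print(len(docs))  # up to 10 documents per batch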
Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:
find({'field': {'$gte': <some criteria>, '$lte': <some other criteria>}})
Range queries are easier to optimize, can use indexes, and are (in my opinion) easier to debug and easier to restart should the query get interrupted. Batches are less flexible in this respect: if the query gets interrupted, you have to restart it from scratch and go over all the batches again.
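For instance, a common way to do this (a sketch; the collection name and page size are assumptions) is to page on the indexed _id field, so each pass is an independent, restartable range query:
import pymongo

conn = pymongo.MongoClient()
coll = conn.test.test
PAGE_SIZE = 500

last_id = None
while True:
    # resume from the last _id seen; an empty page means we're done
    query = {} if last_id is None else {'_id': {'$gt': last_id}}
    page = list(coll.find(query).sort('_id', pymongo.ASCENDING).limit(PAGE_SIZE))
    if not page:
        break
    for doc in page:
        pass  # do processing here
    last_id = page[-1]['_id']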
Upvotes: 21