Reputation: 373
With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)
However, the cursor always returns the full collection immediately, even though I'm using it as described in the docs.
Does anyone have an idea how I would properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still fetch the full collection first and only loop over the documents already pulled into memory. The batch_size parameter is supposed to fetch one batch per round-trip to the server, to save memory.
Upvotes: 21
Views: 22117
Reputation: 36350
This is how I do it; it helps get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks, and it ensures the used chunks are deleted.
import pymongo as pm
CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)
def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this I'll update my answer.
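For comparison, here is a more compact sketch of the same chunking idea (not benchmarked; yield_chunks is a made-up name, and the db/coll setup just mirrors the snippet above). itertools.islice pulls up to chunk_size documents per pass, and a fresh list is yielded each time, so nothing has to be cleared:
import pymongo as pm
from itertools import islice

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

def yield_chunks(cursor, chunk_size):
    # Pull at most chunk_size documents from the cursor per pass;
    # islice returns an empty slice once the cursor is exhausted.
    while True:
        chunk = list(islice(cursor, chunk_size))
        if not chunk:
            break
        yield chunk

for chunk in yield_chunks(coll.find({}, batch_size=CHUNK_SIZE), CHUNK_SIZE):
    # do processing here
    pass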
Upvotes: 13
Reputation: 13795
PyMongo has some quality-of-life helpers for the Cursor class, so it will automatically do the batching for you and return results to you as documents. The batch_size setting does take effect, but the idea is that you only need to set it in the find() method, and not have to make manual low-level calls or iterate through the batches yourself.
For example, if I have 100 documents in my collection:
> db.test.count()
100
I then set the profiling level to log all queries:
> db.setProfilingLevel(0,-1)
{
"was": 0,
"slowms": 100,
"sampleRate": 1,
"ok": 1,
...
I then use PyMongo to specify a batch_size of 10:
import pymongo
import bson
conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))
Running that query, I see in the MongoDB log:
2019-02-22T15:03:54.522+1100 I COMMAND [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, ....
(getMore repeated 9 more times)
So the query was fetched from the server in the specified batches. It's just hidden from you via the Cursor class.
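Note that print(list(cur)) above still materializes every document in memory at once; to actually benefit from the batching memory-wise, you would iterate the cursor instead. A minimal sketch against the same test collection:
import pymongo

conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt': 0}, batch_size=10)

for doc in cur:
    # Only the current batch of 10 documents is buffered client-side;
    # the next getMore is issued transparently when it runs out.
    print(doc['_id'])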
Edit
If you really need to get the documents in batches, there is a find_raw_batches() method on Collection (doc link). It works similarly to find() and accepts the same parameters. However, be advised that it returns raw BSON, which needs to be decoded by the application in a separate step. Notably, this method does not support sessions.
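As a rough sketch of what that looks like (same test collection as above), each raw batch is a block of BSON bytes that can be decoded with bson.decode_all():
import pymongo
import bson

conn = pymongo.MongoClient()
raw_cur = conn.test.test.find_raw_batches({}, batch_size=10)

for raw_batch in raw_cur:
    # raw_batch is the undecoded BSON payload of one batch
    docs = bson.decode_all(raw_batch)
    print(len(docs))  # up to 10 documents per batch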
Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:
find({'field': {'$gte': <some criteria>, '$lte': <some other criteria>}})
Range queries are easier to optimize, can use indexes, and are (in my opinion) easier to debug and easier to restart should the query get interrupted. Batches are less flexible in this respect: if the query gets interrupted, you have to restart it from scratch and go over all the batches again.
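For instance, a common way to do this (a sketch; the collection name and page size are assumptions) is to page on the indexed _id field, so each pass is an independent, restartable range query:
import pymongo

conn = pymongo.MongoClient()
coll = conn.test.test
PAGE_SIZE = 500

last_id = None
while True:
    # resume from the last _id seen; an empty page means we're done
    query = {} if last_id is None else {'_id': {'$gt': last_id}}
    page = list(coll.find(query).sort('_id', pymongo.ASCENDING).limit(PAGE_SIZE))
    if not page:
        break
    for doc in page:
        pass  # do processing here
    last_id = page[-1]['_id']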
Upvotes: 21