Prat

Reputation: 11

Pymongo parallel_scan

I need to read from a very large collection and perform some operations on each of its documents.

I'm using pymongo's parallel_scan to spread those operations across a number of processes to improve efficiency.

cursors = mongo_collection.parallel_scan(6)

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=process_cursor, args=(cursor,))
                 for cursor in cursors]
    for process in processes:
        process.start()
    for process in processes:
        process.join()

The processes that consume these cursors start and run as expected, but most of them finish their share and exit, and eventually a single process keeps running for a long time.

It looks like this is because parallel_scan does not distribute the documents evenly among the cursors. How do I make all the cursors return an almost equal number of documents?

Upvotes: 1

Views: 1614

Answers (1)

djsavvy

Reputation: 324

One solution that worked for me in a similar situation was to increase the argument passed to parallel_scan. This argument (which must be between 1 and 10,000) sets the maximum number of cursors that MongoDB's parallelCollectionScan command returns. Although requesting, say, 20 cursors launches quite a few processes at once, the processes handling the much shorter cursors finish relatively quickly, leaving the desired 4-5 cursors processing for much longer.
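A minimal sketch of that approach, assuming a placeholder process_cursor worker and hypothetical mydb / mycollection names:

import multiprocessing

import pymongo

def process_cursor(cursor):
    # Placeholder worker: iterate one cursor and handle each document.
    for document in cursor:
        pass  # per-document work goes here

if __name__ == '__main__':
    client = pymongo.MongoClient()
    collection = client.mydb.mycollection  # hypothetical names
    # Ask for many more cursors than the 4-5 you ultimately want busy;
    # the short cursors drain quickly and the long ones keep working.
    cursors = collection.parallel_scan(20)
    processes = [multiprocessing.Process(target=process_cursor, args=(c,))
                 for c in cursors]
    for p in processes:
        p.start()
    for p in processes:
        p.join()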

Also, a quick note: according to the PyMongo FAQ, PyMongo is not fork-safe, so it is not safe to use with multiprocessing on Unix systems this way. Your method ends up copying the MongoClient used to call parallel_scan into each new process it forks.
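For completeness: the fork-safe pattern the FAQ recommends is to construct the MongoClient inside each child process, after the fork. Since parallel_scan's cursors are tied to the parent's client, the sketch below swaps in a hypothetical _id-range partition instead of parallel_scan; the split points and the mydb / mycollection names are assumptions:

import multiprocessing

import pymongo

def process_range(lower, upper):
    # Per the PyMongo FAQ, open a fresh client in each child after forking.
    client = pymongo.MongoClient()
    collection = client.mydb.mycollection  # hypothetical names
    for document in collection.find({'_id': {'$gte': lower, '$lt': upper}}):
        pass  # per-document work goes here

if __name__ == '__main__':
    # Hypothetical precomputed _id split points covering the collection.
    boundaries = [0, 250000, 500000, 750000, 1000000]
    processes = [multiprocessing.Process(target=process_range, args=(lo, hi))
                 for lo, hi in zip(boundaries, boundaries[1:])]
    for p in processes:
        p.start()
    for p in processes:
        p.join()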

Upvotes: 1
