Reputation: 11
I need to read from a very large collection and do some operations on each document.
I'm using pymongo's parallel_scan to run those operations across a number of processes to improve efficiency.
cursors = mongo_collection.parallel_scan(6)
if __name__ == '__main__':
    processes = [multiprocessing.Process(target=process_cursor, args=(cursor,))
                 for cursor in cursors]
    for process in processes:
        process.start()
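For reference, process_cursor is not shown above; a minimal sketch of such a worker, assuming it simply iterates its cursor and applies the per-document operation (do_something is a placeholder name):

def process_cursor(cursor):
    # Hypothetical worker: iterate the documents this cursor yields and
    # apply the per-document operation to each one.
    for document in cursor:
        do_something(document)  # placeholder for the real operation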
The processes that consume these cursors start as expected, but most of them finish their share and exit quickly, and in the end only one process keeps running for a long time.
It looks like this is because parallel_scan does not distribute the documents evenly among the cursors. How do I make all the cursors receive a roughly equal number of documents?
Upvotes: 1
Views: 1614
Reputation: 324
One solution that worked for me in a similar situation was to increase the parameter passed to parallel_scan. This parameter (which must have a value between 0 and 10,000) controls the maximum number of cursors that Mongo's parallelCollectionScan command returns. Although requesting 20 cursors or so starts a number of processes at once, the processes handling the much shorter cursors finish relatively quickly, leaving the desired 4-5 longer-running cursors processing in parallel for much longer.
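A minimal sketch of that change, reusing mongo_collection and process_cursor from the question (the server may return fewer cursors than requested):

# Ask parallelCollectionScan for more cursors than you have long-running workers;
# the shorter cursors drain quickly and the longer ones keep running in parallel.
cursors = mongo_collection.parallel_scan(20)

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=process_cursor, args=(cursor,))
                 for cursor in cursors]
    for process in processes:
        process.start()
    for process in processes:
        process.join()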
Also, a quick note: according to the PyMongo FAQ, MongoClient is not fork-safe, so instances must not be copied from a parent process to its children. Your method ends up copying the MongoClient used to call parallel_scan into each new process it forks.
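A minimal sketch of the fork-safe pattern the FAQ recommends, where each child creates its own MongoClient after the fork; the connection string, database/collection names, and the query-based work split are hypothetical placeholders, not the parallel_scan partitioning:

import multiprocessing
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"  # hypothetical connection string

def worker(query):
    # Create the MongoClient *inside* the child process, after the fork,
    # instead of inheriting the parent's client.
    client = MongoClient(MONGO_URI)
    collection = client.mydb.mycollection  # hypothetical database/collection
    for document in collection.find(query):
        pass  # per-document work goes here
    client.close()

if __name__ == '__main__':
    # Placeholder filters that split the collection between two workers;
    # pick a split that suits your own data.
    queries = [{"shard_key": {"$lt": 1000}}, {"shard_key": {"$gte": 1000}}]
    processes = [multiprocessing.Process(target=worker, args=(q,)) for q in queries]
    for p in processes:
        p.start()
    for p in processes:
        p.join()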
Upvotes: 1