Prat

Reputation: 11

Pymongo parallel_scan

I need to read from a very large collection and perform some operations on each of its documents.

I'm using pymongo's parallel_scan to spread those operations across a number of processes to improve efficiency.

cursors = mongo_collection.parallel_scan(6)

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=process_cursor, args=(cursor,))
                 for cursor in cursors]
    for process in processes:
        process.start()
    for process in processes:
        process.join()

The processes that consume these cursors start and run as expected, but most of them finish their share and exit, and eventually a single process keeps running for a long time.

It looks like this is because parallel_scan does not distribute the documents evenly among the cursors. How do I make all the cursors return an almost equal number of documents?

Upvotes: 1

Views: 1614

Answers (1)

djsavvy

Reputation: 324

One solution that worked for me in a similar situation was to increase the argument passed to parallel_scan. This argument (which must be between 1 and 10,000) sets the maximum number of cursors that MongoDB's parallelCollectionScan command returns. Although requesting, say, 20 cursors launches quite a few processes at once, the processes handling the much shorter cursors finish relatively quickly, leaving the desired 4-5 cursors processing for much longer.
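A minimal sketch of that approach, assuming a placeholder process_cursor worker and hypothetical mydb / mycollection names:

import multiprocessing

import pymongo

def process_cursor(cursor):
    # Placeholder worker: iterate one cursor and handle each document.
    for document in cursor:
        pass  # per-document work goes here

if __name__ == '__main__':
    client = pymongo.MongoClient()
    collection = client.mydb.mycollection  # hypothetical names
    # Ask for many more cursors than the 4-5 you ultimately want busy;
    # the short cursors drain quickly and the long ones keep working.
    cursors = collection.parallel_scan(20)
    processes = [multiprocessing.Process(target=process_cursor, args=(c,))
                 for c in cursors]
    for p in processes:
        p.start()
    for p in processes:
        p.join()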

Also, a quick note: according to the PyMongo FAQ, PyMongo is not fork-safe, so it is not safe to use with multiprocessing on Unix systems this way. Your method ends up copying the MongoClient used to call parallel_scan into each new process it forks.
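For completeness: the fork-safe pattern the FAQ recommends is to construct the MongoClient inside each child process, after the fork. Since parallel_scan's cursors are tied to the parent's client, the sketch below swaps in a hypothetical _id-range partition instead of parallel_scan; the split points and the mydb / mycollection names are assumptions:

import multiprocessing

import pymongo

def process_range(lower, upper):
    # Per the PyMongo FAQ, open a fresh client in each child after forking.
    client = pymongo.MongoClient()
    collection = client.mydb.mycollection  # hypothetical names
    for document in collection.find({'_id': {'$gte': lower, '$lt': upper}}):
        pass  # per-document work goes here

if __name__ == '__main__':
    # Hypothetical precomputed _id split points covering the collection.
    boundaries = [0, 250000, 500000, 750000, 1000000]
    processes = [multiprocessing.Process(target=process_range, args=(lo, hi))
                 for lo, hi in zip(boundaries, boundaries[1:])]
    for p in processes:
        p.start()
    for p in processes:
        p.join()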

Upvotes: 1
