Daniel

Reputation: 45

Iterating over MongoDB documents from Python is very slow compared to holding all documents in a local variable in RAM

I made a script that transforms information from a large number of documents in one collection into another collection with some new keys and values.

In MongoDB I have 250k documents. Each document's structure is something like this:

{_id:1, city:'City 1', population:200, happinessRating:20, otherRating:5, ...}

{_id:2, city:'City X', population:3000, happinessRating:15, otherRating:7, ...}

I wrote a Python function that transforms each document, using some complex math formulas, into another document with new keys and values, which I then insert into another collection.

The purpose of the script:

Get the unique (distinct) cities that have more than 500 population; I have a total of 1300 cities.

Loop through each city, get all documents that have that specific city as the value of the city key (for example: all documents that have city: 'City 1'), apply the Python function to each document, and then insert the transformed documents into another collection.

Scenario 1:

I fetch all the documents in one query from MongoDB into a local variable (so I have everything in RAM).

I write some Python code to build a dictionary with each unique city name as a key and its documents as a list.

I loop over the dictionary, apply the function, and insert the documents into the new collection.

Final new collection documents: 250k documents

Execution time: 4 minutes

Advantage: very fast execution time

Disadvantage: it takes more than 20GB of RAM, and it's not scalable once the number of documents grows to 1M instead of 250k
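In code, Scenario 1 looks roughly like this (a simplified sketch; the collection names and the transform() stub are placeholders for my real ones):

from collections import defaultdict
from pymongo import MongoClient

client = MongoClient()                      # local MongoDB server
src = client["mydb"]["cities"]              # placeholder source collection
dst = client["mydb"]["cities_transformed"]  # placeholder target collection

def transform(doc):
    # stands in for my complex math formulas
    return {"city": doc["city"], "score": doc["happinessRating"] + doc["otherRating"]}

# 1 query: pull everything into RAM (this is what eats the 20GB)
docs = list(src.find())

# group the documents by city name
by_city = defaultdict(list)
for doc in docs:
    by_city[doc["city"]].append(doc)

# transform and bulk-insert, city by city
for city, city_docs in by_city.items():
    dst.insert_many([transform(d) for d in city_docs])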

Scenario 2:

I run a MongoDB aggregation (or use the distinct function) to get a list of all unique city names that have more than 500 population.

So now I have a list of 1300 cities, the same as in Scenario 1.

Then I loop and, for each city, get the MongoDB documents of that specific city, apply the function, and insert the resulting documents into the new collection.

Final new collection documents: 250k documents

Execution time: 72 minutes

Advantage: doesn't need much RAM, less than 500MB

Disadvantage: it takes a long time to finish executing
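In code, Scenario 2 looks roughly like this (same placeholder names and transform() stub as in the Scenario 1 sketch):

from pymongo import MongoClient

client = MongoClient()
src = client["mydb"]["cities"]
dst = client["mydb"]["cities_transformed"]

def transform(doc):
    # stands in for my complex math formulas
    return {"city": doc["city"], "score": doc["happinessRating"] + doc["otherRating"]}

# 1 query: distinct city names with population > 500 (about 1300 cities)
cities = src.distinct("city", {"population": {"$gt": 500}})

# 1300 more queries: one find() per city
for city in cities:
    city_docs = src.find({"city": city})
    dst.insert_many([transform(d) for d in city_docs])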

THE PROBLEM

So I don't understand why this is happening.

In Scenario 1, I make just 1 query to MongoDB, so I have everything in a Python variable in RAM.

In Scenario 2, I make 1301 queries to MongoDB: 1 to get the list of distinct unique cities, and 1 query for each city.

MongoDB runs as a local server, so the connection is very fast. In Scenario 1 it takes less than 20s to load all the documents into the local Python variable.

Another important piece of information: the function that transforms each document is very complex and can't be expressed as a MongoDB aggregation, so I have to do it through that Python function.

I hope I have explained myself well.

Any suggestion or advice on how to go ahead with the project is welcome. Thanks

Upvotes: 1

Views: 759

Answers (1)

Wernfried Domscheit

Reputation: 59456

This does not provide a real answer, but it illustrates the principle of composing dynamic aggregation pipelines, reusing operators, and keeping the code clearly arranged:

// build each stage as a plain object so it can be reused or modified
var match = {};
match["$match"] = { a: 1 };

var sort = { b: -1 };

// compose the pipeline dynamically, stage by stage
var pipeline = [];
pipeline.push(match);
pipeline.push({ $sort: sort });
pipeline = pipeline.concat({ $skip: 5 }, { $limit: 3 });

// allowDiskUse lets large aggregations spill to disk
db.collection.aggregate(pipeline, { allowDiskUse: true })
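The same composition works from Python with PyMongo, for example (just a sketch with dummy names):

from pymongo import MongoClient

db = MongoClient()["test"]

# build each stage separately so it can be reused
match = {"$match": {"a": 1}}
sort = {"b": -1}

pipeline = []
pipeline.append(match)
pipeline.append({"$sort": sort})
pipeline += [{"$skip": 5}, {"$limit": 3}]

for doc in db.collection.aggregate(pipeline, allowDiskUse=True):
    print(doc)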

You may give it a try.

Upvotes: 1
