Reputation: 45
I made a script that transforms information from a large number of documents in one collection into another collection with some new keys and values.
In mongodb I have 250k documents. Each document structure is something like this:
{_id:1, city:'City 1', population:200, happinessRating:20, otherRating:5, ...}
{_id:2, city:'City X', population:3000, happinessRating:15, otherRating:7, ...}
I made a Python function that transforms each document, using some complex math formulas, into another document with new keys and values that I then insert into another collection.
The purpose of the script:
Get the unique (distinct) cities that have a population greater than 500; I have a total of 1300 cities.
Loop through each city, get all documents that have that specific city as the value of the city key (for example, all documents with city: 'City 1'), apply the Python function to each document, and then insert the new transformed document into another collection.
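A minimal sketch of the first step with PyMongo (the database and collection names here are placeholders for my real ones):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]              # placeholder database name
source = db["cities_source"]     # placeholder source collection

# Distinct city names among documents with population > 500
cities = source.distinct("city", {"population": {"$gt": 500}})
print(len(cities))  # ~1300 in my case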
Scenario 1:
I fetch all the documents with one MongoDB query into a local variable (so I have everything in RAM).
I write some Python code to build a dictionary with each unique city name as the key and its documents as a list.
I loop over the dictionary, apply the function, and insert the documents into the new collection.
Final new collection documents: 250k documents
Execution time: 4 minutes
Advantage: very fast execution time
Disadvantage: it takes more than 20GB of RAM and it won't scale when the number of documents grows from 250k to 1M.
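Roughly what Scenario 1 looks like, reusing the db and source handles from the sketch above (transform() stands for my complex Python function, which is not shown; collection names are placeholders):

from collections import defaultdict

# One query: load every document into RAM
docs = list(source.find({}))

# Group the documents by city in a dictionary
by_city = defaultdict(list)
for doc in docs:
    by_city[doc["city"]].append(doc)

# Apply the transformation city by city and insert into the new collection
target = db["cities_target"]     # placeholder target collection
for city, city_docs in by_city.items():
    # transform() is the complex Python function described above
    target.insert_many([transform(d) for d in city_docs])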
Scenario 2:
I run a MongoDB aggregation (or use the distinct function) to get a list of all unique city names that have a population greater than 500.
So now I have a list of 1300 cities, the same as in Scenario 1.
Then I loop over that list and, for each city, query MongoDB for that city's documents, apply the function, and insert the transformed documents into the new collection.
Final new collection documents: 250k documents
Execution time: 72 minutes
Advantage: it doesn't need much RAM, less than 500MB
Disadvantage: the execution takes a very long time
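Roughly what Scenario 2 looks like, reusing the cities list, the collections, and transform() from the sketches above (all placeholder names):

# 1 query for the distinct city list + 1 query per city (1301 queries in total)
for city in cities:
    city_docs = source.find({"city": city})
    target.insert_many([transform(d) for d in city_docs])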
THE PROBLEM
So I don't understand why this is happening.
In Scenario 1, I make just 1 query to MongoDB, so I have everything in a Python variable in RAM.
In Scenario 2, I make 1301 queries to MongoDB: 1 to get the list of distinct cities and 1 for each city.
MongoDB runs on a local server, so the connection is very fast. In Scenario 1 it takes less than 20 seconds to load all documents into the local Python variable.
Another important detail is that the function that transforms each document is very complex and I can't express it with a MongoDB aggregation, so I have to do it through that Python function.
I hope I have explained myself well.
Any suggestion or advice on how to go ahead with the project is welcome. Thanks
Upvotes: 1
Views: 759
Reputation: 59456
This does not provide a real answer, but it illustrates the principle of composing aggregation pipelines dynamically, reusing operators, and keeping them clearly arranged:
// Build stages as separate variables so they can be reused or filled in dynamically
var match = {};
match["$match"] = { a: 1 };

var sort = { b: -1 };

var pipeline = [];
pipeline.push(match);
pipeline.push({ $sort: sort });

// concat returns a new array, so assign the result back to pipeline
pipeline = pipeline.concat({ $skip: 5 }, { $limit: 3 });

db.collection.aggregate(pipeline, { allowDiskUse: true });
You may give it a try.
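For the Python side of the question, the same composition pattern works with PyMongo, where each stage is just a dict appended to a list (a sketch using the same placeholder fields and collection as above):

# Build the pipeline stage by stage; parts can be reused or added conditionally
match = {"$match": {"a": 1}}
sort = {"b": -1}

pipeline = []
pipeline.append(match)
pipeline.append({"$sort": sort})
pipeline += [{"$skip": 5}, {"$limit": 3}]

db.collection.aggregate(pipeline, allowDiskUse=True)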
Upvotes: 1