Reputation: 45
I made a script that transforms information from a large number of documents in one collection into another collection with some new keys and values.
In mongodb I have 250k documents. Each document structure is something like this:
{_id:1, city:'City 1', population:200, happinessRating:20, otherRating:5, ...}
{_id:2, city:'City X', population:3000, happinessRating:15, otherRating:7, ...}
I made a Python function that transforms each document, using some complex math formulas, into another document with new keys and values that I then insert into another collection.
The purpose of the script:
Get the unique (distinct) cities that have a population greater than 500; I have a total of 1300 cities.
Loop through each city, get all documents that have that specific city as the value of the city key (for example, all documents with city: 'City 1'), apply the Python function to each document, and then insert the new transformed document into another collection.
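A minimal sketch of the first step with PyMongo (the database and collection names here are placeholders for my real ones):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]              # placeholder database name
source = db["cities_source"]     # placeholder source collection

# Distinct city names among documents with population > 500
cities = source.distinct("city", {"population": {"$gt": 500}})
print(len(cities))  # ~1300 in my case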
Scenario 1:
I fetch all the documents with one MongoDB query into a local variable (so I have everything in RAM).
I write some Python code to build a dictionary with each unique city name as the key and its documents as a list.
I loop over the dictionary, apply the function, and insert the documents into the new collection.
Final new collection documents: 250k documents
Execution time: 4 minutes
Advantage: very fast execution time
Disadvantage: it takes more than 20GB of RAM and it won't scale when the number of documents grows from 250k to 1M.
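Roughly what Scenario 1 looks like, reusing the db and source handles from the sketch above (transform() stands for my complex Python function, which is not shown; collection names are placeholders):

from collections import defaultdict

# One query: load every document into RAM
docs = list(source.find({}))

# Group the documents by city in a dictionary
by_city = defaultdict(list)
for doc in docs:
    by_city[doc["city"]].append(doc)

# Apply the transformation city by city and insert into the new collection
target = db["cities_target"]     # placeholder target collection
for city, city_docs in by_city.items():
    # transform() is the complex Python function described above
    target.insert_many([transform(d) for d in city_docs])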
Scenario 2:
I run a MongoDB aggregation (or use the distinct function) to get a list of all unique city names that have a population greater than 500.
So now I have a list of 1300 cities, the same as in Scenario 1.
Then I loop over that list and, for each city, query MongoDB for that city's documents, apply the function, and insert the transformed documents into the new collection.
Final new collection documents: 250k documents
Execution time: 72 minutes
Advantage: it doesn't need much RAM, less than 500MB
Disadvantage: the execution takes a very long time
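Roughly what Scenario 2 looks like, reusing the cities list, the collections, and transform() from the sketches above (all placeholder names):

# 1 query for the distinct city list + 1 query per city (1301 queries in total)
for city in cities:
    city_docs = source.find({"city": city})
    target.insert_many([transform(d) for d in city_docs])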
THE PROBLEM
So I don't understand why this is happening.
In Scenario 1, I make just 1 query to MongoDB, so I have everything in a Python variable in RAM.
In Scenario 2, I make 1301 queries to MongoDB: 1 to get the list of distinct cities and 1 for each city.
MongoDB runs on a local server, so the connection is very fast. In Scenario 1 it takes less than 20 seconds to load all documents into the local Python variable.
Another important detail is that the function that transforms each document is very complex and I can't express it with a MongoDB aggregation, so I have to do it through that Python function.
I hope I have explained myself well.
Any suggestion or advice on how to go ahead with the project is welcome. Thanks
Upvotes: 1
Views: 759
Reputation: 59456
This does not provide a real answer, but it illustrates the principle of composing aggregation pipelines dynamically, reusing operators, and keeping them clearly arranged:
// Build stages as separate variables so they can be reused or filled in dynamically
var match = {};
match["$match"] = { a: 1 };

var sort = { b: -1 };

var pipeline = [];
pipeline.push(match);
pipeline.push({ $sort: sort });

// concat returns a new array, so assign the result back to pipeline
pipeline = pipeline.concat({ $skip: 5 }, { $limit: 3 });

db.collection.aggregate(pipeline, { allowDiskUse: true });
You may give it a try.
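For the Python side of the question, the same composition pattern works with PyMongo, where each stage is just a dict appended to a list (a sketch using the same placeholder fields and collection as above):

# Build the pipeline stage by stage; parts can be reused or added conditionally
match = {"$match": {"a": 1}}
sort = {"b": -1}

pipeline = []
pipeline.append(match)
pipeline.append({"$sort": sort})
pipeline += [{"$skip": 5}, {"$limit": 3}]

db.collection.aggregate(pipeline, allowDiskUse=True)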
Upvotes: 1