Reputation: 12141
I'm using MongoDB as a cache right now. The application is fed 3 CSVs overnight, and the CSVs keep growing because new products are added all the time. Right now I've reached 5 million records, and it took about 2 hours to process everything. As the cache is refreshed every day, refreshing the data will soon become impractical.
For example
CSV 1
ID, NAME
1, NAME!
CSV 2
ID, DESCRIPTION
1, DESC
CSV 3
ID, SOMETHING_ELSE
1, SOMETHING_ELSE
The application reads CSV 1 and puts it in the database. Then CSV 2 is read; if there is new information, it is added to the existing document, otherwise a new record is created. The same logic applies to CSV 3. So one document gets different attributes from different CSVs, hence the upsert. After everything is done, all the documents will be indexed.
Right now the first 1 million documents are relatively quick, but I can see performance degrade considerably over time. I'm guessing it's because of the upsert, as MongoDB has to find the document and update its attributes, or otherwise create it. I'm using the Java driver and MongoDB 2.4. Is there any way I could improve this, or even do a batch upsert with the MongoDB Java driver?
Upvotes: 0
Views: 436
Reputation: 825
What do you mean by 'after everything is done then all the documents will be indexed'? If you mean that you want to add additional indexes, doing it at the end is debatable, but fine. If you have absolutely no indexes during the import, then this is likely your issue.
You want to ensure that every insert/upsert you do uses an index. You can run one of your queries with .explain() to see whether an index is being used appropriately. You need an index on the field the upsert matches on; otherwise you are scanning 1 million documents for each insert/update.
Also, can you give more details about your application?
Let's assume you are doing a lot of updates on the same documents. For example, CSV2 and CSV3 have updates on the same documents. Instead of importing CSV1, then doing updates for CSV2, then another set of updates for CSV3, you may want to simply keep the documents in your application's memory, apply all the updates there, then push the documents to the database. That assumes you have enough RAM for the operation; otherwise you will be using the disk again.
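To illustrate the in-memory approach, here is a minimal sketch in plain Java. It assumes each CSV has a header row whose first column is the ID, and that values contain no quoted commas. The class name CsvMerger and its methods are hypothetical, made up for this example. Nothing here touches MongoDB; once all files are merged you would write each document to the database once, instead of upserting once per CSV.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges rows from several CSVs into one document per ID.
class CsvMerger {
    // ID -> merged attributes (column name -> value)
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    // Applies one CSV. The first line is the header; the first column is the ID.
    void apply(List<String> lines) {
        if (lines.isEmpty()) {
            return;
        }
        String[] header = splitRow(lines.get(0));
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = splitRow(line);
            if (cells[0].isEmpty()) {
                continue; // skip blank lines and rows without an ID
            }
            // Find the existing document for this ID, or create it (the "upsert").
            Map<String, String> doc = docs.get(cells[0]);
            if (doc == null) {
                doc = new LinkedHashMap<>();
                docs.put(cells[0], doc);
            }
            for (int i = 1; i < header.length && i < cells.length; i++) {
                doc.put(header[i], cells[i]);
            }
        }
    }

    // Naive split on commas; assumes no quoted fields.
    private static String[] splitRow(String line) {
        String[] parts = line.split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    Map<String, Map<String, String>> documents() {
        return docs;
    }

    public static void main(String[] args) {
        CsvMerger merger = new CsvMerger();
        merger.apply(Arrays.asList("ID, NAME", "1, NAME!"));
        merger.apply(Arrays.asList("ID, DESCRIPTION", "1, DESC"));
        merger.apply(Arrays.asList("ID, SOMETHING_ELSE", "1, SOMETHING_ELSE"));
        // prints {1={NAME=NAME!, DESCRIPTION=DESC, SOMETHING_ELSE=SOMETHING_ELSE}}
        System.out.println(merger.documents());
    }
}
```

With 5 million records the whole map has to fit in the heap. If it does not, you could process the files in ID ranges, assuming you can sort or partition them by ID first.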
Upvotes: 1