toy

Reputation: 12141

MongoDB performance degrades significantly over time with upsert.

I'm using MongoDB as a cache right now. The application is fed 3 CSVs overnight, and the CSVs keep getting bigger because new products are added all the time. I've now reached 5 million records, and it takes about 2 hours to process everything. As the cache is refreshed every day, it will soon become impractical to refresh the data.

For example

CSV 1
ID, NAME
1, NAME!

CSV 2
ID, DESCRIPTION
1, DESC

CSV 3
ID, SOMETHING_ELSE
1, SOMETHING_ELSE

The application reads CSV 1 and puts it in the database. Then CSV 2 is read; if there is new information, it is added to the same document, otherwise a new record is created. The same logic applies to CSV 3. So one document gets different attributes from different CSVs, hence the upsert. After everything is done, all the documents are indexed.

Right now the first 1 million documents go in relatively quickly, but I can see the performance degrading considerably over time. I'm guessing it's because of the upsert, as MongoDB has to find the document and update its attributes, or create it if it doesn't exist. I'm using the Java driver and MongoDB 2.4. Is there any way I could improve this, or even do a batch upsert with the MongoDB Java driver?
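For reference, each CSV row currently becomes one upsert, roughly like this (simplified sketch using the 2.x Java driver; the database, collection, and field names are placeholders):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;

    public class CsvUpsert {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("cache");                         // placeholder database name
            DBCollection products = db.getCollection("products");  // placeholder collection name

            // One upsert per CSV row: match on the product ID and $set whatever
            // fields this particular CSV provides (CSV 2/3 would set description, etc.).
            BasicDBObject query = new BasicDBObject("ID", "1");
            BasicDBObject update = new BasicDBObject("$set", new BasicDBObject("NAME", "NAME!"));

            // upsert = true (create the document if it does not exist), multi = false
            products.update(query, update, true, false);

            client.close();
        }
    }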

Upvotes: 0

Views: 436

Answers (1)

Daniel Coupal

Reputation: 825

What do you mean by 'after everything is done then all the documents will be indexed'? If you mean you want to add additional indexes at the end, doing it then is debatable, but it is fine. If you have absolutely no indexes during the load, then that is likely your issue.

You want to ensure that every insert/upsert you are doing uses an index. You can run one of the commands with .explain() to see whether an index is being used appropriately. You need an index, otherwise every insert/update scans the million documents already in the collection.
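As a rough sketch with the 2.x Java driver (assuming the CSV ID is stored in a field named ID; if you store it as _id it is already indexed), you can create the index and check the query plan like this:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;

    public class IndexCheck {
        static void ensureIdIndexAndExplain(DBCollection products) {
            // Create an index on the field the upserts match on (a no-op if it already exists).
            products.ensureIndex(new BasicDBObject("ID", 1));

            // Explain a lookup by that field: "cursor" should report an index scan
            // (BtreeCursor) rather than BasicCursor, and "nscanned" should stay small.
            DBObject plan = products.find(new BasicDBObject("ID", "1")).explain();
            System.out.println(plan);
        }
    }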

Also, can you give more details about your application?

  1. are you going to do the import in 3 phases only once, or will you do frequent updates?
  2. do CSV2 and CSV3 modify a large percentage of the documents?
  3. do the modifications of CSV2 and CSV3 add or replace documents?
  4. what is the average size of your documents?

Let's assume you are doing a lot of updates on the same documents many times; for example, CSV2 and CSV3 update the same documents that CSV1 created. Instead of importing CSV1, then doing one set of updates for CSV2 and another for CSV3, you may want to simply keep the documents in the memory of your application, apply all the updates in memory, and then push the finished documents into the database. That assumes you have enough RAM for the operation, otherwise you will be hitting the disk again. A rough sketch of this approach follows below.
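For example, a sketch of that approach with the 2.x Java driver (assuming the cache collection starts empty each night, that each CSV has the ID in the first column, and using placeholder file and field names):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class MergeThenInsert {

        // Merge one CSV into the in-memory map: column 0 is the ID, column 1 the value.
        // Naive parsing for illustration only; a real CSV parser should handle quoting.
        static void mergeCsv(Map<String, BasicDBObject> docs, String path, String field) throws Exception {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                reader.readLine(); // skip the header line
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",", 2);
                    BasicDBObject doc = docs.computeIfAbsent(cols[0].trim(),
                            id -> new BasicDBObject("_id", id));
                    doc.put(field, cols[1].trim());
                }
            }
        }

        static void load(DBCollection products) throws Exception {
            Map<String, BasicDBObject> docs = new LinkedHashMap<>();
            mergeCsv(docs, "csv1.csv", "name");            // placeholder paths and field names
            mergeCsv(docs, "csv2.csv", "description");
            mergeCsv(docs, "csv3.csv", "somethingElse");

            // Push the fully merged documents as batches of plain inserts -- no upserts needed.
            List<DBObject> batch = new ArrayList<>();
            for (BasicDBObject doc : docs.values()) {
                batch.add(doc);
                if (batch.size() == 1000) {
                    products.insert(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                products.insert(batch);
            }
        }
    }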

Upvotes: 1
