salihcenap

Reputation: 1937

How to speed up MongoDB inserts?

I am trying to create a big data app using MongoDB (coding in Java). My collection consists of ordinary text documents. Since I do not want duplicates, and the documents' text fields are too large to build a unique index on, I decided to compute a checksum value (MessageDigest with MD5) for the text of each document, store this field in the document, and create a unique index on it.
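
For reference, a minimal sketch of that checksum step, assuming MessageDigest with MD5 and hex encoding via BigInteger (the class and method names here are mine, for illustration):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Checksum {

    // Hex-encoded MD5 digest of the document text, stored in "checksum".
    public static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        // %032x left-pads so digests with leading zero bytes still have 32 chars
        return String.format("%032x", new BigInteger(1, digest));
    }
}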

Roughly my document has a structure like:

{
  "_id": ObjectId('5336b4942c1a99c94275e1e6'),
  "textval": "some long text",
  "checksum": "444066ed458746374238266cb9dcd20c",
  "some_other_field": "qwertyuıop"
}

So when adding a new document to my collection, I first check whether a document with that checksum value already exists. If it does, I update its other fields; otherwise I insert the new document.
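
In Java driver terms that check-and-write pair can be collapsed into a single upsert, which saves one round trip per document. A sketch, assuming the current MongoDB Java sync driver (the connection string, database, and collection names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ChecksumUpsert {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs =
                    client.getDatabase("bigdata").getCollection("docs");

            String text = "some long text";
            String checksum = "444066ed458746374238266cb9dcd20c"; // md5Hex(text)

            // One round trip: match on the checksum, update the other
            // fields if the document exists, insert it otherwise.
            docs.updateOne(
                    Filters.eq("checksum", checksum),
                    Updates.combine(
                            Updates.setOnInsert("textval", text),
                            Updates.set("some_other_field", "qwertyuıop")),
                    new UpdateOptions().upsert(true));
        }
    }
}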

This strategy works! But after one million documents in the collection, insert durations became unacceptable: both the checksum lookups and the inserts slowed down, and inserting ~30,000 docs now takes almost an hour. I have read about bulk inserts but could not decide how to handle duplicate records if I go in that direction (see the sketch below). Any recommendations on a strategy to speed things up?
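
One way to combine bulk inserts with the unique index would be an unordered bulk write that treats duplicate-key errors (code 11000) as skips; the rest of the batch still goes through. A sketch, again assuming the current Java sync driver:

import com.mongodb.MongoBulkWriteException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.InsertOneModel;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BulkInsertSkippingDuplicates {

    // Insert a batch in one unordered bulk call; documents whose
    // checksum already exists fail individually with error code 11000
    // (duplicate key) while the remaining inserts are still written.
    public static void insertBatch(MongoCollection<Document> docs,
                                   List<Document> batch) {
        List<WriteModel<Document>> writes = new ArrayList<>();
        for (Document d : batch) {
            writes.add(new InsertOneModel<>(d));
        }
        try {
            docs.bulkWrite(writes, new BulkWriteOptions().ordered(false));
        } catch (MongoBulkWriteException e) {
            e.getWriteErrors().forEach(err -> {
                if (err.getCode() != 11000) { // 11000 = duplicate key
                    throw new RuntimeException(err.getMessage());
                }
                // else: duplicate checksum, safe to ignore
            });
        }
    }
}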

Upvotes: 3

Views: 799

Answers (1)

Kalman

Reputation: 158

I think it would be much faster if you used another collection containing only the checksum and update_time fields. When you insert your normal JSON document, insert this short JSON document as well:

Your normal JSON document:
{
  "_id": ObjectId('5336b4942c1a99c94275e1e6'),
  "textval": "some long text",
  "checksum": "444066ed458746374238266cb9dcd20c",
  "update_time": new Date(1396220136948),
  "some_other_field": "qwertyuıop"
}

The short JSON document:
{
  "_id": ...,
  "checksum": "444066ed458746374238266cb9dcd20c",
  "update_time": new Date(1396220136948)
}
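
A minimal sketch of this two-collection scheme, assuming the current MongoDB Java sync driver (the collection names checksums and docs are placeholders). The unique index lives only on the small collection, so inserting the short document doubles as the duplicate check:

import com.mongodb.ErrorCategory;
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.Date;

public class TwoCollectionDedup {

    public static void upsertDocument(MongoDatabase db, String text,
                                      String checksum, String otherField) {
        MongoCollection<Document> checksums = db.getCollection("checksums");
        MongoCollection<Document> docs = db.getCollection("docs");

        // Unique index on the small collection only; the dedup check
        // never has to touch the large documents.
        checksums.createIndex(Indexes.ascending("checksum"),
                new IndexOptions().unique(true));

        Date now = new Date();
        try {
            // Inserting the short document doubles as the duplicate check.
            checksums.insertOne(new Document("checksum", checksum)
                    .append("update_time", now));
            docs.insertOne(new Document("textval", text)
                    .append("checksum", checksum)
                    .append("update_time", now)
                    .append("some_other_field", otherField));
        } catch (MongoWriteException e) {
            if (ErrorCategory.fromErrorCode(e.getError().getCode())
                    != ErrorCategory.DUPLICATE_KEY) {
                throw e;
            }
            // Duplicate checksum: refresh the existing full document.
            docs.updateOne(Filters.eq("checksum", checksum),
                    Updates.combine(
                            Updates.set("update_time", now),
                            Updates.set("some_other_field", otherField)));
        }
    }
}

Note that the two inserts are not atomic: if the process dies between them, a checksum entry can exist without its full document, so you may want a cleanup pass or to write the full document first and the checksum second, depending on which orphan is cheaper to tolerate.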

Upvotes: 1
