Reputation: 1937
I am trying to create a big data app using MongoDB (coding in Java). My collection consists of ordinary text documents. Since I do not want duplicates, and the documents' text fields are too big to create a unique index on, I decided to calculate a checksum value (MessageDigest with MD5) for the text of each document, save this field in the document, and create a unique index on this field.
Roughly my document has a structure like:
{ "_id": ObjectId('5336b4942c1a99c94275e1e6') "textval": "some long text" "checksum": "444066ed458746374238266cb9dcd20c" "some_other_field": "qwertyuıop" }
So when I am adding a new document to my collection, I first check whether it already exists by looking up a document with that checksum value. If it exists I update (the other fields of) it, otherwise I insert the new document.
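My insert path currently looks roughly like this with the MongoDB Java driver (a simplified sketch using the md5Hex helper from above; the collection and variable names are placeholders):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public final class DocumentWriter {

    // Find by checksum first; update other fields on a hit, insert otherwise.
    public static void save(MongoCollection<Document> docs,
                            String text, String someOtherField) throws Exception {
        String checksum = Checksums.md5Hex(text);
        Document existing = docs.find(Filters.eq("checksum", checksum)).first();
        if (existing != null) {
            docs.updateOne(Filters.eq("checksum", checksum),
                    Updates.set("some_other_field", someOtherField));
        } else {
            docs.insertOne(new Document("textval", text)
                    .append("checksum", checksum)
                    .append("some_other_field", someOtherField));
        }
    }
}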
This strategy works! But after one million documents in the collection I started getting unacceptable insert durations. Both the checksum lookups and the inserts slowed down; I can insert ~30,000 docs in almost an hour! I have read about bulk inserts but could not decide how to handle duplicate records if I go in that direction. Any recommendations on a strategy to speed things up?
Upvotes: 3
Views: 799
Reputation: 158
I think it would be much faster if you used another collection containing only the checksum and update_time fields. Then, whenever you insert your normal JSON document, you should insert this short JSON document as well:
Your normal JSON document:
{
  "_id": ObjectId('5336b4942c1a99c94275e1e6'),
  "textval": "some long text",
  "checksum": "444066ed458746374238266cb9dcd20c",
  "update_time": new Date(1396220136948),
  "some_other_field": "qwertyuıop"
}
The short JSON document:
{
  "_id": ...,
  "checksum": "444066ed458746374238266cb9dcd20c",
  "update_time": new Date(1396220136948)
}
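In Java, that could look roughly like the sketch below (the collection names, the shared _id, and the duplicate branch are just my assumptions). The duplicate check then only hits the small checksum collection, which is where the unique index would live:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.util.Date;

public final class TwoCollectionWriter {

    public static void save(MongoCollection<Document> docs,
                            MongoCollection<Document> checksums,
                            String text, String checksum) {
        // Duplicate check against the small collection only.
        Document existing = checksums.find(Filters.eq("checksum", checksum)).first();
        if (existing != null) {
            checksums.updateOne(Filters.eq("_id", existing.getObjectId("_id")),
                    new Document("$set", new Document("update_time", new Date())));
            return;
        }
        ObjectId id = new ObjectId(); // shared _id ties the two documents together
        Date now = new Date();
        docs.insertOne(new Document("_id", id)
                .append("textval", text)
                .append("checksum", checksum)
                .append("update_time", now));
        checksums.insertOne(new Document("_id", id)
                .append("checksum", checksum)
                .append("update_time", now));
    }
}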
Upvotes: 1