user3206440
user3206440

Reputation: 5069

Ruby MongodB - improving speed when working with multiple collections

I'm using MongoDB with Ruby using mongo gem.

I have the following scenario:

  1. for each document in a collection say coll1, look at key1 and key2
  2. search for document in another collection say coll2 with matching values for key1 and key2
  3. if there is a match, add document fetched in #2 with a new key key3 whose value be set to value of key3 in the document referenced in #1
  4. insert the updated hash into a new collection coll3

The general guideline with MongoDB has been to handle cross collection operations in application code.

So I do the following:

    client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => some_db, 
                               :server_selection_timeout => 5)
    cursor = client[:coll1].find({}, { :projection => {:_id => 0} }) # exclude _id
    cursor.each do |doc|
        doc_coll2 = client[:coll2].find('$and' => [{:key1 => doc[:key1]}, {:key2 => doc[:key2] }]).limit(1).first # no find_one method
        if(doc_coll2 && doc[:key3])
            doc_coll2[:key3] = doc[:key3]
            doc_coll2.delete(:_id) # remove key :_id
            client[:coll3].insert_one(doc_coll2)
        end
    end

This works, but it takes a lot of time to finish this job - approximately 250ms per document in collection coll1 or 3600s (1 hour) for ~15000 records, which seems a lot, which could be associated with reading the document one at a time, do the check in app code and then writing one doc at a time back to a new collection.

Is there a way to get this operation be done faster? Is the way I'm doing even the right way to do it?

Example documents

Upvotes: 4

Views: 392

Answers (2)

user3206440
user3206440

Reputation: 5069

After toying around this for sometime, realized that index were not added. Adding index reduces the query run time by orders of magnitude.

To add index, do the following.

db.coll1.ensureIndex({"key1": 1, "key2": 1});
db.coll2.ensureIndex({"key1": 1, "key2": 1});

Using index the overall query run time came to 1/10xxxxxxth of what it was earlier.

The learning is that while working with large data sets, index the fields used for find - that itself reduces query run time a lot.

Upvotes: 0

felix
felix

Reputation: 9295

A solution would be to use aggregation instead, and do this in one single query:

  • perform a join on key1 field with $lookup
  • unwind the array with $unwind
  • keep doc where coll1.key2 == coll2.key2 with $redact
  • reformat the document with $project
  • write it to coll3 with $out

so the query would be :

db.coll1.aggregate([
    { "$lookup": { 
        "from": "coll2", 
        "localField": "key1", 
        "foreignField": "key1", 
        "as": "coll2_doc"
    }}, 
    { "$unwind": "$coll2_doc" },
    { "$redact": { 
        "$cond": [
            { "$eq": [ "$key2", "$coll2_doc.key2" ] }, 
            "$$KEEP", 
            "$$PRUNE"
        ]
    }}, 
    { 
      $project: {
         key1: 1, 
         key2: 1, 
         key3: 1, 
         key4: "$coll2_doc.key4",
         key5: "$coll2_doc.key5", 
         key6: "$coll2_doc.key6", 
         key7: "$coll2_doc.key7", 
         key8: "$coll2_doc.key8", 
     key9: "$coll2_doc.key9",  

      } 
    }, 
    {$out: "coll3"} 
], {allowDiskUse: true} );

and db.coll3.find() would return

{
    "_id" : ObjectId("588610ead0ae360cb815e55f"),
    "key1" : "115384042",
    "key2" : "276209",
    "key3" : "10101122317876",
    "key4" : 10,
    "key5" : 4,
    "key6" : 0,
    "key7" : "false",
    "key8" : 0,
    "key9" : "false"
}

Edit: MongoDB 3.4 solution

If you don't want to specify all keys in the $project stage, you can take advantage of $addFields and $replaceRoot, two new operators introduced in MongoDB 3.4

the query would become:

db.coll1.aggregate([
    { "$lookup": { 
        "from": "coll2", 
        "localField": "key1", 
        "foreignField": "key1", 
        "as": "coll2_doc"
    }}, 
    { "$unwind": "$coll2_doc" },
    { "$redact": { 
        "$cond": [
            { "$eq": [ "$key2", "$coll2_doc.key2" ] }, 
            "$$KEEP", 
            "$$PRUNE"
        ]
    }}, 
    {$addFields: {"coll2_doc.key3": "$key3" }},
    {$replaceRoot: {newRoot: "$coll2_doc"}},
    {$out: "coll3"} 
], {allowDiskUse: true})

Upvotes: 7

Related Questions