Ruby MongodB - improving speed when working with multiple collections

Question

I'm using MongoDB with Ruby using mongo gem.

I have the following scenario:

for each document in a collection say coll1, look at key1 and key2
search for document in another collection say coll2 with matching values for key1 and key2
if there is a match, add document fetched in #2 with a new key key3 whose value be set to value of key3 in the document referenced in #1
insert the updated hash into a new collection coll3

The general guideline with MongoDB has been to handle cross collection operations in application code.

So I do the following:

    client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => some_db, 
                               :server_selection_timeout => 5)
    cursor = client[:coll1].find({}, { :projection => {:_id => 0} }) # exclude _id
    cursor.each do |doc|
        doc_coll2 = client[:coll2].find('$and' => [{:key1 => doc[:key1]}, {:key2 => doc[:key2] }]).limit(1).first # no find_one method
        if(doc_coll2 && doc[:key3])
            doc_coll2[:key3] = doc[:key3]
            doc_coll2.delete(:_id) # remove key :_id
            client[:coll3].insert_one(doc_coll2)
        end
    end

This works, but it takes a lot of time to finish this job - approximately 250ms per document in collection coll1 or 3600s (1 hour) for ~15000 records, which seems a lot, which could be associated with reading the document one at a time, do the check in app code and then writing one doc at a time back to a new collection.

Is there a way to get this operation be done faster? Is the way I'm doing even the right way to do it?

Example documents

coll1

{
    "_id" : ObjectId("588610ead0ae360cb815e55f"),
    "key1" : "115384042",
    "key2" : "276209",
    "key3" : "10101122317876"
}

coll2

{
    "_id" : ObjectId("788610ead0ae360def15e88e"),
    "key1" : "115384042",
    "key2" : "276209",
    "key4" : 10,
    "key5" : 4,
    "key6" : 0,
    "key7" : "false",
    "key8" : 0,
    "key9" : "false"
}

coll3

{
    "_id" : ObjectId("788610ead0ae360def15e88e"),
    "key1" : "115384042",
    "key2" : "276209",
    "key3" : "10101122317876",
    "key4" : 10,
    "key5" : 4,
    "key6" : 0,
    "key7" : "false",
    "key8" : 0,
    "key9" : "false"
}

felix · Accepted Answer

A solution would be to use aggregation instead, and do this in one single query:

perform a join on key1 field with $lookup
unwind the array with $unwind
keep doc where coll1.key2 == coll2.key2 with $redact
reformat the document with $project
write it to coll3 with $out

so the query would be :

db.coll1.aggregate([
    { "$lookup": { 
        "from": "coll2", 
        "localField": "key1", 
        "foreignField": "key1", 
        "as": "coll2_doc"
    }}, 
    { "$unwind": "$coll2_doc" },
    { "$redact": { 
        "$cond": [
            { "$eq": [ "$key2", "$coll2_doc.key2" ] }, 
            "$$KEEP", 
            "$$PRUNE"
        ]
    }}, 
    { 
      $project: {
         key1: 1, 
         key2: 1, 
         key3: 1, 
         key4: "$coll2_doc.key4",
         key5: "$coll2_doc.key5", 
         key6: "$coll2_doc.key6", 
         key7: "$coll2_doc.key7", 
         key8: "$coll2_doc.key8", 
     key9: "$coll2_doc.key9",  

      } 
    }, 
    {$out: "coll3"} 
], {allowDiskUse: true} );

and db.coll3.find() would return

{
    "_id" : ObjectId("588610ead0ae360cb815e55f"),
    "key1" : "115384042",
    "key2" : "276209",
    "key3" : "10101122317876",
    "key4" : 10,
    "key5" : 4,
    "key6" : 0,
    "key7" : "false",
    "key8" : 0,
    "key9" : "false"
}

Edit: MongoDB 3.4 solution

If you don't want to specify all keys in the $project stage, you can take advantage of $addFields and $replaceRoot, two new operators introduced in MongoDB 3.4

the query would become:

db.coll1.aggregate([
    { "$lookup": { 
        "from": "coll2", 
        "localField": "key1", 
        "foreignField": "key1", 
        "as": "coll2_doc"
    }}, 
    { "$unwind": "$coll2_doc" },
    { "$redact": { 
        "$cond": [
            { "$eq": [ "$key2", "$coll2_doc.key2" ] }, 
            "$$KEEP", 
            "$$PRUNE"
        ]
    }}, 
    {$addFields: {"coll2_doc.key3": "$key3" }},
    {$replaceRoot: {newRoot: "$coll2_doc"}},
    {$out: "coll3"} 
], {allowDiskUse: true})

Ruby MongodB - improving speed when working with multiple collections

Answers (2)

Edit: MongoDB 3.4 solution

Related Questions