emepyc

Reputation: 969

MongoDB design for fast queries on ranges

Currently, I have a MongoDB collection with documents of the following type (~1000 docs):

{   _id : "...",
     id : "0000001",
   gaps : [{start:100, end:110}, {start:132, end:166}...], // up to ~1k elems
  bounds: [68, 88, 126, 228...],                           // up to 100 elems
  length: 300,
    seq : "ACCGACCCGAGGACCCCTGAGATG..."
}

"gaps" and "bounds" values in the array refer to coordinates in "seq" and "length" refers to "seq" length. I have defined indexes on "id", "gaps" and "bounds". I need to query based on coordinates ranges. For example given "from=100" and "to=200" I want to retrieve for each document a sub-array of "gaps", a sub-array of "bounds" and the "seq" substring that lay inside the range (between 100 and 200 in this case). Right now, this is done using the following aggregation pipeline:

    db.annot.aggregate([
         {$match  : {"id" : "0000001"}},
         {$unwind : "$gaps"},
         {$unwind : "$bounds"},
         {$match  : {
                    $or : [
                         {"gaps.start" : {$gte:from, $lte:to}},
                         {"gaps.end"   : {$gte:from, $lte:to}},
                         {"bounds"     : {$gte:from, $lte:to}}
                    ]
                   }
         },
         {$project : {
                 id:1,
                 gaps:1,
                 bounds:1,
                 subseq:{$substr:["$seq", from, (to-from)]}}},
         {$group : {
                 _id    : "$id",
                 gaps   : {"$addToSet" : "$gaps"},
                 bounds : {"$addToSet" : "$bounds"},
                 subseq : {"$first" : "$subseq"}}}
    ])
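
For reference, the indexes mentioned above can be created like this (a sketch; indexing "gaps" on its "start" and "end" fields is an assumption, since only the field names are given):

    // sketch: the compound index on the gap coordinates is an assumption
    db.annot.createIndex({id: 1});
    db.annot.createIndex({"gaps.start": 1, "gaps.end": 1});
    db.annot.createIndex({bounds: 1});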

What would be the best db and query design to maximize the performance of this kind of query?

Upvotes: 1

Views: 277

Answers (1)

BatScream

Reputation: 19700

What would be the best db and query design to maximize the performance of this kind of query?

Since you ask about improving your code and design, I suggest you switch to the latest version of MongoDB if you have not already; that should be a good starting point. For this type of problem, the basic idea is to reduce the number of documents fed into each pipeline stage.

I suggest introducing an additional variable named range that contains all the integers between from and to, inclusive of both. This allows us to apply operators like $setIntersection to the bounds array.

So the variables the aggregate operation needs from the environment are:

var from = x; var to = y; var range=[x,...,y];
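
Since listing every number by hand is impractical, a small loop can build range (a minimal sketch; the concrete endpoints are illustrative):

var from = 100, to = 200;          // illustrative endpoints
var range = [];
for (var i = from; i <= to; i++) { // inclusive of both from and to
    range.push(i);
}
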
  • The first step is to $match the documents that have the given id and whose gaps sub-documents and bounds values fall within our range. This reduces the number of documents fed into the next stage, say to 500.
  • The next step is to $redact the non-conforming gaps sub-documents. This stage now works only on the 500 documents filtered in the previous step.
  • The third step is to $project the fields as needed.

Notice that we did not need the $unwind operator anywhere to achieve the task.

db.collection.aggregate([
    // Stage 1: keep only the document with the given id whose gaps and
    // bounds touch the requested range, so later stages see fewer documents.
    {$match : {"id" : "0000001",
               "gaps.start" : {$gte:from},
               "gaps.end"   : {$lte:to},
               "bounds"     : {$gte:from, $lte:to}}},
    // Stage 2: prune the gaps sub-documents that fall outside [from, to].
    // $ifNull supplies passing defaults at the top level, where no
    // "start"/"end" fields exist, so the document itself is not pruned.
    {$redact : {
        $cond : [
            {$and : [
                {$gte : [{$ifNull : ["$start", from]}, from]},
                {$lte : [{$ifNull : ["$end", to]}, to]}
            ]}, "$$DESCEND", "$$PRUNE"
        ]
    }},
    // Stage 3: shape the output, keeping only the bounds inside the range
    // and cutting the matching substring out of seq.
    {$project : {
        "bounds" : {$setIntersection : [range, "$bounds"]},
        "id"     : 1,
        "gaps"   : 1,
        "length" : 1,
        "subseq" : {$substr : ["$seq", from, (to - from)]}
    }}
])
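
To check that the initial $match stage actually uses the indexes, the same pipeline can be run with the explain option of aggregate (available since MongoDB 2.6):

// sketch: substitute the full pipeline from above for the placeholder
db.collection.aggregate([ /* pipeline above */ ], {explain: true})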

Upvotes: 1
