Reputation: 969
Currently, I have a MongoDB collection with documents of the following type (~1000 docs):
{
    _id : "...",
    id : "0000001",
    gaps : [{start:100, end:110}, {start:132, end:166}...],  // up to ~1k elems
    bounds : [68, 88, 126, 228...],                          // up to 100 elems
    length : 300,
    seq : "ACCGACCCGAGGACCCCTGAGATG..."
}
The "gaps" and "bounds" values in the arrays refer to coordinates in "seq", and "length" is the length of "seq". I have defined indexes on "id", "gaps" and "bounds". I need to query based on coordinate ranges. For example, given "from=100" and "to=200", I want to retrieve for each document the sub-array of "gaps", the sub-array of "bounds", and the substring of "seq" that lie inside the range (between 100 and 200 in this case). Right now, this is done using the following aggregation pipeline:
db.annot.aggregate([
    {$match : {"id" : "0000001"}},
    {$unwind : "$gaps"},
    {$unwind : "$bounds"},
    {$match : {
        $or : [
            {"gaps.start" : {$gte:from, $lte:to}},
            {"gaps.end" : {$gte:from, $lte:to}},
            {"bounds" : {$gte:from, $lte:to}}
        ]
    }},
    {$project : {
        id : 1,
        gaps : 1,
        bounds : 1,
        subseq : {$substr : ["$seq", from, (to - from)]}
    }},
    {$group : {
        _id : "$id",
        gaps : {$addToSet : "$gaps"},
        bounds : {$addToSet : "$bounds"},
        subseq : {$first : "$subseq"}
    }}
])
What would be the best db and query design to maximize the performance of this kind of query?
Upvotes: 1
Views: 277
Reputation: 19700
What would be the best db and query design to maximize the performance of this kind of query?
Since you ask about improving your code and design, I suggest switching to the latest version of MongoDB if you have not already; that should be a good starting point. For this type of problem, the basic idea is to reduce the number of documents being input to each pipeline stage.
I suggest adding a variable named range which contains all the numbers between from and to, inclusive of both. This allows us to apply operators such as $setIntersection to the bounds array.
So the variables the aggregate operation needs from the environment are:
var from = x; var to = y; var range = [x, ..., y];
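Since range must enumerate every integer from from to to inclusive, it can be built in the shell before running the aggregation. A minimal sketch, using example coordinates in place of x and y:

```javascript
// Build the inclusive [from, to] range used by $setIntersection.
// Example values; substitute your actual query coordinates.
var from = 100, to = 200;
var range = [];
for (var i = from; i <= to; i++) {
  range.push(i);
}
// range now holds the 101 integers 100, 101, ..., 200
```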
The pipeline then becomes:
- $match the documents whose id, gaps sub-documents and bounds values fall in our range. This reduces the number of documents being input to the next stage, to say 500.
- $redact the non-conforming gaps sub-documents. This stage now works only on the 500 documents filtered in the previous step.
- $project the fields as we need them.
Notice that we have not needed to use the $unwind operator anywhere to achieve the task.
db.collection.aggregate([
    // Filter whole documents first, so later stages see fewer documents.
    {$match : {"id" : "0000001",
               "gaps.start" : {$gte:from},
               "gaps.end" : {$lte:to},
               "bounds" : {$gte:from, $lte:to}}},
    // Walk each document top-down: DESCEND into sub-documents whose
    // start/end lie inside [from, to], PRUNE the rest. At levels where
    // start/end do not exist, $ifNull substitutes values that always pass.
    {$redact : {
        $cond : [
            {$and : [
                {$gte : [{$ifNull : ["$start", from]}, from]},
                {$lte : [{$ifNull : ["$end", to]}, to]}
            ]},
            "$$DESCEND",
            "$$PRUNE"
        ]
    }},
    // Shape the output: intersect bounds with range and cut the substring.
    {$project : {
        "bounds" : {$setIntersection : [range, "$bounds"]},
        "id" : 1,
        "gaps" : 1,
        "length" : 1,
        "subseq" : {$substr : ["$seq", from, (to - from)]}
    }}
])
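To make the intended result concrete, here is a plain JavaScript sketch of what the $redact, $setIntersection and $substr stages compute for a single document. The document values are invented for illustration, and seq is a dummy 300-character string:

```javascript
var from = 100, to = 200;

// Invented example document; seq is a dummy 300-character sequence.
var doc = {
  id: "0000001",
  gaps: [{start: 100, end: 110}, {start: 132, end: 166}, {start: 210, end: 250}],
  bounds: [68, 88, 126, 228],
  length: 300,
  seq: new Array(301).join("A")
};

// $redact keeps only the gaps sub-documents lying fully inside [from, to]
var gaps = doc.gaps.filter(function (g) {
  return g.start >= from && g.end <= to;
});

// $setIntersection keeps only the bounds values present in range
var bounds = doc.bounds.filter(function (b) {
  return b >= from && b <= to;
});

// $substr cuts the sub-sequence starting at `from`
var subseq = doc.seq.substr(from, to - from);
// gaps   -> [{start:100, end:110}, {start:132, end:166}]
// bounds -> [126]
// subseq -> a 100-character substring of seq
```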
Upvotes: 1