Reputation: 23965
I have a bunch of documents in a MongoDB collection, each of which looks like this:
{
    "_id" : ObjectId("539f5556e4b032123458ba30"),
    "name" : "H0031324836",
    "date" : ISODate("2014-04-01T04:00:00Z"),
    "dateString" : "2014-04-01",
    "elements" : [
        {
            "start_time" : ISODate("2014-04-01T15:00:00Z"),
            "end_time" : ISODate("2014-04-01T16:00:00Z"),
            "duration" : NumberLong(3600000),
            "value" : 0.6968
        },
        {
            "start_time" : ISODate("2014-04-01T16:00:00Z"),
            "end_time" : ISODate("2014-04-01T17:00:00Z"),
            "duration" : NumberLong(3600000),
            "value" : 1.4873
        },
        // ...
    ]
}
For each of these documents, I want (through the aggregation framework, ideally) to end up with a document like this:
{
    "_id" : ObjectId("539f5556e4b032123458ba30"),
    "name" : "H0031324836",
    "date" : ISODate("2014-04-01T04:00:00Z"),
    "dateString" : "2014-04-01",
    "duration" : NumberLong(...blahblah...), // sum of all "elements.duration" fields
    "value" : ...blahblah...                 // sum of all "elements.value" fields
}
I'm not seeing a way to vectorize over the $elements array and pick out values, though. Perhaps $unwind is an option, but that seems pretty inefficient if it truly explodes each document into a stream of documents just so I can implode them again.
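As I understand it, $unwind would emit one document per array element, with "elements" demoted to a single subdocument and all the top-level fields copied along, roughly like this:

// One of these per entry in the original "elements" array:
{
    "_id" : ObjectId("539f5556e4b032123458ba30"),
    "name" : "H0031324836",
    "date" : ISODate("2014-04-01T04:00:00Z"),
    "dateString" : "2014-04-01",
    "elements" : {
        "start_time" : ISODate("2014-04-01T15:00:00Z"),
        "end_time" : ISODate("2014-04-01T16:00:00Z"),
        "duration" : NumberLong(3600000),
        "value" : 0.6968
    }
}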
The collection is large (around half a billion docs now, will be several billion when the full data is loaded), so I'm hoping to use the aggregation framework and not MapReduce if possible.
I've got MongoDB 2.6.0, on a hash-sharded collection with 8 shards.
Upvotes: 0
Views: 1479
Reputation: 9582
$unwind is the answer, as you have stated. It is the way something like this was intended to be dealt with.
First, a bit about $unwind. Note that the aggregation framework will only use your collection's indexes up to the point in the pipeline where it mutates the documents, so be sure to do the bulk of your filtering first. More on that in the SO answer for "Aggregation pipeline and indexes". $unwind performs well because it is an optimized, native part of MongoDB: it runs in C++ inside the aggregation pipeline, so you won't get the degraded performance of interpreted JavaScript (as you would with MapReduce). The MongoDB team has worked hard over several releases to make aggregation fast.
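To check that the leading $match really is served by an index before anything gets unwound, you can use the explain option that 2.6 added for aggregate. A quick sketch (this assumes an index on "name", which the field names in your question suggest; adjust to your actual shard key / index layout):

// Ensure the field used by the leading $match is indexed.
db.collection.ensureIndex({ name: 1 });

// Ask the pipeline to explain itself; the $match stage should report
// an index scan rather than a collection scan.
db.collection.aggregate(
    [ { $match: { name: "H0031324836" } } ],
    { explain: true }
);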
Now what would this pipeline actually look like?
db.collection.aggregate([
    // limit to just this record (or set of records); this stage can use an index
    { $match: { name: "H0031324836" } },
    // explode the array into one document per array element
    { $unwind: "$elements" },
    // regroup the unwound documents based on the `name` field, summing as we go
    {
        $group: {
            _id: "$name",
            duration: { $sum: "$elements.duration" }, // sum elements[n].duration
            value: { $sum: "$elements.value" }        // sum elements[n].value
        }
    }
]);
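If you also want to carry the date and dateString fields through to the result, as in your desired output, one way (a sketch, not tested against your data) is to group on the document's own _id and pick the scalar fields up with $first:

db.collection.aggregate([
    { $match: { name: "H0031324836" } },
    { $unwind: "$elements" },
    {
        $group: {
            _id: "$_id",                           // one result per original document
            name: { $first: "$name" },             // carry scalar fields through
            date: { $first: "$date" },
            dateString: { $first: "$dateString" },
            duration: { $sum: "$elements.duration" },
            value: { $sum: "$elements.value" }
        }
    }
]);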
See the SO answer for "$unwind an object in aggregation framework" for more insight into the pipeline.
Aggregation will definitely take advantage of your 8 shards. From "Aggregation Introduction - Additional Features and Behaviors":
The aggregation pipeline supports operations on sharded collections. See Aggregation Pipeline and Sharded Collections.
Upvotes: 4