Reputation: 12007
I have a MongoDB collection with records of the form:
{
"_id" : ObjectId("562d6d9c3a2e9c0adbb02f14"),
"slug" : "1:955553",
"subslug" : "1:955553:02",
"score" : "0.615",
"position_start" : "1",
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
},
{
"_id" : ObjectId("562d6d9c3a2e9c0adbb02f15"),
"slug" : "2:15553",
"subslug" : "2:15553:01",
"score" : "0.915",
"position_start" : "1002",
"position_end" : 15553,
"name" : "MMFR",
"ref" : "C"
}
{
"_id" : ObjectId("562d6d9c3a2e9c0adbb02f16"),
"slug" : "1:955553",
"subslug" : "1:955553:01",
"score" : "0.715",
"position_start" : 1,
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
},
I would like to aggregate this collection and group by slug (notice the first and third records here have the same slug).
I am trying to aggregate my data into a new collection that looks like:
{
"_id" : "<?>",
"slug" : "1:955553",
"components" : [
{
"subslug": "1:955553:01",
"score": 0.615,
"position_start": 1,
"position_end": 955553,
"name": AGRN,
"ref": "A"
},
{
"subslug": "1:955553:02",
"score": 0.715,
"position_start": 1,
"position_end": 955553,
"name": AGRN,
"ref": "A"
},
]
},
{
"_id" : "<?>",
"slug" : "2:15553",
"components" : [
{
"subslug": "2:15553:01",
"score": 0.915,
"position_start": 1002,
"position_end": 15553,
"name": MMFR,
"ref": "C"
}
]
}
I am trying to use mapReduce() to accomplish this, but I am having some difficulties working out the details of the functions.
I have the following query:
db.getCollection('vest').mapReduce(
function() {
emit(this.slug, { 'components': {$push: this } });
},
function(key, components) {
return components[0];
},
{
out: 'mytable'
}
)
Yet this unfortunately builds a table that looks like:
{
"_id" : "1:955553",
"value" : {
"components" : {
"$push" : {
"_id" : ObjectId("562d6d9c3a2e9c0adbb02f14"),
"slug" : "1:955553",
"subslug" : "1:955553:01",
"position_start" : 1,
"position_end" : 955553,
"gene" : "AGRN",
"ref" : "A"
}
}
}
}
And this is not what I need. I was trying to use $push
to append a components
array, but clearly $push
is not respected in mapReduce()
.
Can anyone give me any pointers on how to take the input collection data above and create the desired output collection? Am I on the right track with my mapReduce()
query?
Upvotes: 1
Views: 748
Reputation: 61273
You really don't and should use mapReduce for this. You should instead you the .aggregate()
method which provides access to the aggregation pipeline. All you need is $group
your documents by "slug" and use the $push
accumulator operator to return an array of all other fields. The $project
stage is used to exclude the `_id' field from your aggregation result.
That being said you can use the $out
operator the resulting documents of the aggregation pipeline to another collection as mentioned in @chridam's answer but because
You cannot specify a sharded collection as the output collection. The input collection for a pipeline can be sharded.
The $out operator cannot write results to a capped collection.
You should use "Bulk" operations to write you result to a new collections.
var bulk = db.newcollection.initializeUnorderedBulkOp();
db.collection.aggregate([
{ "$group": {
"_id": "$slug",
"components": {
"$push": {
"subslug": "$subslug",
"score": "$score",
"position_start": "$position_start",
"position_end": "$position_end",
"name": "$name",
"ref": "$ref"
}
}
}},
{ "$project": {
"slug": "$_id",
"components": 1,
"_id": 0
}}
]).forEach(function(doc) {
bulk.insert(doc);
})
bulk.execute();
Then db.newcollection.find()
yields something like this:
{
"_id" : ObjectId("563bc8a6bf93306f8f6638ce"),
"components" : [
{
"slug" : "1:955553",
"subslug" : "1:955553:02",
"score" : "0.615",
"position_start" : "1",
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
},
{
"slug" : "1:955553",
"subslug" : "1:955553:01",
"score" : "0.715",
"position_start" : 1,
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
}
],
"slug" : "1:955553"
}
Upvotes: 1
Reputation: 103445
Best to use the aggregation framework
for such operation which should be multiple times faster than the map-reduce operation.
Typically you would construct an aggregation pipeline that consists of 3 stages:
$group
stage - This pipeline step groups the documents by the slug
field as its key, then apply the accumulator operator $push
to create the components
array that is a result of applying an expression to each document in the above group.$project
stage - This will reshape each document in the stream, such as by adding new fields or removing existing fields.$out
stage - This final step will writes the resulting documents of the aggregation pipeline to a new collection.So, using the above operation, running the following aggregation pipeline will give you the desired result in a new collection called mytable
:
db.vest.aggregate([
{
"$group": {
"_id": "$slug",
"components": {
"$push": {
"subslug": "$subslug",
"score": "$score",
"position_start": "$position_start",
"position_end": "$position_end",
"name": "$name",
"ref": "$ref"
}
}
}
},
{
"$project": {
"_id": 0, "slug": "$_id", "components": 1
}
},
{ "$out": "mytable" }
])
Querying this collection with the above sample data
db.mytable.find()
will give you the desired output:
Sample Output:
/* 0 */
{
"_id" : ObjectId("563bc608d1f71f49c3d6c80b"),
"components" : [
{
"subslug" : "2:15553:01",
"score" : "0.915",
"position_start" : "1002",
"position_end" : 15553,
"name" : "MMFR",
"ref" : "C"
}
],
"slug" : "2:15553"
}
/* 1 */
{
"_id" : ObjectId("563bc608d1f71f49c3d6c80c"),
"components" : [
{
"subslug" : "1:955553:02",
"score" : "0.615",
"position_start" : "1",
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
},
{
"subslug" : "1:955553:01",
"score" : "0.715",
"position_start" : 1,
"position_end" : 955553,
"name" : "AGRN",
"ref" : "A"
}
],
"slug" : "1:955553"
}
Upvotes: 2