Reputation: 1079
I'm trying to learn the mapReduce function in MongoDB. Instead of using aggregation, I want to group the documents in a collection by a key I define myself, using mapReduce.
My collection Cool is:
/* 1 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a55"), "id" : "a", "cool" : "a1" }
/* 2 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a56"), "id" : "a", "cool" : "a2" }
/* 3 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a57"), "id" : "b", "cool" : "b1" }
/* 4 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a58"), "id" : "b", "cool" : "b2" }
/* 5 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a59"), "id" : "c", "cool" : "c1" }
/* 6 */ { "_id" : ObjectId("55d5e7287e41390ea7e83a5a"), "id" : "d", "cool" : "d1" }
Here is my MapReduce function:
db.Cool.mapReduce(
    function(){emit(this.id, this.cool)},
    function(key, values){
        var res = [];
        values.forEach(function(v){
            res.push(v);
        });
        return {cools: res};
    },
    {out: "MapReduce"}
)
I want to get results like this:
/* 1 */ { "_id" : "a", "value" : { "cools" : [ "a1", "a2" ] } }
But the output collection contains:
/* 1 */ { "_id" : "a", "value" : { "cools" : [ "a1", "a2" ] } }
/* 2 */ { "_id" : "b", "value" : { "cools" : [ "b1", "b2" ] } }
/* 3 */ { "_id" : "c", "value" : "c1" }
/* 4 */ { "_id" : "d", "value" : "d1" }
The question is: why is there a difference between the result for "id":"a" (which has more than one document) and the result for "id":"c" (which has only one document)?
Thanks for any suggestions, and sorry for my bad English.
Upvotes: 0
Views: 1650
Reputation: 50436
In your learning you might have missed the core manual page on mapReduce. There is one vital piece of information there that you appear to have missed:
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
And then a bit after that:
the type of the return object must be identical to the type of the value emitted by the map function.
So what that basically means is that because the "reducer" does not necessarily process "all" of the values for a key at once, it expects the same "input" as it gives "output", since that output can be fed back into the reducer again.
For the same reason the "mapper" needs to output exactly what is expected as the "reducer" output, which is also the reducer "input". So you don't actually "change" the data structure at all, but just "reduce" it instead.
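db.Cool.mapReduce(
    // Emit the same shape the reducer returns: an object holding an array
    function(){emit(this.id, { "cools": [this.cool] })},
    function(key, values){
        var res = [];
        // Each value is { cools: [...] }, whether it came from map or from an earlier reduce
        values.forEach(function(cool){
            cool.cools.forEach(function(v) {
                res.push(v);
            });
        });
        return {cools: res};
    },
    {out: "MapReduce"}
)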
Now the reducer handles its input in the same form as its output, an object holding an array, so the expected results are returned.
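To see why the shapes must match, the re-reduce behaviour can be imitated in plain JavaScript without a server. This is only a sketch: reduceOriginal and reduceFixed are names chosen here for the reducer from the question and the reducer from this answer, "a3" is a made-up extra value, and the nested calls merely stand in for MongoDB passing an earlier reduce result back in. When that actually happens is up to the server.

```javascript
// Reducer from the question: takes strings, returns an object.
function reduceOriginal(key, values) {
  var res = [];
  values.forEach(function (v) { res.push(v); });
  return { cools: res };
}

// Reducer from this answer: takes and returns { cools: [...] }.
function reduceFixed(key, values) {
  var res = [];
  values.forEach(function (cool) {
    cool.cools.forEach(function (v) { res.push(v); });
  });
  return { cools: res };
}

// Imitate a re-reduce: the result for the first two values is fed back in
// together with a third (hypothetical) value.
var brokenPartial = reduceOriginal("a", ["a1", "a2"]);
var broken = reduceOriginal("a", [brokenPartial, "a3"]);

var fixedPartial = reduceFixed("a", [{ cools: ["a1"] }, { cools: ["a2"] }]);
var fixed = reduceFixed("a", [fixedPartial, { cools: ["a3"] }]);

console.log(JSON.stringify(broken)); // {"cools":[{"cools":["a1","a2"]},"a3"]}
console.log(JSON.stringify(fixed));  // {"cools":["a1","a2","a3"]}
```

The first result nests an earlier output inside the array, while the second stays flat no matter how the values are split across calls.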
The next thing to learn is that in most cases mapReduce is not really what you want to use, and that you should be using the aggregation framework instead.
As opposed to mapReduce, the aggregation framework uses "natively coded" operators and does not need JavaScript interpretation to run. That largely means it is "faster" and often a lot simpler to construct.
Here is the same operation with .aggregate():
db.Cool.aggregate([
    { "$group": {
        "_id": "$id",
        "cools": { "$push": "$cool" }
    }}
])
Same thing, less coding and a lot faster.
To output to another collection, use $out:
db.Cool.aggregate([
    { "$group": {
        "_id": "$id",
        "cools": { "$push": "$cool" }
    }},
    { "$out": "reduced" }
])
For the record, here is the mapReduce output:
{ "_id" : "a", "value" : { "cools" : [ "a1", "a2" ] } }
{ "_id" : "b", "value" : { "cools" : [ "b1", "b2" ] } }
{ "_id" : "c", "value" : { "cools" : [ "c1" ] } }
{ "_id" : "d", "value" : { "cools" : [ "d1" ] } }
And here is the aggregate output. Apart from the mandatory _id and value fields of the mapReduce output, the only difference is that the keys come back in reverse order, since $group does not guarantee an order (though it is generally observed as a reverse stack):
{ "_id" : "d", "cools" : [ "d1" ] }
{ "_id" : "c", "cools" : [ "c1" ] }
{ "_id" : "b", "cools" : [ "b1", "b2" ] }
{ "_id" : "a", "cools" : [ "a1", "a2" ] }
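If a predictable order matters, one option is to add a $sort stage after $group. A minimal sketch, with the pipeline held in a variable (the name pipeline is chosen here only for illustration) so it can be passed to .aggregate():

```javascript
// Group the "cool" values per "id", then sort by the grouped key, ascending.
var pipeline = [
  { "$group": {
    "_id": "$id",
    "cools": { "$push": "$cool" }
  }},
  { "$sort": { "_id": 1 } }
];

// In the mongo shell: db.Cool.aggregate(pipeline)
```

This only orders the output documents by _id. It makes no promise about the order of the values inside each cools array.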
Upvotes: 3
Reputation: 7920
The value emitted by your map function and the value returned by your reduce function need to have the same structure. Otherwise, keys with only a single value will appear in the output exactly as emitted by the map function. This happens because of an optimization: the reduce function is not executed for keys that have only a single value emitted in the map phase. Here is how you can do it:
db.Cool.mapReduce(
    function () {
        emit(this.id, {cools: [this.cool]}) // same data structure as in your reduce function
    },
    function (key, values) {
        var res = {cools: []}; // same data structure as the value of map phase
        values.forEach(function (v) {
            res.cools = res.cools.concat(v.cools);
        });
        return res;
    },
    {out: "MapReduce"}
)
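The single-value shortcut can be imitated with a small in-memory stand-in. This is a rough sketch, not MongoDB's implementation: simulateMapReduce is a hypothetical helper, and emit is passed to map as an argument here instead of being a global as it is in the shell.

```javascript
// Hypothetical in-memory stand-in for mapReduce, for illustration only.
function simulateMapReduce(docs, map, reduce) {
  var groups = {};
  function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
  }
  docs.forEach(function (doc) {
    map.call(doc, emit); // `this` inside map is the current document
  });
  return Object.keys(groups).map(function (key) {
    var values = groups[key];
    // Imitates the optimization: a key with one emitted value skips reduce.
    var value = values.length === 1 ? values[0] : reduce(key, values);
    return { _id: key, value: value };
  });
}

var docs = [
  { id: "a", cool: "a1" }, { id: "a", cool: "a2" },
  { id: "b", cool: "b1" }, { id: "b", cool: "b2" },
  { id: "c", cool: "c1" }, { id: "d", cool: "d1" }
];

// Map and reduce as in the question: map emits a string, reduce returns an object.
var mismatched = simulateMapReduce(
  docs,
  function (emit) { emit(this.id, this.cool); },
  function (key, values) {
    var res = [];
    values.forEach(function (v) { res.push(v); });
    return { cools: res };
  }
);

// Map and reduce as in this answer: both use { cools: [...] }.
var matched = simulateMapReduce(
  docs,
  function (emit) { emit(this.id, { cools: [this.cool] }); },
  function (key, values) {
    var res = { cools: [] };
    values.forEach(function (v) { res.cools = res.cools.concat(v.cools); });
    return res;
  }
);

console.log(JSON.stringify(mismatched[2])); // {"_id":"c","value":"c1"}
console.log(JSON.stringify(matched[2]));    // {"_id":"c","value":{"cools":["c1"]}}
```

With the mismatched pair, the value for "c" is the raw emitted string. With the matched pair, it already has the final shape even though reduce never ran for that key.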
Upvotes: 2