Brett
Brett

Reputation: 12007

MongoDB mapReduce() query to aggregate data into a list inside a record

I have a MongoDB collection with records of the form:

{
    "_id" : ObjectId("562d6d9c3a2e9c0adbb02f14"),
    "slug" : "1:955553",
    "subslug" : "1:955553:02",
    "score" : "0.615",
    "position_start" : "1",
    "position_end" : 955553,
    "name" : "AGRN",
    "ref" : "A"
},
{
    "_id" : ObjectId("562d6d9c3a2e9c0adbb02f15"),
    "slug" : "2:15553",
    "subslug" : "2:15553:01",
    "score" : "0.915",
    "position_start" : "1002",
    "position_end" : 15553,
    "name" : "MMFR",
    "ref" : "C"
}
{
    "_id" : ObjectId("562d6d9c3a2e9c0adbb02f16"),
    "slug" : "1:955553",
    "subslug" : "1:955553:01",
    "score" : "0.715",
    "position_start" : 1,
    "position_end" : 955553,
    "name" : "AGRN",
    "ref" : "A"
},

I would like to aggregate this collection and group by slug (notice the first and third records here have the same slug).

I am trying to aggregate my data into a new collection that looks like:

{
    "_id" : "<?>",
    "slug" : "1:955553",
    "components" : [ 
       { 
         "subslug": "1:955553:01",
         "score": 0.615,
         "position_start": 1,
         "position_end": 955553,
         "name": AGRN,
         "ref": "A"
       },
       { 
         "subslug": "1:955553:02",
         "score": 0.715,
         "position_start": 1,
         "position_end": 955553,
         "name": AGRN,
         "ref": "A"
       },

    ]
},
{
    "_id" : "<?>",
    "slug" : "2:15553",
    "components" : [ 
       { 
         "subslug": "2:15553:01",
         "score": 0.915,
         "position_start": 1002,
         "position_end": 15553,
         "name": MMFR,
         "ref": "C"
       }
    ]
}

I am trying to use mapReduce() to accomplish this, but I am having some difficulties working out the details of the functions.

I have the following query:

db.getCollection('vest').mapReduce(
    function() {
        emit(this.slug, { 'components': {$push: this } });
    },
    function(key, components) {
        return components[0];
    },
    {
        out: 'mytable'
    }
)

Yet this unfortunately builds a table that looks like:

{
    "_id" : "1:955553",
    "value" : {
        "components" : {
            "$push" : {
                "_id" : ObjectId("562d6d9c3a2e9c0adbb02f14"),
                "slug" : "1:955553",
                "subslug" : "1:955553:01",
                "position_start" : 1,
                "position_end" : 955553,
                "gene" : "AGRN",
                "ref" : "A"
            }
        }
    }
}

And this is not what I need. I was trying to use $push to append a components array, but clearly $push is not respected in mapReduce().

Can anyone give me any pointers on how to take the input collection data above and create the desired output collection? Am I on the right track with my mapReduce() query?

Upvotes: 1

Views: 748

Answers (2)

Sede
Sede

Reputation: 61273

You really don't and should use mapReduce for this. You should instead you the .aggregate() method which provides access to the aggregation pipeline. All you need is $group your documents by "slug" and use the $push accumulator operator to return an array of all other fields. The $project stage is used to exclude the `_id' field from your aggregation result.

That being said you can use the $out operator the resulting documents of the aggregation pipeline to another collection as mentioned in @chridam's answer but because

You cannot specify a sharded collection as the output collection. The input collection for a pipeline can be sharded.

The $out operator cannot write results to a capped collection.

You should use "Bulk" operations to write you result to a new collections.

var bulk = db.newcollection.initializeUnorderedBulkOp();
db.collection.aggregate([
    { "$group": { 
        "_id": "$slug", 
        "components": {
            "$push": {
                "subslug": "$subslug",
                "score": "$score", 
                "position_start": "$position_start", 
                "position_end": "$position_end",
                "name": "$name", 
                "ref": "$ref"
            }
        }
    }},
    { "$project": { 
        "slug": "$_id",
        "components": 1, 
        "_id": 0
    }}
]).forEach(function(doc) {
        bulk.insert(doc);
})

bulk.execute();

Then db.newcollection.find() yields something like this:

{
        "_id" : ObjectId("563bc8a6bf93306f8f6638ce"),
        "components" : [
                {
                        "slug" : "1:955553",
                        "subslug" : "1:955553:02",
                        "score" : "0.615",
                        "position_start" : "1",
                        "position_end" : 955553,
                        "name" : "AGRN",
                        "ref" : "A"
                },
                {
                        "slug" : "1:955553",
                        "subslug" : "1:955553:01",
                        "score" : "0.715",
                        "position_start" : 1,
                        "position_end" : 955553,
                        "name" : "AGRN",
                        "ref" : "A"
                }
        ],
        "slug" : "1:955553"
}

Upvotes: 1

chridam
chridam

Reputation: 103445

Best to use the aggregation framework for such operation which should be multiple times faster than the map-reduce operation.

Typically you would construct an aggregation pipeline that consists of 3 stages:

  • $group stage - This pipeline step groups the documents by the slug field as its key, then apply the accumulator operator $push to create the components array that is a result of applying an expression to each document in the above group.
  • $project stage - This will reshape each document in the stream, such as by adding new fields or removing existing fields.
  • $out stage - This final step will writes the resulting documents of the aggregation pipeline to a new collection.

So, using the above operation, running the following aggregation pipeline will give you the desired result in a new collection called mytable:

db.vest.aggregate([
    {
        "$group": {
            "_id": "$slug",
            "components": {
                "$push": {
                    "subslug": "$subslug",
                     "score": "$score",
                     "position_start": "$position_start",
                     "position_end": "$position_end",
                     "name": "$name",
                     "ref": "$ref"
                }
            }
        }
    },
    {
        "$project": {
            "_id": 0, "slug": "$_id", "components": 1
        }
    },
    { "$out": "mytable" }
])

Querying this collection with the above sample data

db.mytable.find()

will give you the desired output:

Sample Output:

/* 0 */
{
    "_id" : ObjectId("563bc608d1f71f49c3d6c80b"),
    "components" : [ 
        {
            "subslug" : "2:15553:01",
            "score" : "0.915",
            "position_start" : "1002",
            "position_end" : 15553,
            "name" : "MMFR",
            "ref" : "C"
        }
    ],
    "slug" : "2:15553"
}

/* 1 */
{
    "_id" : ObjectId("563bc608d1f71f49c3d6c80c"),
    "components" : [ 
        {
            "subslug" : "1:955553:02",
            "score" : "0.615",
            "position_start" : "1",
            "position_end" : 955553,
            "name" : "AGRN",
            "ref" : "A"
        }, 
        {
            "subslug" : "1:955553:01",
            "score" : "0.715",
            "position_start" : 1,
            "position_end" : 955553,
            "name" : "AGRN",
            "ref" : "A"
        }
    ],
    "slug" : "1:955553"
}

Upvotes: 2

Related Questions