Assaf Hershko

Reputation: 1924

MongoDB Aggregation Framework

I have a document that's structured as follows:

{
  '_id' => 'Star Wars',
  'count' => 1234,
  'spelling' => [ ( 'Star wars' => 10, 'Star Wars' => 15, 'sTaR WaRs' => 5) ]
}

I would like to get the top N documents (by descending count), but with only one spelling per document (the one with the highest value). Is there a way to do this with the aggregation framework?

I can easily get the top 10 results (using $sort and $limit). But how do I get only one spelling for each document?

So for example, if I have the following three records:

{
  '_id' => 'star_wars',
  'count' => 1234,
  'spelling' => [ ( 'Star wars' => 10, 'Star Wars' => 15, 'sTaR WaRs' => 5) ]
}
{
  '_id' => 'willow',
  'count' => 2211,
  'spelling' => [ ( 'willow' => 300, 'Willow' => 550) ]
}
{
  '_id' => 'indiana_jones',
  'count' => 12,
  'spelling' => [ ( 'indiana Jones' => 10, 'Indiana Jones' => 25, 'indiana jones' => 5) ]
}

and I ask for the top 2 results, I would like to get:

{
  '_id' => 'willow',
  'count' => 2211,
  'spelling' => 'Willow'
}
{
  '_id' => 'star_wars',
  'count' => 1234,
  'spelling' => 'Star Wars'
}

(or something to this effect)

Thanks!

Upvotes: 0

Views: 369

Answers (1)

WiredPrairie

Reputation: 59763

With your schema as designed, anything other than MapReduce would be difficult, because the spellings are stored as keys of an object rather than as values. So, I adjusted your schema to better match MongoDB's capabilities (shown in JSON format for this example):

{
  '_id' : 'star_wars',
  'count' : 1234,
  'spellings' : [ 
    { spelling: 'Star wars', total: 10}, 
    { spelling: 'Star Wars', total : 15}, 
    { spelling: 'sTaR WaRs', total : 5} ]
}

Note that it's now an array of objects with a specific key name, spelling, and a value for the total (I didn't know what that number actually represented, so I've called it total in my examples).
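
If existing documents are already stored in the original shape, they would need converting first. Here is a minimal sketch in plain JavaScript, assuming the original spelling field is an array holding a single object that maps each spelling to its number. The helper name toSpellingsArray is hypothetical and not part of MongoDB:

```javascript
// Hypothetical helper: converts the question's key-as-value shape into
// the array-of-objects shape used in this answer.
// Assumption: doc.spelling is an array of objects whose keys are
// spellings and whose values are the per-spelling numbers.
function toSpellingsArray(doc) {
  // Merge the object(s) inside the array into one spelling -> total map.
  const bySpelling = Object.assign({}, ...(doc.spelling || []));
  // Turn each key/value pair into { spelling, total }.
  const spellings = Object.entries(bySpelling).map(([spelling, total]) => ({
    spelling,
    total,
  }));
  return { _id: doc._id, count: doc.count, spellings };
}

const legacy = {
  _id: 'star_wars',
  count: 1234,
  spelling: [{ 'Star wars': 10, 'Star Wars': 15, 'sTaR WaRs': 5 }],
};

const converted = toSpellingsArray(legacy);
console.log(JSON.stringify(converted, null, 2));
```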

On to the aggregation:

db.so.aggregate([
    { $unwind: '$spellings' }, 
    { $project: { 
        'spelling' : '$spellings.spelling', 
        'total': '$spellings.total', 
        'count': '$count'  
        }
    }, 
    { $sort : { total : -1 } }, 
    { $group : { _id : '$_id',
        count: { $first: '$count' },
        largest : { $first : '$total' },
        spelling : { $first: '$spelling' }
        }
    }
])

  1. Unwind the spellings array so that each element becomes its own document in the pipeline.
  2. Flatten each document down to the fields the later stages need: the specific spelling, its total, and the count.
  3. Sort by total, descending, so that the final grouping can use $first.
  4. Group by _id so that only the $first (highest-total) spelling is kept for each original document. The count is returned as well; because it was carried through the projection, every temporary document for a given _id contains the same count field.
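
To make the four steps concrete, here is a plain JavaScript sketch that mimics each stage in memory on the three sample documents. It only illustrates the logic and is not how MongoDB executes the pipeline; the function name largestSpellingPerDoc is made up for this example.

```javascript
const docs = [
  { _id: 'star_wars', count: 1234, spellings: [
    { spelling: 'Star wars', total: 10 },
    { spelling: 'Star Wars', total: 15 },
    { spelling: 'sTaR WaRs', total: 5 } ] },
  { _id: 'willow', count: 2211, spellings: [
    { spelling: 'willow', total: 300 },
    { spelling: 'Willow', total: 550 } ] },
  { _id: 'indiana_jones', count: 12, spellings: [
    { spelling: 'indiana Jones', total: 10 },
    { spelling: 'Indiana Jones', total: 25 },
    { spelling: 'indiana jones', total: 5 } ] },
];

// Hypothetical in-memory stand-in for the aggregation above.
function largestSpellingPerDoc(input) {
  // Steps 1 and 2 ($unwind + $project): one flat row per array element.
  const rows = input.flatMap(d =>
    d.spellings.map(s => ({
      _id: d._id, count: d.count, spelling: s.spelling, total: s.total,
    })));

  // Step 3 ($sort): highest total first.
  rows.sort((a, b) => b.total - a.total);

  // Step 4 ($group with $first): keep the first row seen for each _id.
  const groups = new Map();
  for (const r of rows) {
    if (!groups.has(r._id)) {
      groups.set(r._id, {
        _id: r._id, count: r.count, largest: r.total, spelling: r.spelling,
      });
    }
  }
  return [...groups.values()];
}

const grouped = largestSpellingPerDoc(docs);
console.log(grouped);
```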

Results:

[
{
    "_id" : "star_wars",
    "count" : 1234,
    "largest" : 15,
    "spelling" : "Star Wars"
},
{
    "_id" : "indiana_jones",
    "count" : 12,
    "largest" : 25,
    "spelling" : "Indiana Jones"
},
{
    "_id" : "willow",
    "count" : 2211,
    "largest" : 550,
    "spelling" : "Willow"
}
]
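
These results contain every document, in no guaranteed order. To get only the top N by count, as the question asks, one option is to append a $sort on count and a $limit after the $group. This is a sketch, assuming the same so collection and an illustrative N of 2:

```javascript
const N = 2; // number of documents wanted (assumed value for illustration)

const topNPipeline = [
  { $unwind: '$spellings' },
  { $project: {
      spelling: '$spellings.spelling',
      total: '$spellings.total',
      count: '$count'
    }
  },
  // Highest total first, so $first picks the most frequent spelling.
  { $sort: { total: -1 } },
  { $group: {
      _id: '$_id',
      count: { $first: '$count' },
      largest: { $first: '$total' },
      spelling: { $first: '$spelling' }
    }
  },
  // $group does not guarantee output order, so sort by count here.
  { $sort: { count: -1 } },
  { $limit: N }
];

// In the mongo shell: db.so.aggregate(topNPipeline)
```

With the three sample documents, this should return willow followed by star_wars. On a large collection it may also be worth sorting by count and limiting before the $unwind, since count is a document-level field. The final $sort is still needed after the $group.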

Upvotes: 2
