tweak2

Reputation: 656

Get count of siblings in subdocument with mongodb aggregate query

I have a document collection with a subdocument of tags.

{
    title:"my title",
    slug:"my-title",
    tags:[
        {tagname:'tag1', id:1},
        {tagname:'tag2', id:2},
        {tagname:'tag3', id:3}]
}
{
    title:"my title2",
    slug:"my-title2",
    tags:[
        {tagname:'tag1', id:1},
        {tagname:'tag2', id:2}]
}
{
    title:"my title3",
    slug:"my-title3",
    tags:[
        {tagname:'tag1', id:1},
        {tagname:'tag3', id:3}]
}
{
    title:"my title4",
    slug:"my-title4",
    tags:[
        {tagname:'tag1', id:1},
        {tagname:'tag2', id:2},
        {tagname:'tag3', id:3}]
}

[...]

Getting a count of every tag is quite simple with an $unwind followed by a $group count in an aggregate query.
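As a minimal sketch of that per-tag count: the pipeline below is written for the mongo shell, the collection name is a placeholder, and the $sort stage is an addition to order the tags by frequency.

```javascript
// One output document per tag name, with the number of documents that carry it.
const tagCountPipeline = [
    // Emit one document per element of the tags array
    { "$unwind": "$tags" },
    // Group by tag name and count the occurrences
    { "$group": { "_id": "$tags.tagname", "count": { "$sum": 1 } } },
    // Most frequent tags first (added here for readability of the output)
    { "$sort": { "count": -1 } }
];

// In the mongo shell (placeholder collection name):
// db.collection.aggregate(tagCountPipeline)
```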

However, I would like to count which tags are found together, or more precisely, which siblings show up most often beside each tag, ordered by count. I have not found an example, nor can I figure out how to do this without multiple queries.

Ideally the end result would be:

{'tag1':{
    'tag2':3, // tag1 and tag2 were found in a document together 3 times
    'tag3':3, // tag1 and tag3 were found in a document together 3 times
    [...]}}

{'tag2':{
    'tag1':3, // tag2 and tag1 were found in a document together 3 times
    'tag3':2, // tag2 and tag3 were found in a document together 2 times
    [...]}}

{'tag3':{
    'tag1':3, // tag3 and tag1 were found in a document together 3 times
    'tag2':2, // tag3 and tag2 were found in a document together 2 times
    [...]}}

[...]
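To make the expected numbers concrete, here is a small plain-JavaScript sketch that reproduces them for the four documents shown above, without MongoDB. The function name countSiblings and the reduction of each tag to its tagname are assumptions made for illustration only.

```javascript
// The four sample documents from the question, reduced to their tag names.
const docs = [
    { slug: "my-title",  tags: ["tag1", "tag2", "tag3"] },
    { slug: "my-title2", tags: ["tag1", "tag2"] },
    { slug: "my-title3", tags: ["tag1", "tag3"] },
    { slug: "my-title4", tags: ["tag1", "tag2", "tag3"] }
];

// For every tag, count how often each other tag appears in the same document.
function countSiblings(documents) {
    const result = {};
    for (const doc of documents) {
        for (const tag of doc.tags) {
            result[tag] = result[tag] || {};
            for (const sibling of doc.tags) {
                if (sibling === tag) continue; // a tag is not its own sibling
                result[tag][sibling] = (result[tag][sibling] || 0) + 1;
            }
        }
    }
    return result;
}

console.log(JSON.stringify(countSiblings(docs)));
// prints {"tag1":{"tag2":3,"tag3":3},"tag2":{"tag1":3,"tag3":2},"tag3":{"tag1":3,"tag2":2}}
```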

Upvotes: 3

Views: 984

Answers (1)

Neil Lunn

Reputation: 151170

As stated earlier, it simply is not possible to have the aggregation framework generate arbitrary key names from data. It's also not possible to do this kind of analysis in a single query.

But there is a general approach to doing this over your whole collection for an undetermined number of tag names. Essentially, you need to get a distinct list of the "tags" and then run another query for each distinct value to get the "siblings" of that tag and their counts.

In general:

// Get the unique tags
db.collection.aggregate([
    { "$unwind": "$tags" },
    { "$group": {
        "_id": "$tags.tagname"
    }}
]).forEach(function(tag) {
    var tagDoc = { };
    tagDoc[tag._id] = {};

    // Get the siblings count for that tag
    db.collection.aggregate([
        { "$match": { "tags.tagname": tag._id } },
        { "$unwind": "$tags" },
        { "$match": { "tags.tagname": { "$ne": tag._id } } },
        { "$group": {
            "_id": "$tags.tagname",
            "count": { "$sum": 1 }
        }}
    ]).forEach(function(sibling) {
        // Set the value in the master document
        tagDoc[tag._id][sibling._id] = sibling.count;
    });
    // Just emitting for example purposes in some way
    printjson(tagDoc);
});

The aggregation framework can return a cursor in releases since MongoDB 2.6, so even with a large number of tags this can work in an efficient way.

So that's the way you would handle this, but there really is no way to have this happen in a single query. For a shorter run time, you might look at frameworks that allow many queries to be run in parallel, either combining the results or emitting them to a stream.
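A rough sketch of the parallel variant, combining the results into one object. It assumes `collection` is a Collection from the official Node.js driver, where aggregate(...).toArray() returns a promise. The function name siblingCounts is hypothetical, and the $sort stage is an addition to order each tag's siblings by count.

```javascript
// Sketch only: `collection` is assumed to be a Node.js driver Collection.
async function siblingCounts(collection) {
    // Distinct tag names, using the same pipeline as the first query above
    const tags = await collection.aggregate([
        { "$unwind": "$tags" },
        { "$group": { "_id": "$tags.tagname" } }
    ]).toArray();

    // Start one sibling query per tag and wait for all of them together
    const entries = await Promise.all(tags.map(async function (tag) {
        const siblings = await collection.aggregate([
            { "$match": { "tags.tagname": tag._id } },
            { "$unwind": "$tags" },
            { "$match": { "tags.tagname": { "$ne": tag._id } } },
            { "$group": { "_id": "$tags.tagname", "count": { "$sum": 1 } } },
            { "$sort": { "count": -1 } }
        ]).toArray();

        // Collect this tag's sibling counts under their tag names
        const counts = {};
        siblings.forEach(function (sibling) {
            counts[sibling._id] = sibling.count;
        });
        return [tag._id, counts];
    }));

    // Combine the per-tag results into a single object
    const result = {};
    entries.forEach(function (entry) {
        result[entry[0]] = entry[1];
    });
    return result;
}
```

This issues every sibling query at once, so with a very large number of tags you would probably want to process the tags in batches rather than all together.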

Upvotes: 2
