Reputation: 656
I have a collection of documents, each with an array of tag subdocuments.
{
title:"my title",
slug:"my-title",
tags:[
{tagname:'tag1', id:1},
{tagname:'tag2', id:2},
{tagname:'tag3', id:3}]
}
{
title:"my title2",
slug:"my-title2",
tags:[
{tagname:'tag1', id:1},
{tagname:'tag2', id:2}]
}
{
title:"my title3",
slug:"my-title3",
tags:[
{tagname:'tag1', id:1},
{tagname:'tag3', id:3}]
}
{
title:"my title4",
slug:"my-title4",
tags:[
{tagname:'tag1', id:1},
{tagname:'tag2', id:2},
{tagname:'tag3', id:3}]
}
[...]
Getting a count of every tag is quite simple with an $unwind + $group count aggregate.
However, I would like to count which tags are found together or, more precisely, which siblings show up most often beside one another, ordered by count. I have not found an example, nor can I figure out how to do this without multiple queries.
Ideally the end result would be:
{'tag1':{
'tag2':3, // tag1 and tag2 were found in a document together 3 times
'tag3':3, // tag1 and tag3 were found in a document together 3 times
[...]}}
{'tag2':{
'tag1':3, // tag2 and tag1 were found in a document together 3 times
'tag3':2, // tag2 and tag3 were found in a document together 2 times
[...]}}
{'tag3':{
'tag1':3, // tag3 and tag1 were found in a document together 3 times
'tag2':2, // tag3 and tag2 were found in a document together 2 times
[...]}}
[...]
Upvotes: 3
Views: 984
Reputation: 151170
As stated earlier, it is simply not possible to have the aggregation framework generate arbitrary key names from data. It is also not possible to do this kind of analysis in a single query.
But there is a general approach to doing this over your whole collection for an undetermined number of tag names. Essentially you are going to need to get a distinct list of the "tags" and process another query for each distinct value to get the "siblings" to that tag and the counts.
In general:
// Get the unique tags
db.collection.aggregate([
{ "$unwind": "$tags" },
{ "$group": {
"_id": "$tags.tagname"
}}
]).forEach(function(tag) {
var tagDoc = { };
tagDoc[tag._id] = {};
// Get the siblings count for that tag
db.collection.aggregate([
{ "$match": { "tags.tagname": tag._id } },
{ "$unwind": "$tags" },
{ "$match": { "tags.tagname": { "$ne": tag._id } } },
{ "$group": {
"_id": "$tags.tagname",
"count": { "$sum": 1 }
}}
]).forEach(function(sibling) {
// Set the value in the master document
tagDoc[tag._id][sibling._id] = sibling.count;
});
// Just emitting for example purposes in some way
printjson(tagDoc);
});
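To see what the loop above emits without a running database, the same two steps can be mimicked in plain JavaScript on the four sample documents from the question. This is only an in-memory sketch: `docs`, `siblingCounts`, `distinctTags` and `result` are names made up here for illustration and are not part of MongoDB.

```javascript
// Sample documents copied from the question (only the fields used here).
const docs = [
  { slug: "my-title",  tags: [{ tagname: "tag1", id: 1 }, { tagname: "tag2", id: 2 }, { tagname: "tag3", id: 3 }] },
  { slug: "my-title2", tags: [{ tagname: "tag1", id: 1 }, { tagname: "tag2", id: 2 }] },
  { slug: "my-title3", tags: [{ tagname: "tag1", id: 1 }, { tagname: "tag3", id: 3 }] },
  { slug: "my-title4", tags: [{ tagname: "tag1", id: 1 }, { tagname: "tag2", id: 2 }, { tagname: "tag3", id: 3 }] }
];

// Hypothetical helper mirroring the inner pipeline:
// $match on the tag, $unwind, $match with $ne, then $group with $sum.
function siblingCounts(docs, tagname) {
  const counts = {};
  docs
    .filter(doc => doc.tags.some(t => t.tagname === tagname)) // first $match
    .forEach(doc => {
      doc.tags.forEach(t => {                                  // $unwind
        if (t.tagname !== tagname) {                           // $match with $ne
          counts[t.tagname] = (counts[t.tagname] || 0) + 1;    // $group + $sum
        }
      });
    });
  return counts;
}

// Outer step: distinct tag names, then one sibling count per tag.
const distinctTags = [...new Set(docs.flatMap(doc => doc.tags.map(t => t.tagname)))];
const result = {};
distinctTags.forEach(tag => { result[tag] = siblingCounts(docs, tag); });

console.log(JSON.stringify(result));
// {"tag1":{"tag2":3,"tag3":3},"tag2":{"tag1":3,"tag3":2},"tag3":{"tag1":3,"tag2":2}}
```

For these four documents the counts match the result the question asks for.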
Since MongoDB 2.6 the aggregation framework can return a cursor, so this works efficiently even with a large number of tags.
So that is how you would handle this; there really is no way to do it in a single query. For a shorter run time, you might look at frameworks that allow many queries to be run in parallel, either combining the results or emitting them to a stream.
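As a rough sketch of the "combine the results" variant: start all per-tag queries together and merge the rows once every query has resolved. Everything named here is an assumption for illustration. `runSiblingQuery` stands in for whatever your driver call is and is assumed to return a promise of `{ _id, count }` rows; `siblingsInParallel`, `fakeRows` and `fakeQuery` are made up so the example runs without a database.

```javascript
// Hypothetical: runSiblingQuery(tag) is assumed to return a Promise resolving to
// an array of { _id: <sibling tagname>, count: <number> } rows for that tag.
async function siblingsInParallel(tagNames, runSiblingQuery) {
  // Start every per-tag query at once, then wait for all of them.
  const rowsPerTag = await Promise.all(tagNames.map(tag => runSiblingQuery(tag)));

  // Combine the rows into one { tag: { sibling: count } } object.
  const combined = {};
  tagNames.forEach((tag, i) => {
    combined[tag] = {};
    rowsPerTag[i].forEach(row => { combined[tag][row._id] = row.count; });
  });
  return combined;
}

// Stub standing in for a real database call, just to make the sketch runnable.
const fakeRows = {
  tag2: [{ _id: "tag1", count: 3 }, { _id: "tag3", count: 2 }],
  tag3: [{ _id: "tag1", count: 3 }, { _id: "tag2", count: 2 }]
};
const fakeQuery = tag => Promise.resolve(fakeRows[tag]);

siblingsInParallel(["tag2", "tag3"], fakeQuery)
  .then(combined => console.log(JSON.stringify(combined)));
```

In real use the stub would be replaced by a function that runs the sibling pipeline for one tag. How much run time this saves depends on the server and the driver's connection pool.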
Upvotes: 2