B M
B M

Reputation: 4019

documents with tags in mongodb: getting tag counts

I have a collection1 of documents with tags in MongoDB. The tags are an embedded array of strings:

{
    name: 'someObj',
    tags: ['tag1', 'tag2', ...]
}

I want to know the count of each tag in the collection. Therefore I have another collection2 with tag counts:

{
    { 
        tag: 'tag1',
        score: 2
    }
    { 
        tag: 'tag2',
        score: 10
    }
}

Now I have to keep both in sync. It is rather trivial when inserting to or removing from collection1. However when I update collection1 I do the following:

1.) get the old document

var oldObj = collection1.find({ _id: id });

2.) calculate the difference between old and new tag arrays

var removedTags = $(oldObj.tags).not(obj.tags).get();
var insertedTags = $(obj.tags).not(oldObj.tags).get();

3.) update the old document

collection1.update(
    { _id: id },
    { $set: obj }
);

4.) update the scores of inserted & removed tags

// increment score of each inserted tag
insertedTags.forEach(function(val, idx) {
    // $inc will set score = 1 on insert
    collection2.update(
        { tag: val },
        { $inc: { score: 1 } },
        { upsert: true }
    )
});
// decrement score of each removed tag
removedTags.forEach(function(val, idx) {
    // $inc will set score = -1 on insert
    collection2.update(
        { tag: val },
        { $inc: { score: -1 } },
        { upsert: true }
    )
});

My questions:

A) Is this approach of keeping book of scores separately efficient? Or is there a more efficient one-time query to get the scores from collection1?

B) Even if keeping book separately is the better choice: can that be done in less steps, e.g. letting mongoDB calculate what tags are new / removed?

Upvotes: 2

Views: 1067

Answers (2)

Markus W Mahlberg
Markus W Mahlberg

Reputation: 20703

The solution, as nickmilion correctly states, would be an aggregation. Though I would do it with a nack: we'll save its results in a collection. What will do is to trade real time results for an extreme speed boost.

How I would do it

More often than not, the need for real time results is overestimated. Hence, I'd go with precalculated stats for the tags and renew it every 5 minutes or so. That should be well enough, since most of such calls are requested async by the client and hence some delay in case the calculation has to be made on a specific request is negligible.

db.tags.aggregate(
  {$unwind:"$tags"},
  {$group: { _id:"$tags", score:{"$sum":1} } },
  {$out:"tagStats"}
)
db.tagStats.update(
  {'lastRun':{$exists:true}},
  {'lastRun':new Date()},
  {upsert:true}
)

db.tagStats.ensureIndex({lastRun:1}, {sparse:true})

Ok, here is the deal. First, we unwind the tags array, group it by the individual tags and increment the score for each occurrence of the respective tag. Next, we upsert lastRun in the tagStats collection, which we can do since MongoDB is schemaless. Next, we create a sparse index, which only holds values for documents in which the indexed field exists. In case the index already exists, ensureIndex is an extremely cheap query; however, since we are going to use that query in our code, we don't need to create the index manually. With this procedure, the following query

db.tagStats.find(
 {lastRun:{ $lte: new Date( ISODate().getTime() - 300000 ) } },
 {_id:0, lastRun:1}
)

becomes a covered query: A query which is answered from the index, which tends to reside in RAM, making this query lightning fast (slightly less than 0.5 msecs median in my tests). So what does this query do? It will return a record when the last run of the aggregation was run more than 5 minutes ( 5*60*1000 = 300000 msecs) ago. Of course, you can adjust this to your needs.

Now, we can wrap it up:

var hasToRun = db.tagStats.find(
  {lastRun:{ $lte: new Date( ISODate().getTime() - 300000 ) } },
  {_id:0, lastRun:1}
);

if(hasToRun){

  db.tags.aggregate(
    {$unwind:"$tags"},
    {$group: {_id:"$tags", score:{"$sum":1} } },
    {$out:"tagStats"}
  )

  db.tagStats.update(
    {'lastRun':{$exists:true}},
    {'lastRun':new Date()},
    {upsert:true}
  );

  db.tagStats.ensureIndex({lastRun:1},{sparse:true});

}
// For all stats
var tagsStats = db.tagStats.find({score:{$exists:true}});
// score for a specific tag
var scoreForTag = db.tagStats.find({score:{$exists:true},_id:"tag1"});

Alternative approach

If real time results really matter and you need the stats for all the tags, simply use the aggregation without saving it to another collection:

db.tags.aggregate(
  {$unwind:"$tags"},
  {$group: { _id:"$tags", score:{"$sum":1} } },
)    

If you only need the results for one specific tag at a time, a real time approach could be to use a special index, create a covered query and simply count the results:

db.tags.ensureIndex({tags:1})
var numberOfOccurences = db.tags.find({tags:"tag1"},{_id:0,tags:1}).count();

Upvotes: 3

nickmilon
nickmilon

Reputation: 1372

answering your questions:

  • B): you don't have to calculate the dif yourself use $addToSet
  • A): you can get the counts via aggregation framework with a combination of $unwind and $count

Upvotes: 0

Related Questions