Reputation: 4019
I have a collection1
of documents with tags in MongoDB. The tags are an embedded array of strings:
{
name: 'someObj',
tags: ['tag1', 'tag2', ...]
}
I want to know the count of each tag in the collection. Therefore I have another collection2
with tag counts:
{
{
tag: 'tag1',
score: 2
}
{
tag: 'tag2',
score: 10
}
}
Now I have to keep both in sync. It is rather trivial when inserting to or removing from collection1
. However when I update collection1
I do the following:
1.) get the old document
var oldObj = collection1.find({ _id: id });
2.) calculate the difference between old and new tag arrays
var removedTags = $(oldObj.tags).not(obj.tags).get();
var insertedTags = $(obj.tags).not(oldObj.tags).get();
3.) update the old document
collection1.update(
{ _id: id },
{ $set: obj }
);
4.) update the scores of inserted & removed tags
// increment score of each inserted tag
insertedTags.forEach(function(val, idx) {
// $inc will set score = 1 on insert
collection2.update(
{ tag: val },
{ $inc: { score: 1 } },
{ upsert: true }
)
});
// decrement score of each removed tag
removedTags.forEach(function(val, idx) {
// $inc will set score = -1 on insert
collection2.update(
{ tag: val },
{ $inc: { score: -1 } },
{ upsert: true }
)
});
My questions:
A) Is this approach of keeping book of scores separately efficient? Or is there a more efficient one-time query to get the scores from collection1?
B) Even if keeping book separately is the better choice: can that be done in less steps, e.g. letting mongoDB calculate what tags are new / removed?
Upvotes: 2
Views: 1067
Reputation: 20703
The solution, as nickmilion correctly states, would be an aggregation. Though I would do it with a nack: we'll save its results in a collection. What will do is to trade real time results for an extreme speed boost.
More often than not, the need for real time results is overestimated. Hence, I'd go with precalculated stats for the tags and renew it every 5 minutes or so. That should be well enough, since most of such calls are requested async by the client and hence some delay in case the calculation has to be made on a specific request is negligible.
db.tags.aggregate(
{$unwind:"$tags"},
{$group: { _id:"$tags", score:{"$sum":1} } },
{$out:"tagStats"}
)
db.tagStats.update(
{'lastRun':{$exists:true}},
{'lastRun':new Date()},
{upsert:true}
)
db.tagStats.ensureIndex({lastRun:1}, {sparse:true})
Ok, here is the deal. First, we unwind the tags array, group it by the individual tags and increment the score for each occurrence of the respective tag. Next, we upsert lastRun
in the tagStats
collection, which we can do since MongoDB is schemaless. Next, we create a sparse index, which only holds values for documents in which the indexed field exists. In case the index already exists, ensureIndex
is an extremely cheap query; however, since we are going to use that query in our code, we don't need to create the index manually. With this procedure, the following query
db.tagStats.find(
{lastRun:{ $lte: new Date( ISODate().getTime() - 300000 ) } },
{_id:0, lastRun:1}
)
becomes a covered query: A query which is answered from the index, which tends to reside in RAM, making this query lightning fast (slightly less than 0.5 msecs median in my tests). So what does this query do? It will return a record when the last run of the aggregation was run more than 5 minutes ( 5*60*1000 = 300000 msecs) ago. Of course, you can adjust this to your needs.
Now, we can wrap it up:
var hasToRun = db.tagStats.find(
{lastRun:{ $lte: new Date( ISODate().getTime() - 300000 ) } },
{_id:0, lastRun:1}
);
if(hasToRun){
db.tags.aggregate(
{$unwind:"$tags"},
{$group: {_id:"$tags", score:{"$sum":1} } },
{$out:"tagStats"}
)
db.tagStats.update(
{'lastRun':{$exists:true}},
{'lastRun':new Date()},
{upsert:true}
);
db.tagStats.ensureIndex({lastRun:1},{sparse:true});
}
// For all stats
var tagsStats = db.tagStats.find({score:{$exists:true}});
// score for a specific tag
var scoreForTag = db.tagStats.find({score:{$exists:true},_id:"tag1"});
If real time results really matter and you need the stats for all the tags, simply use the aggregation without saving it to another collection:
db.tags.aggregate(
{$unwind:"$tags"},
{$group: { _id:"$tags", score:{"$sum":1} } },
)
If you only need the results for one specific tag at a time, a real time approach could be to use a special index, create a covered query and simply count the results:
db.tags.ensureIndex({tags:1})
var numberOfOccurences = db.tags.find({tags:"tag1"},{_id:0,tags:1}).count();
Upvotes: 3
Reputation: 1372
answering your questions:
Upvotes: 0