Reputation: 3463
First of all it's my first time in Mongo...
Concept:
My words document currently looks like this (example):
{
"date": "date it was inserted",
"reported": 0,
"image_id": "image id",
"image_name": "image name",
"user": "user _id",
"word": "awesome"
}
The words will be duplicated so that each word can be associated to a user...
Problem: I need to perform a Mongo query to allow me to know the most used words (to describe an image) that were not created by a given user. (to meet point 3. above)
I've seen MapReduce algorithm, but from what I read there are a couple of issues with it:
I've thought about running a task at a given time each day that stores, in a document in a different collection, a ranked list of the words that a given user hasn't used to describe the given image. I would have to limit this to 300 results or so (any idea of a proper limit?). Something like:
{
user_id: "the user id",
words: [ // "words" is a hypothetical field name
{word: "test", count: 1000},
{word: "test2", count: 980},
{word: "etc", count: 300}
]
}
Problems I see with this solution are:
Maybe my approach doesn't make any sense... And maybe my lack of experience in Mongo is pointing me at the wrong "schema design".
Any idea of what could be a good approach for this kind of problem?
Sorry for the big post and thanks for your time and help!
João
Upvotes: 3
Views: 1836
Reputation: 1093
As already mentioned, you could use the group command, which is easy to use, but you will need to sort the result on the client side. Also, the result is returned as a single BSON object and so must be fairly small (fewer than 10,000 keys), otherwise you will get an exception.
Code example based on your data structure:
db.words.group({
key : {"word" : true},
initial: {count : 0},
reduce: function(obj, prev) { prev.count++},
cond: {"user" :{ $ne : "USERNAME_TO_IGNORE"}}
})
Another option is to use the new Aggregation framework, which will be released in version 2.2. Each pipeline stage must be its own document, so something like this should work:
db.words.aggregate([
{ $match : { "user" : { "$ne" : "USERNAME_TO_IGNORE" } } },
{ $group : {
_id : "$word",
count : { $sum : 1 }
} }
])
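To make explicit what the $match and $group stages compute, here is a minimal sketch in plain JavaScript over an in-memory array. It is only an illustration, not MongoDB code: the function name topWordsExcludingUser and the sample data are hypothetical, and the sort and limit at the end are an addition that the pipeline above does not perform.

```javascript
// Hypothetical helper (not a MongoDB API): emulates
//   $match { user: { $ne: ignoredUser } }
//   $group { _id: "$word", count: { $sum: 1 } }
// and then sorts by count, descending, keeping the first `limit` entries.
function topWordsExcludingUser(docs, ignoredUser, limit) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.user === ignoredUser) continue; // $match stage
    counts.set(doc.word, (counts.get(doc.word) || 0) + 1); // $group stage
  }
  return Array.from(counts, ([word, count]) => ({ _id: word, count: count }))
    .sort((a, b) => b.count - a.count || a._id.localeCompare(b._id))
    .slice(0, limit);
}

// Made-up sample data, shaped like the words documents in the question.
const sampleWords = [
  { user: "u1", word: "awesome" },
  { user: "u2", word: "awesome" },
  { user: "u3", word: "awesome" },
  { user: "u2", word: "sunny" },
  { user: "u1", word: "blue" },
];

console.log(topWordsExcludingUser(sampleWords, "u1", 10));
```

Note that this counts a word used by other users even if the ignored user has used it too, which is also how the $ne filter behaves.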
Or you can still use MapReduce. You can actually limit and sort the output, because the result is a collection: just use .sort() and .limit() on it. You can also use the incremental map-reduce output option, which will help with your performance issues. Have a look at the out parameter of MapReduce.
Below is an example that uses the incremental feature to merge new data into the existing words_usage collection:
m = function() {
emit(this.word, {count: 1});
};
r = function( key , values ){
var sum = 0;
values.forEach(function(doc) {
sum += doc.count;
});
return {count: sum};
};
db.runCommand({
mapreduce : "words",
map : m,
reduce : r,
out : { reduce: "words_usage"},
query : <query filter object>
})
// retrieve the top 10 words
db.words_usage.find().sort({"value.count" : -1}).limit(10)
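With out : { reduce: ... }, when a key already exists in the output collection, the reduce function is applied again to the stored value and the newly computed one. That only works if reduce returns the same shape it receives. The sketch below checks this in plain JavaScript for the reduce function above (repeated here so the snippet runs on its own); the variable names and counts are made up for illustration.

```javascript
// Same reduce function as in the map-reduce example, repeated for self-containment.
const r = function (key, values) {
  let sum = 0;
  values.forEach(function (doc) {
    sum += doc.count;
  });
  return { count: sum };
};

// First run: three documents emitted {count: 1} for "awesome".
const stored = r("awesome", [{ count: 1 }, { count: 1 }, { count: 1 }]);

// Later run over new documents only: two more emits for the same word.
const fresh = r("awesome", [{ count: 1 }, { count: 1 }]);

// Merge step: reduce is called on the stored value and the fresh one.
const merged = r("awesome", [stored, fresh]);

console.log(merged);
```

Because the output of r is itself a valid input to r, the merged count equals the count you would get by reducing all five emits at once.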
I guess you can run the above MapReduce command from cron every few minutes or hours, depending on how accurate you want the results to be. For the query criteria of each incremental run, you can use the creation date of the words documents.
Once you have the system-wide top words collection, you can build per-user top words from it, or just compute them in real time (depending on the system size).
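One possible way to derive a per-user list from the system-wide ranking is to drop the words that user has already used and keep the first N. A minimal sketch in plain JavaScript, assuming the ranking has the shape map-reduce writes to words_usage ({ _id: word, value: { count } }); the function name topWordsForUser and the sample data are hypothetical.

```javascript
// Hypothetical helper: filters a system-wide ranking down to the words
// a given user has not used yet, highest count first.
function topWordsForUser(ranking, userWords, limit) {
  const used = new Set(userWords);
  return ranking
    .filter((entry) => !used.has(entry._id)) // skip words the user already used
    .sort((a, b) => b.value.count - a.value.count)
    .slice(0, limit);
}

// Made-up ranking, shaped like documents in the words_usage collection.
const ranking = [
  { _id: "awesome", value: { count: 1000 } },
  { _id: "sunny", value: { count: 980 } },
  { _id: "blue", value: { count: 300 } },
];

console.log(topWordsForUser(ranking, ["awesome"], 2));
```

In a real system the list of words a user has used would come from a query on the words collection; whether to do this on demand or precompute it depends on the system size, as noted above.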
Upvotes: 3
Reputation: 4423
The group function is supposed to be a simpler version of MapReduce. You could use it like this to get a sum for each word:
db.coll.group(
{key: { a:true, b:true },
cond: { active:1 },
reduce: function(obj,prev) { prev.csum += obj.c; },
initial: { csum: 0 }
});
Upvotes: 1