Reputation: 677
First of all: I'm using Mongo 2.6 and Mongoose 3.8.8
I have the follow Schema:
var Link = new Schema({
title: { type: String, trim: true },
owner: { id: { type: Schema.ObjectId }, name: { type: String } },
url: { type: String, default: '', trim: true},
stars: { users: [ { name: { type: String }, _id: {type: Schema.ObjectId} }] },
createdAt: { type: Date, default: Date.now }
});
And my collection already have 500k documents.
What I need is sort the documents using a custom strategy. My initial solution was use the aggregate framework.
var today = new Date();
//fx = (TodayDay * TodayYear) - ( DocumentCreatedDay * DocumentCreatedYear)
var relevance = { $subtract: [
{ $multiply: [ { $dayOfYear: today }, { $year: today } ] },
{ $multiply: [ { $dayOfYear: '$createdAt' }, { $year: '$createdAt' } ] }
]}
var projection = {
_id: 1,
url: 1,
title: 1,
createdAt: 1,
thumbnail: 1,
stars: { $size: '$stars.users'}
ranking: { $multiply: [ relevance, { $size: '$stars.users' } ] }
}
var sort = {
$sort: { ranking: 1, stars: 1 }
}
var page = 1;
var limit = { $limit: 40 }
var skip = { $skip: ( 40 * (page - 1) ) }
var project = { $project: projection }
Link.aggregate([project, sort, limit, skip]).exec(resultCallback);
It works nicely until 100k, after that the query is getting slow and slow.
How I could accomplish that ?
Redesign ?
Wrong use of projection Am I doing ?
Thanks for your time !
Upvotes: 0
Views: 1230
Reputation: 151072
You can do all of this as you update and then you can actually index on ranking and use range queries in order to implement your paging. Much better than the use of $skip
and $limit
which in any form is bad news for large data. You should be able to find many sources that confirm that skip and limit is a poor practice for paging.
The only catch here is since you cannot use an .update()
type of statement to actually refer to the existing value of another field, you have to be careful with concurrency issues on updates. This required "rolling in" some custom lock handling which you can do with the .findOneAndUpdate()
method:
Link.findOneAndUpdate(
{ "_id": docId, "locked": false },
{ "locked": true },
function(err,doc) {
if ( doc.locked.true ) {
// then update your document
// I would just use the epoch date difference per day
var relevance = (
( Date.now.valueOf() - ( Date.now().valueOf() % 1000 * 60 * 60 * 24) )
- ( doc.createdAt.valueOf() - ( doc.createdAt.valueOf() % 1000 * 60 * 60 * 24 ))
);
var update = { "$set": { "locked": false } };
if ( actionAdd ) {
update["$push"] = { "stars.users": star };
update["$set"]["score"] = relevance * ( doc.stars.users.length +1 );
} else {
update["$pull"] = { "stars.users": star };
update["$set"]["score"] = relevance * ( doc.stars.users.length -1 );
}
// Then update
Link.findOneAndUpdate(
{ "_id": doc._id, "locked": update,function(err,newDoc) {
// possibly check that new "locked" is false, but really
// that should be okay
});
} else {
// some mechanism to retry "n" times at interval
// or report that you cannot update
}
}
)
The idea there is that you can only grab a document with a "locked" status equal to false
in order to actually update, and the first "update" operation just sets that value to true
so that no other operation could update the document until this completes.
As per the code comments, you probably want to have a few tries at doing this rather than just failing the update as there could be another operation adding or subtracting from the array.
Then depending on the "mode" of your current update if you are either adding to the array or taking an item off of there you simply alter the update statement to be issued to do either operation and set the appropriate "score" value in your document.
The update will then of course set the "locked" status to false
and it makes sense to check that the current status is not true
though it really should be okay at this point. But this gives you some room on being able to raise exceptions.
That manages the general update situation but you still have a problem with sorting out your "ranking" order here as skip and limit are still not what you want for performance. That is probably best handled by a periodic update of yet another field which you can use for a definitive "range" query, but you probably only really want to be concerned with the the most "relevant" score range in a set range of pages, rather than update the whole collection.
The update needs to be periodic as you will have concurrency problems if you try to change the "ranking" order of multiple documents in individual updates. So you need to make sure this process does not overlap with another such update.
As a final note consider your "score" calculation as what you really want is the newest and "most starred" content at the top. The current calculation has some flaws there such as on the same day and 0 "stars", but I'll leave that to you to work out.
This is essentially what you need to do for your solution. Trying to do this dynamically on a large collection using the aggregation framework is not going to produce favorable performance for your application experience. So there are few pointers here to things you can do to more efficiently maintain the order of your results.
Upvotes: 2