nlyn

Reputation: 606

Collaborative Filtering / Recommendation System performance and approaches

I'm really interested to find out how people approach collaborative filtering, recommendation engines etc. - and I mean this more in terms of script performance than the algorithms themselves. I have started reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.

I currently only have 2k users, but my current system is already proving not to be future proof and is very taxing on the server. The entire system is based on recommending posts to users. My application is PHP/MySQL, but I use MongoDB for some of the collaborative filtering - I'm on a large Amazon EC2 instance. My setup is really a 2-step process: first I calculate similarities between items, then I use that information to make recommendations. Here's how it works:

First my system calculates similarities between users' posts. The script runs an algorithm which examines information such as common tags, common commenters and common likers, and returns a similarity score for each pair of posts. The process goes like:

I have 5k posts, so all of the above is quite hard on the server - there's a huge amount of reads and writes to be performed. And this is only half the issue: I then use this information to work out which posts would be interesting to a particular user. Once an hour a cron job runs a script that calculates one recommended post for each user on the site. The process goes like so:

Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.
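To make the two steps concrete, here's a stripped-down sketch of the kind of thing each one does. This is purely illustrative - the function names and data shapes are simplified, and the real code works against MongoDB/MySQL rather than in-memory arrays:

    <?php
    // Step 1: pairwise similarity from shared tags / commenters / likers.
    function similarity(array $a, array $b): float {
        $score = 0.0;
        foreach (['tags', 'commenters', 'likers'] as $field) {
            $common = count(array_intersect($a[$field], $b[$field]));
            $union  = count(array_unique(array_merge($a[$field], $b[$field])));
            if ($union > 0) {
                $score += $common / $union;   // Jaccard overlap per signal
            }
        }
        return $score / 3;                    // average across the three signals
    }

    // Step 2: for one user, score every post they haven't liked yet by summing
    // its similarity to the posts they have liked, and keep the best one.
    function recommendFor(array $likedIds, array $allPostIds, array $sim): ?int {
        $best = null;
        $bestScore = 0.0;
        foreach ($allPostIds as $id) {
            if (in_array($id, $likedIds, true)) {
                continue;                     // skip posts the user already liked
            }
            $score = 0.0;
            foreach ($likedIds as $likedId) {
                $score += $sim[$likedId][$id] ?? 0.0;
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best      = $id;
            }
        }
        return $best;
    }

Step 1 runs over every pair of posts, and step 2 runs over every user times every post, which is why the cost balloons as the site grows.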

I'm just wondering if anyone feels I could be approaching this any better?

Upvotes: 4

Views: 3446

Answers (2)

JasonG

Reputation: 5962

I'm starting to plan how to do this myself. The first thing is possibly to replace your database technology, or supplement it, with either a triplestore or a graph database. That should give you much better performance when analysing similar likes or topics.

Next, yes - work with a subset. Take a few interests that the user has and get a small pool of users that have similar interests.

Then build indexes of likes in some meaningful order and count the inversions (divide and conquer - this is essentially merge sort, and you'll want to sort on the way out to count the split inversions anyway).
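To make the inversion-counting idea concrete, here's a minimal sketch (names are mine, not from the question) of a merge sort that counts split inversions as it merges - the fewer inversions between two users' orderings of the same items, the more similar their taste:

    <?php
    // Sort an array and count the inversions in it (pairs that are out of order).
    // Returns [sorted array, inversion count].
    function sortAndCount(array $a): array {
        $n = count($a);
        if ($n <= 1) {
            return [$a, 0];
        }
        [$left, $x]  = sortAndCount(array_slice($a, 0, intdiv($n, 2)));
        [$right, $y] = sortAndCount(array_slice($a, intdiv($n, 2)));

        // Merge step: every time an element of $right is taken while elements
        // of $left remain, it "jumps over" all of them - those are the split
        // inversions.
        $merged = [];
        $splits = 0;
        $i = $j = 0;
        while ($i < count($left) && $j < count($right)) {
            if ($left[$i] <= $right[$j]) {
                $merged[] = $left[$i++];
            } else {
                $merged[] = $right[$j++];
                $splits  += count($left) - $i;
            }
        }
        $merged = array_merge($merged, array_slice($left, $i), array_slice($right, $j));
        return [$merged, $x + $y + $splits];
    }

    [$sorted, $inversions] = sortAndCount([2, 4, 1, 3, 5]);  // $inversions === 3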

I hope that helps - you don't want to compare everything to everything else or it's definitely O(n^2). You should be able to replace that with something somewhere between constant and linear if you take sets of people who have similar likes and work from those.

For example, from a graph perspective, take something that they recently liked, look at the in-edges, trace them out, and analyse just those users. Maybe do this on a few recently liked articles, find the common set of users, and use that set for the collaborative filtering that finds articles the user would likely enjoy. Then you're at a workable problem size - especially in a graph, where there is no index growth (a popular article may have more in-edges to traverse, but that just gives you a better chance of finding usable data).
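As a rough sketch of that traversal (with in-memory arrays standing in for the graph store, and all names illustrative):

    <?php
    // $likedBy: article id => list of user ids who liked it (the in-edges).
    // $likes:   user id    => list of article ids they liked.
    function candidateArticles(array $recentlyLiked, array $likedBy, array $likes, array $alreadySeen): array {
        // 1. Follow the in-edges of a few recently liked articles to get a
        //    small pool of users with overlapping taste.
        $pool = [];
        foreach ($recentlyLiked as $articleId) {
            foreach ($likedBy[$articleId] ?? [] as $userId) {
                $pool[$userId] = true;
            }
        }

        // 2. Trace those users back out to the other articles they liked and
        //    count how often each candidate appears.
        $counts = [];
        foreach (array_keys($pool) as $userId) {
            foreach ($likes[$userId] ?? [] as $articleId) {
                if (!in_array($articleId, $alreadySeen, true)) {
                    $counts[$articleId] = ($counts[$articleId] ?? 0) + 1;
                }
            }
        }

        arsort($counts);              // most co-liked candidates first
        return array_keys($counts);
    }

The pool only ever contains users who share at least one recent like with the target user, so the amount of work tracks the size of that neighbourhood rather than the whole user base.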

Even better would be to key the articles themselves, so that if an article was liked by someone you can surface other articles those users liked (i.e. Amazon's 'users who bought this also bought').
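For the 'also liked' case, a single query over a hypothetical likes(user_id, post_id) table is enough - something along these lines:

    <?php
    $pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');  // adjust to your setup
    $postId = 123;                                                          // the post being viewed

    // Posts most often co-liked with the given post, Amazon-style.
    $sql = "SELECT other.post_id, COUNT(*) AS shared_likers
              FROM likes AS src
              JOIN likes AS other
                ON other.user_id = src.user_id
               AND other.post_id <> src.post_id
             WHERE src.post_id = :post_id
             GROUP BY other.post_id
             ORDER BY shared_likers DESC
             LIMIT 10";

    $stmt = $pdo->prepare($sql);
    $stmt->execute(['post_id' => $postId]);
    $alsoLiked = $stmt->fetchAll(PDO::FETCH_ASSOC);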

Hope that gives a few ideas. For graph analysis there are some frameworks that may help, like Faunus, for stats and derivations.

Upvotes: 1

symcbean

Reputation: 48357

With 5,000 posts that's 25,000,000 ordered pairs (about 12.5 million distinct relationships), and the count grows as O(n^2).

Your first problem is how to avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching, and you could use date ranges to limit common 'likes'. Beyond that, we'd need to know a lot more about the methodology for establishing relationships.

Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.
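For example, with a Redis list as the queue (any queue - Beanstalkd, RabbitMQ, SQS - works the same way; this assumes the phpredis extension, and the queue/field names are illustrative):

    <?php
    // Producer - runs when a post is created: enqueue the work instead of
    // leaving it for the hourly batch.
    $newPostId = 42;                          // id of the post that was just created
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $redis->lPush('similarity_jobs', json_encode(['post_id' => $newPostId]));

    // Consumer - a long-running worker: pop jobs and compute similarities for
    // just the new post against a candidate subset, not the whole corpus.
    while (true) {
        $job = $redis->brPop(['similarity_jobs'], 30);   // blocks for up to 30s
        if (empty($job)) {
            continue;                                    // nothing arrived in time
        }
        $postId = json_decode($job[1], true)['post_id'];
        updateSimilaritiesFor($postId);                  // hypothetical application function
    }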

Indeed depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.
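A sketch of that 'regenerate on read' variant, using a hypothetical post_similarity_cache table and a one-hour freshness window:

    <?php
    function similarPostsFor(int $postId, PDO $pdo): array {
        $stmt = $pdo->prepare(
            "SELECT payload, generated_at FROM post_similarity_cache WHERE post_id = :pid"
        );
        $stmt->execute(['pid' => $postId]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        // Fresh enough? Serve the cached list and do no work at all.
        if ($row && strtotime($row['generated_at']) > time() - 3600) {
            return json_decode($row['payload'], true);
        }

        // Stale or missing: rebuild for this one post only, then write it back.
        $similar = computeSimilarPosts($postId);   // hypothetical - same scoring as the batch
        $stmt = $pdo->prepare(
            "REPLACE INTO post_similarity_cache (post_id, payload, generated_at)
             VALUES (:pid, :payload, NOW())"
        );
        $stmt->execute(['pid' => $postId, 'payload' => json_encode($similar)]);
        return $similar;
    }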

And if I were writing a platform to measure relationships in data, then (the clue is in the name) I'd definitely be leaning towards a relational database, where joins are easy and much of the logic can be implemented on the database tier.
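To illustrate: with hypothetical post_tags(post_id, tag_id) and post_similarity(post_id_a, post_id_b, score) tables, one INSERT ... SELECT with a self-join does the tag-overlap part of the scoring in a single statement, instead of a huge number of per-pair reads from application code:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');  // existing MySQL connection

    $pdo->exec(
        "INSERT INTO post_similarity (post_id_a, post_id_b, score)
         SELECT a.post_id, b.post_id, COUNT(*) AS shared_tags
           FROM post_tags AS a
           JOIN post_tags AS b
             ON b.tag_id  = a.tag_id
            AND b.post_id > a.post_id
          GROUP BY a.post_id, b.post_id
             ON DUPLICATE KEY UPDATE score = VALUES(score)"
    );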

It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address - but the benefits mainly come from being able to run the algorithm in parallel across lots of machines; at the end of the day it takes just as many clock ticks.

Upvotes: 2
