Reputation: 875
I currently have an implementation of a recommender in mahout using the in memory recommendation apis. However, I would like to move to a distributed solution using hadoop in order to calculate offline recommendations. This is my first time using hadoop and I'm looking for clarification on a few concepts and api usages.
Currently, my understanding of hadoop is minimal and I think that the correct approach is the following:
use something like apache drill in order to populate the hdfs with the user and item data.
using the recommendation job in mahout train on the data from the hdfs.
transform the resultant data in the hdfs to index shards to be used by solr
use solr to provide the recommendations to the userbase
However, I am looking for clarifications on a couple aspects of this design:
How would I utilize a rescorer in the manner that it is used in the in memory live recommendations?
What is the best manner in which to invoke the recommendations job?
I have other questions besides these two but the answers to these would be a huge help.
Upvotes: 0
Views: 174
Reputation: 5702
You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple different ways.
The basic recommender can be put together in two ways:
The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB attaching the similar item list to my collection of items. So item123 got item223 item643 item293 item445... in its indicator field. After you index the collection the query is then = "item223 item344 item445..." -- the user's prefered items.
Here are three ways to do rescoring:
The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.
Note: You don't have to worry about Solr shards necessarily. Solr will index most DBs and HDFS directly but only the index is sharded. You shard an index if you have a very big one, you replicate it if you have lots of queries/second (or for failover). Solr queries are generally very fast so I'd worry about that after you have a functioning system since it's a config thing and shouldn't be affected by the rest of your workflow.
Upvotes: 2