Reputation: 5380
Does anybody know how I could transform the code found in the Mahout in Action book, regarding the recommendation engines, so that it is consistent with a Hadoop fully-distributed environment? My main difficulty is transforming my code (which currently reads from and writes to a local disk) so that it runs in a pseudo-distributed environment (such as Cloudera's). Is the solution to my problem as simple as this one, or should I expect something more complex than that?
Upvotes: 1
Views: 655
Reputation: 66891
A truly distributed computation is quite different from a non-distributed computation, even when it computes the same result. The structure is not the same, and neither is the infrastructure it uses.
If you are just asking how to make the pseudo-distributed solution work with local files: you would ignore the Hadoop input/output mechanism and write a Mapper that reads your input from somewhere on HDFS and copies it to local disk.
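As a rough illustration of that hack, the copy can happen in the Mapper's setup phase; the class name and both paths below are illustrative assumptions, not anything from the book:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: pulls the ratings file off HDFS onto the task node's local
// disk, so the existing non-distributed code can read it as a plain file.
public class LocalCopyMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // HDFS source and local destination are assumed paths
    fs.copyToLocalFile(new Path("/data/ratings.csv"),
                       new Path("/tmp/ratings.csv"));
    // ...the unmodified local-disk code can now run against /tmp/ratings.csv
  }
}
```

Each task copies its own local snapshot, which is exactly why this is a hack rather than a distributed computation: the work itself is still single-machine.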
If you are asking how to actually distribute the computation, then you would have to switch to the (completely different) distributed implementations in the project. These actually use Hadoop to split up the computation. The process above is a hack that just runs many non-distributed tasks within a Hadoop container. These implementations are, however, completely offline.
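For reference, the distributed item-based recommender ships as a runnable Hadoop job. A hedged sketch of invoking it, where the jar version, paths, and parameter values are assumptions you would adapt:

```shell
# Sketch: run Mahout's distributed recommender as a Hadoop job.
# Jar name and HDFS paths are illustrative assumptions.
hadoop jar mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/hadoop/input/ratings.csv \
  --output /user/hadoop/output \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --numRecommendations 10
```

Note that this produces recommendations as a batch output on HDFS; there is no serving component, which is what "completely offline" means here.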
If you mean that you want a real-time recommender like the ones in Mahout's .cf.taste packages, but also want to actually use Hadoop's distributed computing power, then you need more than Mahout. It's either one or the other in Mahout; there is code that does one and code that does the other, but they are not related.
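To make the contrast concrete, here is a minimal sketch of the real-time, in-memory side using the Taste classes; the file name and IDs are assumptions. Everything here runs on one machine, which is the half of the trade-off that Hadoop does not help with:

```java
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Sketch: a user-based Taste recommender, entirely in-memory on one JVM.
public class TasteSketch {
  public static void main(String[] args) throws Exception {
    // CSV lines of the form: userID,itemID,rating (assumed local file)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    PearsonCorrelationSimilarity similarity =
        new PearsonCorrelationSimilarity(model);
    NearestNUserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 5 recommendations for user 1, answered at request time
    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item);
    }
  }
}
```

This answers queries immediately but is limited by a single machine's memory, whereas the Hadoop jobs scale out but only recompute in batches; Mahout does not bridge the two.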
This is exactly what Myrrix is, by the way. I don't mind advertising it here, since it sounds like exactly what you may be looking for. It's an evolution of the work I began in this Mahout code. Among other things, it's a two-tier architecture that has the real-time elements of Taste but can also transparently offload the computation to a Hadoop cluster.
Upvotes: 3