Reputation: 707
I am implementing the k-means algorithm on Hadoop (old API) and I am stuck at a point where I am not able to figure out how to proceed further.
My logic so far is:
Maintain two files: centroids and data.
Read the centroids into an in-memory list (ArrayList).
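Roughly, the map step I have in mind (the class name, the comma-separated point format, and the emitted types are placeholders, not finalized code) is to emit each point keyed by the index of its nearest centroid, so that a reducer can average the points of each cluster:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KMeansMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

    // Centroids held in memory; to be filled in before map() runs.
    private final List<double[]> centroids = new ArrayList<double[]>();

    public void map(LongWritable key, Text value,
            OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // Assumed record format: comma-separated coordinates of one point.
        String[] parts = value.toString().split(",");
        double[] point = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            point[i] = Double.parseDouble(parts[i]);
        }
        // Key each point by its nearest centroid so the reducer can
        // average a cluster's points into its new centroid.
        output.collect(new IntWritable(nearest(point)), value);
    }

    // Index of the centroid closest to the point (squared Euclidean distance).
    private int nearest(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.size(); c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids.get(c)[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}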
Upvotes: 2
Views: 3979
Reputation: 7507
Your flow looks almost all right to me. You shouldn't be "reading" your data file yourself; you should just specify it as the input to your Hadoop job. Beyond that, while not always ideal, MapReduce/Hadoop can handle iterative algorithms just fine. Just make sure you link the output of each iteration's reducer to the input of the next iteration's mapper.
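As a minimal sketch of that driver loop (old API, as in the question; the paths, class names, and fixed iteration count are all assumptions, and a convergence check could replace the fixed count):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        String centroids = "/kmeans/centroids-0";  // assumed initial centroids in HDFS
        for (int iter = 0; iter < 10; iter++) {
            JobConf conf = new JobConf(KMeansDriver.class);
            conf.setJobName("k-means iteration " + iter);
            // Ship the current centroids to every mapper via the distributed cache.
            DistributedCache.addCacheFile(new URI(centroids), conf);
            conf.setMapperClass(KMeansMapper.class);    // hypothetical classes
            conf.setReducerClass(KMeansReducer.class);
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(Text.class);
            // The data file stays the job input on every iteration.
            FileInputFormat.setInputPaths(conf, new Path("/kmeans/data"));
            String next = "/kmeans/centroids-" + (iter + 1);
            FileOutputFormat.setOutputPath(conf, new Path(next));
            JobClient.runJob(conf);  // blocks until this iteration finishes
            centroids = next;        // reducer output feeds the next iteration
        }
    }
}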
Yes, that is correct. You can either use the distributed cache or just store the file at a known HDFS path. Reading in the centroid file at the start of a mapper can be done inside the setup method.
public class MyMapper extends Mapper {
    @Override
    protected void setup(Context context) {
        // Setup code. Read in centroid file.
    }
}
See my code above; use the setup method.
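For instance, fleshing out that stub, the setup body might look like the following (the one-centroid-per-line, comma-separated file format and the use of the distributed cache are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private final List<double[]> centroids = new ArrayList<double[]>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Assumes the driver put the centroid file on the distributed cache.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed format: one centroid per line, comma-separated coordinates.
                String[] parts = line.split(",");
                double[] c = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    c[i] = Double.parseDouble(parts[i]);
                }
                centroids.add(c);
            }
        } finally {
            reader.close();
        }
    }
}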
I think the centroid file is fine; that file will be trivial to pass around due to its very small size (unless you are trying to have millions of clusters...).
Upvotes: 2
Reputation: 25909
Hadoop is not the best fit for k-means, as you need to run several MapReduce jobs to get the result. That said, Apache Mahout includes a k-means implementation for Hadoop, and you can use it (or look at its implementation).
Apache Hama, which is a Bulk Synchronous Parallel processing framework on top of Hadoop, is a better fit, as it allows message passing between the participating components, and it has a k-means implementation [3]. Solutions like this will probably become more common on Hadoop once YARN (Hadoop 2) is out of beta.
Upvotes: 1