JackSparrow

Reputation: 707

K-Means Algorithm Hadoop

I am implementing the K-Means algorithm on Hadoop (old API) and I am stuck at a point where I cannot figure out how to proceed.

My logic so far is:

Maintain two files: centroids & data.

My Questions

  1. Is my flow correct?
  2. Is it the right way to store the centroid file in some collection first and then proceed further?
  3. If I go with approach (2), how do I store this centroid file in some collection? Since the map function scans its input file line by line, how can I read the centroid file first, before the mapper runs on the data file?
  4. What would be a better way to handle this centroid file functionality?

Upvotes: 2

Views: 3979

Answers (2)

greedybuddha

Reputation: 7507

  1. Your flow looks almost right to me. You shouldn't be "reading" your data file yourself; just specify it as the input to your Hadoop job. Beyond that, while not always ideal, MapReduce/Hadoop systems can handle iterative algorithms just fine. Just make sure you link the output of the reducer to the input of the mapper (see the driver sketch at the end of this answer).

  2. Yes, that is correct. You can use either the distributed cache, or just store the file in an HDFS path. Reading in the centroid file at the start of a mapper can be done inside the "setup" method:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Setup code: runs once per map task, before any call to map().
            // Read in the centroid file here and cache it in a field.
        }
    }
    
  3. See my code above and the fuller sketch after this list; use the "setup" method.

  4. I think the centroid file is fine; it will be trivial to pass around because it is so small (unless you are trying to have millions of clusters...).
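
To make (2) and (3) concrete, here is a minimal sketch of a mapper that reads the centroids in `setup` via the distributed cache. It assumes the centroid file holds one comma-separated centroid per line and that the driver has registered the file with `DistributedCache.addCacheFile(...)`; the class name `KMeansMapper` and the file layout are illustrative assumptions, not code from the question.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Centroids parsed once in setup() and reused for every map() call.
        private final List<double[]> centroids = new ArrayList<double[]>();

        @Override
        protected void setup(Context context) throws IOException {
            // Assumes the driver registered the centroid file beforehand with
            // DistributedCache.addCacheFile(new URI("<hdfs path>"), conf).
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                double[] centroid = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    centroid[i] = Double.parseDouble(parts[i]);
                }
                centroids.add(centroid);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the data point, find the nearest entry in `centroids`,
            // and emit (nearestCentroidId, point) for the reducer to average.
        }
    }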
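
And for the iteration in (1), a minimal sketch of a driver loop that feeds each reducer's output back in as the next iteration's centroid input; `runKMeansJob` and `centroidsConverged` are hypothetical helpers you would implement yourself, not Hadoop API.

    import org.apache.hadoop.fs.Path;

    public class KMeansDriver {
        public static void main(String[] args) throws Exception {
            Path data = new Path(args[0]);       // data file, same every pass
            Path centroids = new Path(args[1]);  // initial centroids

            for (int i = 0; i < 20; i++) {       // cap the iteration count
                Path newCentroids = new Path(args[1] + "-iter" + i);

                // One full MapReduce job per iteration: the mapper assigns
                // points to the current centroids, the reducer averages them.
                runKMeansJob(data, centroids, newCentroids);

                if (centroidsConverged(centroids, newCentroids)) {
                    break;                       // centroids stopped moving
                }
                centroids = newCentroids;        // reducer output feeds the next mapper
            }
        }

        // Hypothetical helper: configure a job with KMeansMapper, your reducer,
        // the input/output paths, and the cached centroid file, then run it.
        private static void runKMeansJob(Path data, Path oldCentroids, Path out)
                throws Exception {
        }

        // Hypothetical helper: read both centroid files from HDFS and compare
        // them under some distance threshold.
        private static boolean centroidsConverged(Path oldCentroids, Path newCentroids) {
            return false; // replace with a real convergence check
        }
    }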

Upvotes: 2

Arnon Rotem-Gal-Oz

Reputation: 25909

Hadoop is not the best fit for k-means, since you need to run several MapReduce jobs to get the result. That said, Apache Mahout includes a k-means implementation for Hadoop, and you can use it (or look at its implementation).

Apache Hama, a Bulk Synchronous Parallel processing framework on top of Hadoop, is a better fit, as it allows message passing between the participating components, and it also has a k-means implementation [3]. It will be more common to see solutions like this on Hadoop 2 once YARN is out of beta.

Upvotes: 1
