JackSparrow

Reputation: 707

K-Means Algorithm Hadoop

I am implementing the K-Means algorithm on Hadoop (old API) and I am stuck at a point where I cannot figure out how to proceed.

My logic so far is:

Maintain two files: centroids & data.

My Questions

  1. Is my flow correct?
  2. Is it the right way to store the centroid file in some collection first and then proceed further?
  3. If I go with approach (2), how do I store this centroid file in some collection? Since the map function scans its input file line by line, how can I read the centroid file first, before the mapper runs on the data file?
  4. What would be a better way to handle this centroid file functionality?

Upvotes: 2

Views: 3979

Answers (2)

greedybuddha

Reputation: 7507

  1. Your flow looks almost right to me. You shouldn't be "reading" your data file yourself; just specify it as the input to your Hadoop job. Beyond that, while not always ideal, MapReduce/Hadoop systems can handle iterative algorithms just fine. Just make sure you link the output of the reducer to the input of the mapper (see the driver sketch at the end of this answer).

  2. Yes, that is correct. You can use either the distributed cache, or just store the file in an HDFS path. Reading in the centroid file at the start of a mapper can be done inside the "setup" method:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Setup code: runs once per map task, before any call to map().
            // Read in the centroid file here and cache it in a field.
        }
    }
    
  3. See my code above and the fuller sketch after this list; use the "setup" method.

  4. I think the centroid file is fine; it will be trivial to pass around because it is so small (unless you are trying to have millions of clusters...).
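
To make (2) and (3) concrete, here is a minimal sketch of a mapper that reads the centroids in `setup` via the distributed cache. It assumes the centroid file holds one comma-separated centroid per line and that the driver has registered the file with `DistributedCache.addCacheFile(...)`; the class name `KMeansMapper` and the file layout are illustrative assumptions, not code from the question.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Centroids parsed once in setup() and reused for every map() call.
        private final List<double[]> centroids = new ArrayList<double[]>();

        @Override
        protected void setup(Context context) throws IOException {
            // Assumes the driver registered the centroid file beforehand with
            // DistributedCache.addCacheFile(new URI("<hdfs path>"), conf).
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                double[] centroid = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    centroid[i] = Double.parseDouble(parts[i]);
                }
                centroids.add(centroid);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the data point, find the nearest entry in `centroids`,
            // and emit (nearestCentroidId, point) for the reducer to average.
        }
    }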
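
And for the iteration in (1), a minimal sketch of a driver loop that feeds each reducer's output back in as the next iteration's centroid input; `runKMeansJob` and `centroidsConverged` are hypothetical helpers you would implement yourself, not Hadoop API.

    import org.apache.hadoop.fs.Path;

    public class KMeansDriver {
        public static void main(String[] args) throws Exception {
            Path data = new Path(args[0]);       // data file, same every pass
            Path centroids = new Path(args[1]);  // initial centroids

            for (int i = 0; i < 20; i++) {       // cap the iteration count
                Path newCentroids = new Path(args[1] + "-iter" + i);

                // One full MapReduce job per iteration: the mapper assigns
                // points to the current centroids, the reducer averages them.
                runKMeansJob(data, centroids, newCentroids);

                if (centroidsConverged(centroids, newCentroids)) {
                    break;                       // centroids stopped moving
                }
                centroids = newCentroids;        // reducer output feeds the next mapper
            }
        }

        // Hypothetical helper: configure a job with KMeansMapper, your reducer,
        // the input/output paths, and the cached centroid file, then run it.
        private static void runKMeansJob(Path data, Path oldCentroids, Path out)
                throws Exception {
        }

        // Hypothetical helper: read both centroid files from HDFS and compare
        // them under some distance threshold.
        private static boolean centroidsConverged(Path oldCentroids, Path newCentroids) {
            return false; // replace with a real convergence check
        }
    }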

Upvotes: 2

Arnon Rotem-Gal-Oz

Reputation: 25909

Hadoop is not the best fit for k-means, since you need to run several MapReduce jobs to get the result. That said, Apache Mahout includes a k-means implementation for Hadoop, and you can use it (or look at its implementation).

Apache Hama, a Bulk Synchronous Parallel processing framework on top of Hadoop, is a better fit, as it allows message passing between the participating components, and it also has a k-means implementation [3]. It will be more common to see solutions like this on Hadoop 2 once YARN is out of beta.

Upvotes: 1
