SaB

Reputation: 787

How can you integrate Mahout KMeans clustering into an application?

I am trying to use Mahout KMeans in a simple application. I manually create a series of Vectors from database content. I simply want to feed these vectors to Mahout (0.9), for example to KMeansClusterer, and use the output.

I read Mahout in Action (whose examples target version 0.5) and many online forums for background. However, I can no longer see any way to use Mahout KMeans (or related clustering) without passing file names and file paths through Hadoop. The documentation is very sketchy. Can Mahout still be used this way? Are there any current examples of using Mahout KMeans from code rather than from the command line?

    private List<Cluster> kMeans(List<Vector> allvectors, double closeness, int numclusters, int iterations) {
        List<Cluster> clusters = new ArrayList<Cluster>();

        int clusterId = 0;
        for (Vector v : allvectors) {
            clusters.add(new Kluster(v, clusterId++, new EuclideanDistanceMeasure()));
        }

        List<List<Cluster>> finalclusters = KMeansClusterer.clusterPoints(allvectors, clusters, 0.01, numclusters, 10);

        for (Cluster cluster : finalclusters.get(finalclusters.size() - 1)) {
            System.out.println("Fuzzy Cluster id: " + cluster.getId() + " center: " + cluster.getCenter().asFormatString());
        }

        return clusters;
    }

Upvotes: 2

Views: 404

Answers (1)

rkmalaiya

Reputation: 508

First, you need to write your vectors to a Hadoop SequenceFile. The code below does that:

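    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import com.google.common.io.Closeables;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // userName (the vector's name) and writeFile (the output path) are defined elsewhere.
    List<VectorWritable> vectors = new ArrayList<>();
    double[] vectorValues = { /* your vector values */ };
    vectors.add(new VectorWritable(new NamedVector(new DenseVector(vectorValues), userName)));

    Configuration conf = new Configuration();
    // Resolve the file system from the output file's URI.
    FileSystem fs = FileSystem.get(new File(writeFile).toURI(), conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path(writeFile), Text.class, VectorWritable.class);

    try {
        int i = 0;
        for (VectorWritable vw : vectors) {
            writer.append(new Text("mapred_" + i++), vw);
        }
    } finally {
        Closeables.close(writer, false);
    }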

Then use the code below to generate the clusters. KMeans needs a set of initial clusters as input, so I have used Canopy to generate them.

However, you will not be able to read the cluster output directly, because it is also written in SequenceFile format. You need to run the ClusterDumper class from the mahout-integration JAR to read and interpret your clusters.

    Configuration conf = new Configuration();
    // Canopy pass to produce the initial clusters:
    // t1 = 3.1, t2 = 2.1, runClustering = true, clusterClassificationThreshold = 0.5, runSequential = true.
    CanopyDriver.run(conf, new Path(inputPath), new Path(canopyOutputPath), new ManhattanDistanceMeasure(), 3.1, 2.1, true, 0.5, true);

    // Now run the KMeansDriver job, seeded with the canopy output (convergenceDelta = 0.001, maxIterations = 10).
    KMeansDriver.run(conf, new Path(inputPath), new Path(canopyOutputPath + "/clusters-0-final/"), new Path(kmeansOutput), new EuclideanDistanceMeasure(), 0.001, 10, true, 2d, false);
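
For intuition about why KMeans needs initial clusters and what the convergence delta and iteration cap control, here is a minimal in-memory sketch of the k-means loop in plain Java. It is not Mahout's implementation and uses no Mahout or Hadoop classes. The class `InMemoryKMeans` and its methods are hypothetical names made up for illustration. It assumes dense `double[]` points, Euclidean distance, and caller-supplied initial centers (the role the Canopy output plays above).

```java
import java.util.Arrays;

/** Hypothetical helper: plain-Java k-means (Lloyd's algorithm), not Mahout code. */
final class InMemoryKMeans {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to the given point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Iterates from the initial centers until no center moves more than
     * convergenceDelta or maxIterations is reached; returns the final centers.
     */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double convergenceDelta, int maxIterations) {
        int k = initialCenters.length;
        int dim = initialCenters[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone(); // leave the caller's seeds untouched
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the running sum of its nearest center.
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each non-empty center to the mean of its points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                maxShift = Math.max(maxShift, distance(centers[c], updated));
                centers[c] = updated;
            }
            if (maxShift <= convergenceDelta) {
                break; // converged
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] seeds = {{0, 0}, {10, 10}};
        for (double[] center : cluster(points, seeds, 0.001, 10)) {
            System.out.println(Arrays.toString(center));
        }
        // prints [0.0, 0.5] then [10.0, 10.5]
    }
}
```

A loop like this is only practical when all vectors fit in memory. The driver-based code above trades that simplicity for the ability to run the same steps over files in Hadoop.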

Upvotes: 2
