Reputation: 1704
I have seen that there is a new implementation of k-means in Mahout called Streaming k-means, which performs k-means clustering without chained Mapper-Reducer cycles:
https://github.com/dfilimon/mahout/tree/epigrams
I cannot find any articles on its usage anywhere. Could anyone point me to useful links that include code examples showing how to use it?
Upvotes: 3
Views: 2803
Reputation: 101
Here are the steps for running Streaming k-means:
mahout streamingkmeans -i "" -o "" --tempDir "" -ow -sc org.apache.mahout.math.neighborhood.FastProjectionSearch -k -km
-k = number of clusters
-km = k * log(n), where k is the number of clusters and n is the number of data points to cluster; round this to the nearest integer
For the -sc parameter you can use FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch.
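The -km rule above can be worked out with a few lines of Java. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes that log means the natural logarithm, which the answer does not state explicitly.

```java
// Hypothetical helper, not part of Mahout: computes the value to pass as -km.
final class MapClusterEstimate {

    /**
     * Returns k * ln(n) rounded to the nearest integer.
     * Assumption: "log" in the rule is the natural logarithm.
     */
    static int estimateMapClusters(int k, long n) {
        if (k <= 0 || n <= 1) {
            throw new IllegalArgumentException("need k > 0 and n > 1");
        }
        return (int) Math.round(k * Math.log(n));
    }

    public static void main(String[] args) {
        // 12 clusters over 100,000 data points: 12 * ln(100000) is about 138.16
        System.out.println(estimateMapClusters(12, 100_000L)); // prints 138
    }
}
```

The result would then be passed on the command line as the -km value, next to -k.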
Upvotes: 2
Reputation: 742
StreamingKMeans is a new feature in Mahout 0.8. For more details on its algorithms, see:
"Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni, http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
"Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson, http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
As you mentioned, there are no articles on its usage. As with the other clustering algorithms, there is a driver to which you can pass configuration parameters as a string array, and it will cluster your data:
String[] args1 = new String[] {
    "-i", "/home/name/workspace/XXXXX-vectors/tfidf-vectors",
    "-o", "/home/name/workspace/XXXXX-vectors/tfidf-vectors/SKM-Main-result/",
    "--estimatedNumMapClusters", "200",
    "--searchSize", "2",
    "-k", "12",
    "--numBallKMeansRuns", "3",
    "--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure"
};
StreamingKMeansDriver.main(args1);
To get a description of the important parameters, deliberately pass an invalid option such as "-iiii" as the first parameter. The driver will then print the parameters, their descriptions, and their default values.
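If the options are computed at runtime, it can be easier to assemble the array from named values than to write it by hand. The following is a sketch under assumptions: the class, its methods, and the paths are hypothetical, and only the option names come from the snippet above. The actual driver call is left as a comment because it needs Mahout on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: builds the String[] for the driver.
final class StreamingKMeansArgs {

    /** Flattens option/value pairs into the flat array form main(String[]) expects. */
    static String[] toArgs(Map<String, String> options) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            args.add(e.getKey());
            args.add(e.getValue());
        }
        return args.toArray(new String[0]);
    }

    /** Same options as in the answer; the values are illustrative. */
    static Map<String, String> exampleOptions(String input, String output, int k, int mapClusters) {
        Map<String, String> o = new LinkedHashMap<>(); // keeps insertion order
        o.put("-i", input);
        o.put("-o", output);
        o.put("--estimatedNumMapClusters", Integer.toString(mapClusters));
        o.put("--searchSize", "2");
        o.put("-k", Integer.toString(k));
        o.put("--numBallKMeansRuns", "3");
        o.put("--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure");
        return o;
    }

    public static void main(String[] unused) {
        // "/path/to/..." values are placeholders, not real locations.
        String[] args = toArgs(exampleOptions("/path/to/tfidf-vectors", "/path/to/output", 12, 200));
        System.out.println(String.join(" ", args));
        // With Mahout on the classpath, the array would be handed to the driver:
        // StreamingKMeansDriver.main(args);
    }
}
```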
If you don't want to use it this way, read StreamingKMeansMapper, StreamingKMeansReducer, and StreamingKMeansThread. The code of these three classes will help you understand how the algorithm is used and customize it for your needs. The mapper uses StreamingKMeans to produce estimated clusters of the input data. To get the k final clusters, the reducer takes the intermediate points (the centroids generated in the previous step) and clusters them into k clusters using BallKMeans.
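The two-phase idea described above can be illustrated with a toy example in plain Java. This is not Mahout's code: all names are invented, the streaming pass uses a fixed distance cutoff instead of the randomized, adaptive rule from the papers, and the second phase is a plain weighted Lloyd iteration seeded with the first k sketch centroids rather than BallKMeans.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; a simplified stand-in for the mapper/reducer roles.
final class TwoPhaseSketch {

    /** A centroid together with the total weight of the points merged into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] p, double w) { mean = p.clone(); weight = w; }

        /** Folds a weighted point into this centroid (weighted running mean). */
        void merge(double[] p, double w) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] = (mean[i] * weight + p[i] * w) / (weight + w);
            }
            weight += w;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the closest centroid, or -1 if the list is empty. */
    static int nearest(List<Centroid> cs, double[] p) {
        int best = -1;
        double bestD = Double.POSITIVE_INFINITY;
        for (int i = 0; i < cs.size(); i++) {
            double d = distance(cs.get(i).mean, p);
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }

    /** Phase 1 (mapper role): a single pass that builds many rough centroids. */
    static List<Centroid> streamingPass(double[][] points, double cutoff) {
        List<Centroid> sketch = new ArrayList<>();
        for (double[] p : points) {
            int j = nearest(sketch, p);
            if (j < 0 || distance(sketch.get(j).mean, p) > cutoff) {
                sketch.add(new Centroid(p, 1.0)); // too far from everything: new centroid
            } else {
                sketch.get(j).merge(p, 1.0);      // close enough: fold into nearest
            }
        }
        return sketch;
    }

    /** Phase 2 (reducer role): cluster the weighted sketch centroids down to k. */
    static List<Centroid> reduceToK(List<Centroid> sketch, int k, int iterations) {
        List<Centroid> centers = new ArrayList<>();
        for (int i = 0; i < Math.min(k, sketch.size()); i++) {
            centers.add(new Centroid(sketch.get(i).mean, sketch.get(i).weight));
        }
        for (int it = 0; it < iterations; it++) {
            List<Centroid> next = new ArrayList<>();
            for (Centroid c : centers) next.add(new Centroid(c.mean, 0.0));
            for (Centroid s : sketch) {
                next.get(nearest(centers, s.mean)).merge(s.mean, s.weight);
            }
            centers = next; // a center that received nothing keeps its old mean
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0.0, 0}, {10.0, 0}, {0.1, 0}, {10.1, 0}, {0.5, 0}, {10.5, 0}, {0.6, 0}, {10.6, 0}
        };
        List<Centroid> sketch = streamingPass(points, 0.2);
        List<Centroid> result = reduceToK(sketch, 2, 5);
        System.out.println(sketch.size() + " sketch centroids -> " + result.size() + " final clusters");
        for (Centroid c : result) {
            System.out.printf("x=%.2f weight=%.0f%n", c.mean[0], c.weight);
        }
    }
}
```

The point of the split is that the first pass only has to be approximately right: it compresses the input into a small weighted summary, and the more careful clustering then runs on that summary instead of on the full data.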
Upvotes: 3