Reputation: 3653
I'm reading that i can create mahout vectors from a lucene index that can be used to apply the mahout clustering algorithms. http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
I would like to apply K-means clustering algorithm in the documents in my Lucene index, but it is not clear how can i apply this algorithm (or hierarchical clustering) to extract meaningful clusters with these documents.
In this page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. My data points are the documents? How can i "declare" that these are my documents (or their vectors) , simply take them and do the clustering?
sorry in advance for my poor grammar
Thank you
Upvotes: 12
Views: 7411
Reputation: 9252
A pretty good howto is here: integrating apache mahout with apache lucene
Upvotes: 1
Reputation: 21
@ maiky You can read more about reading the output and using clusterdump utility in this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
Upvotes: 0
Reputation: 5042
If you have vectors, you can run KMeansDriver. Here is the help for the same.
Usage:
[--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
--input (-i) input The Path for input Vectors. Must be a
SequenceFile of Writable, Vector
--clusters (-c) clusters The input centroids, as Vectors. Must be a
SequenceFile of Writable, Cluster/Canopy.
If k is also specified, then a random set
of vectors will be selected and written out
to this path first
--output (-o) output The Path to put the output in
--distance (-m) distance The Distance Measure to use. Default is
SquaredEuclidean
--convergence (-d) convergence The threshold below which the clusters are
considered to be converged. Default is 0.5
--max (-x) max The maximum number of iterations to
perform. Default is 20
--numReduce (-r) numReduce The number of reduce tasks
--k (-k) k The k in k-Means. If specified, then a
random selection of k Vectors will be
chosen as the Centroid and written to the
clusters output path.
--vectorClass (-v) vectorClass The Vector implementation class name.
Default is SparseVector.class
--overwrite (-w) If set, overwrite the output directory
--help (-h) Print out help
Update: Get the result directory from HDFS to local fs. Then use ClusterDumper utility to get the cluster and list of documents in that cluster.
Upvotes: 3