Reputation: 31
I am new to Mahout. I am using Mahout 0.8 and followed the tutorial at https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
When I run:
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata -o output -t1 20 -t2 50 -k 5 -x 20 -ow
and then use clusterdump to extract the cluster centers:
mahout clusterdump --input output/clusters-20-final --output /media/synthetic_control.center
the synthetic_control.center file contains:
VL-585{n=50 c=[29.832, 29.589, 29.405, 28.516, 29.600, ….] r=[3.152, 3.518, 3.292, …]}
VL-591{n=197 c=[29.984, 29.681,…] r=[3.602, 3.558, 3.364,…]}
VL-595{n=203 c=[….] r=[….]}
VL-597{n=61 c=[….] r=[….]}
VL-599{n=43 c=[….] r=[….]}
VL-585{n=1 c=[….] r=[….]}
VL-591{n=27 c=[….] r=[….]}
VL-595{n=1 c=[….] r=[….]}
VL-597{n=1 c=[….] r=[….]}
VL-599{n=16 c=[….] r=[….]}
It seems that k-means generates 10 clusters, even though I set k to 5.
I also tried other values of k, and it always generates twice as many clusters as requested.
Can anyone help me with this? Thanks a lot!
Upvotes: 2
Views: 1185
Reputation: 31
Finally, after reading the code, I found the bug in org.apache.mahout.clustering.syntheticcontrol.kmeans.Job!
Here is what happens: in syntheticcontrol.kmeans.Job, if the user sets k, the job does not run canopy clustering before k-means but runs k-means directly. K-means needs an initial center for each cluster, so the job uses RandomSeedGenerator to generate the cluster centers at random and writes that file (part-randomSeed) to the output/clusters-0 folder.
After this, k-means first uses these centers to classify all the points, updates the cluster centers, and writes the updated centers to the output/clusters-0 folder as well. So the clusters-0 folder contains two sets of centers, and the first iteration reads twice as many clusters as requested. This is why the job always generates double the number of clusters.
Solution: save part-randomSeed to a different folder. In org.apache.mahout.clustering.syntheticcontrol.kmeans.Job, line 142, change
Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
to
Path clusters = new Path(output, "randomSeeds");
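To make the mechanism easier to see, here is a small stand-alone simulation. It is not Mahout code: it only uses java.nio on a temporary local directory, and the class name SeedDirDemo, the file name part-00000, and the "center-i" placeholder lines are made up for illustration. It assumes, as described above, that the first iteration reads every file it finds in clusters-0.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedDirDemo {

    // Writes k placeholder "centers" (one per line) into dir/fileName.
    // The file contents are hypothetical stand-ins for real cluster data.
    public static void writeCenters(Path dir, String fileName, int k) {
        try {
            Files.createDirectories(dir);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < k; i++) {
                lines.add("center-" + i);
            }
            Files.write(dir.resolve(fileName), lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the centers in every file of dir, mimicking a reader that
    // treats the whole directory as its set of initial clusters.
    public static int countCenters(Path dir) {
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                count += Files.readAllLines(file).size();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Simulates one run: the random seeds go to output/seedDirName, the
    // prior clusters go to output/clusters-0, then clusters-0 is read back.
    public static int simulate(String seedDirName, int k) {
        try {
            Path output = Files.createTempDirectory("output");
            Path clusters0 = output.resolve("clusters-0");
            writeCenters(output.resolve(seedDirName), "part-randomSeed", k);
            writeCenters(clusters0, "part-00000", k);
            return countCenters(clusters0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Seeds share clusters-0 with the prior clusters: prints 10.
        System.out.println(simulate("clusters-0", 5));
        // Seeds kept in their own folder: prints 5.
        System.out.println(simulate("randomSeeds", 5));
    }
}
```

With k = 5, writing the seeds into clusters-0 gives 10 centers on read-back, while writing them into randomSeeds gives 5, which matches the doubled output in the question and the effect of the one-line change.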
Upvotes: 1