Reputation: 617
I'd like to run ELKI k-means clustering from the command line.
The running time seems too short compared with R. When I ran k-means clustering in R, it took about 100 seconds. Moreover, there is no change between k=5, k=10, and so on.
file.tsv has 60,000 rows and 25 columns.
START=$(date +%s)
k=5
java -jar elki.jar KDDCLIApplication \
-dbc.in "file.tsv" \
-dbc.parser NumberVectorLabelParser \
-parser.colsep "\t" \
-algorithm clustering.kmeans.KMeansLloyd \
-kmeans.k $k \
-kmeans.initialization KMeansPlusPlusInitialMeans \
-kmeans.maxiter 9999 \
-resulthandler ResultWriter -out.gzip false \
-out output/k-$k \
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
The output is "It took 5 seconds".
START=$(date +%s)
k=10
java -jar elki.jar ...
...
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
In this case, with k=10, the output is also "It took 5 seconds".
Why does the runtime not change with the number of clusters? Does my code have a problem?
Upvotes: 2
Views: 773
Reputation: 8725
5 seconds is probably too short a runtime to get a reliable measurement. Furthermore, with a larger k the algorithm may converge in fewer iterations.
You may want to use -time
to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible impact. -resulthandler DiscardResultHandler
will disable output, which is also reasonable for benchmarking.
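A benchmarking invocation along these lines might look as follows. This is only a sketch: the jar name, input file, and parser settings are reused from the question, and the -time and DiscardResultHandler options are the ones suggested above.

```shell
# Benchmark the algorithm itself: -time reports ELKI's internal timings,
# and DiscardResultHandler skips writing results to disk, so parsing and
# output overhead no longer dominate the wall-clock measurement.
k=5
java -jar elki.jar KDDCLIApplication \
  -dbc.in "file.tsv" \
  -dbc.parser NumberVectorLabelParser \
  -parser.colsep "\t" \
  -time \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k $k \
  -kmeans.initialization KMeansPlusPlusInitialMeans \
  -resulthandler DiscardResultHandler
```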
You don't need to set -kmeans.maxiter 9999
. By default, ELKI will run k-means until convergence.
I believe there was an implementation weakness in the k-means++ initialization that made it more costly than necessary. Initialization, parsing, and writing may contribute a lot to your total runtime.
Also try using -resulthandler LogResultStructureResultHandler -verbose
to make sure the parser understood the file as intended (check the dimensionality!). With -verbose
you can also check that it performs a reasonable number of iterations.
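A verification run combining these two flags might look like this (again a sketch, reusing the parameters from the question):

```shell
# Verbose run that logs the result structure instead of writing files.
# In the log output, check the parsed dimensionality (should be 25 here)
# and the number of k-means iterations actually performed.
java -jar elki.jar KDDCLIApplication \
  -dbc.in "file.tsv" \
  -dbc.parser NumberVectorLabelParser \
  -parser.colsep "\t" \
  -verbose \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k 5 \
  -resulthandler LogResultStructureResultHandler
```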
Upvotes: 1