Reputation: 617
I'd like to run ELKI k-means clustering from the command line.
The running time seems too short compared with R. When I ran k-means clustering in R, it took about 100 seconds. Moreover, there is no change between k=5, k=10, and so on.
file.tsv has 60,000 rows and 25 columns.
START=$(date +%s)
k=5
java -jar elki.jar KDDCLIApplication \
-dbc.in "file.tsv" \
-dbc.parser NumberVectorLabelParser \
-parser.colsep "\t" \
-algorithm clustering.kmeans.KMeansLloyd \
-kmeans.k $k \
-kmeans.initialization KMeansPlusPlusInitialMeans \
-kmeans.maxiter 9999 \
-resulthandler ResultWriter -out.gzip false \
-out output/k-$k \
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
The output is "It took 5 seconds".
START=$(date +%s)
k=10
java -jar elki.jar ...
...
END=$(date +%s)
DIFF=$(( $END - $START ))
echo "It took $DIFF seconds"
In this case, with k=10, the output is also "It took 5 seconds".
Why does the runtime not change with the number of clusters? Does my code have a problem?
Upvotes: 2
Views: 773
Reputation: 8725
5 seconds is probably too short a runtime to get a reliable measurement. Furthermore, with a larger k the algorithm may converge in fewer iterations.
You may want to use -time
to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible impact. -resulthandler DiscardResultHandler
will disable output, which is also reasonable for benchmarking.
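A benchmarking invocation along these lines might look as follows. This is only a sketch: the jar name, input file, and parser settings are reused from the question, and the -time and DiscardResultHandler options are the ones suggested above.

```shell
# Benchmark the algorithm itself: -time reports ELKI's internal timings,
# and DiscardResultHandler skips writing results to disk, so parsing and
# output overhead no longer dominate the wall-clock measurement.
k=5
java -jar elki.jar KDDCLIApplication \
  -dbc.in "file.tsv" \
  -dbc.parser NumberVectorLabelParser \
  -parser.colsep "\t" \
  -time \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k $k \
  -kmeans.initialization KMeansPlusPlusInitialMeans \
  -resulthandler DiscardResultHandler
```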
You don't need to set -kmeans.maxiter 9999
. By default, ELKI will run k-means until convergence.
I believe there was an implementation weakness in the k-means++ initialization that made it more costly than necessary. Initialization, parsing, and writing may contribute a lot to your total runtime.
Also try using -resulthandler LogResultStructureResultHandler -verbose
to make sure the parser understood the file as intended (check the dimensionality!). With -verbose
you can also check that it performs a reasonable number of iterations.
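A verification run combining these two flags might look like this (again a sketch, reusing the parameters from the question):

```shell
# Verbose run that logs the result structure instead of writing files.
# In the log output, check the parsed dimensionality (should be 25 here)
# and the number of k-means iterations actually performed.
java -jar elki.jar KDDCLIApplication \
  -dbc.in "file.tsv" \
  -dbc.parser NumberVectorLabelParser \
  -parser.colsep "\t" \
  -verbose \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k 5 \
  -resulthandler LogResultStructureResultHandler
```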
Upvotes: 1