Reputation: 1398
I want to select k-means model in terms of 'k' parameter based on the lowest k-means score.
I can find find optimal value of 'k' parameter by hand, writing something like
def clusteringScore0(data: DataFrame, k: Int): Double = {
val assembler = new VectorAssembler().
setInputCols(data.columns.filter(_ != "label")).
setOutputCol("featureVector")
val kmeans = new KMeans().
setSeed(Random.nextLong()).
setK(k).
setPredictionCol("cluster").
setFeaturesCol("featureVector")
val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
val kmeansModel = pipeline.fit(data).stages.last.asInstanceOf[KMeansModel]
kmeansModel.computeCost(assembler.transform(data)) / data.count() }
(20 to 100 by 20).map(k => (k, clusteringScore0(numericOnly, k))).
foreach(println)
Something like this:
val paramGrid = new ParamGridBuilder().addGrid(kmeansModel.k, 20 to 100 by 20).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new KMeansEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(3)
There are Evaluators for regression and classification, but no Evaluator for clustering.
So I should implement Evaluator interface. I am stuck with evaluate
method.
class KMeansEvaluator extends Evaluator {
override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
override def evaluate(data: Dataset[_]): Double = ??? // should I somehow adapt code from KMeansModel.computeCost()?
override val uid = Identifiable.randomUID("cost_evaluator")
}
Upvotes: 4
Views: 3807
Reputation: 2095
Hi ClusteringEvaluator
is available from Spark 2.3.0. You can use to find optimal k values by including ClusteringEvaluator object into your for-loop. You can also find more detail for silhouette analysis in Scikit-learn page. In short, the score should be between [-1,1], the larger score is the better. I have modified a for loop below for your codes.
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val evaluator = new ClusteringEvaluator()
.setFeaturesCol("featureVector")
.setPredictionCol("cluster")
.setMetricName("silhouette")
for(k <- 20 to 100 by 20){
clusteringScore0(numericOnly,k)
val transformedDF = kmeansModel.transform(numericOnly)
val score = evaluator.evaluate(transformedDF)
println(k,score,kmeansModel.computeCost(transformedDF))
}
Upvotes: 6