Kmeans - group by

Question

I want to compute k-means cluster labels with numClusters = 6 so that I can group by those labels later.

How do I select the columns to run k-means on?

val clusterThis =  scaledDF.select($"id",$"setting1",$"setting2",$"setting3")

// dataset description lists six operation modes 
val operatingModes = 6 

// Cluster the data into numClusters classes using KMeans
val numClusters = operatingModes
val numIterations = 20

import sqlContext.implicits._
val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis) 

//... join back on id

Alberto Bonsanto · Accepted Answer

As the KMeans example shows, the estimator reads its features from a single column. In that example the column happens to be named features, but the name is up to you; what matters is that the column is a Vector (dense or sparse).

So you need to combine your feature columns into a single vector column. A VectorAssembler does exactly that.
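
A minimal sketch of that approach, assuming Spark 2.x or later with the DataFrame-based org.apache.spark.ml API (not the RDD-based KMeans.train used in the question). The toy rows standing in for scaledDF, the output column names "features" and "label", the seed, and the KMeansGroupBy object name are all made up for illustration; only id and setting1..setting3 come from the question.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object KMeansGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kmeans-group-by")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for scaledDF; replace with the real data.
    val scaledDF = Seq(
      (1, 0.00, 0.10, 0.20),
      (2, 0.05, 0.12, 0.18),
      (3, 0.50, 0.55, 0.60),
      (4, 0.52, 0.50, 0.61),
      (5, 0.90, 0.95, 0.99),
      (6, 0.92, 0.91, 0.97),
      (7, 0.30, 0.80, 0.10),
      (8, 0.70, 0.20, 0.40)
    ).toDF("id", "setting1", "setting2", "setting3")

    // Combine the numeric columns into one Vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("setting1", "setting2", "setting3"))
      .setOutputCol("features")
    val assembled = assembler.transform(scaledDF)

    // Six clusters and 20 iterations, as in the question.
    val numClusters = 6
    val numIterations = 20
    val kmeans = new KMeans()
      .setK(numClusters)
      .setMaxIter(numIterations)
      .setSeed(1L) // arbitrary seed, only for repeatable runs
      .setFeaturesCol("features")
      .setPredictionCol("label")

    val model = kmeans.fit(assembled)

    // transform() appends the label column and keeps id,
    // so no join back on id is needed.
    val labelled = model.transform(assembled)
    labelled.groupBy("label").count().show()

    spark.stop()
  }
}
```

Because transform() returns the input rows with the label column appended, the "join back on id" step from the question goes away; the id column simply rides along and is never part of the feature vector.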

By the way, k-means doesn't work with categorical features. The post "K-means clustering for mixed numeric and categorical data" explains why.
