Reputation: 17831
I want to compute k-means cluster labels with numClusters = 6
so that I can group by the labels later.
How do I select the columns to run k-means on?
val clusterThis = scaledDF.select($"id",$"setting1",$"setting2",$"setting3")
// dataset description lists six operation modes
val operatingModes = 6
// Cluster the data into six classes (one per operating mode) using KMeans
val numClusters = operatingModes
val numIterations = 20
import sqlContext.implicits._
val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis)
//... join back on id
Upvotes: 2
Views: 1169
Reputation: 18042
As you can see in the KMeans example, the estimator uses just one column as its features. In that example the column happens to be named features as well, but the name is up to you. What matters is that this column is a Vector (dense or sparse).
Thus, you need to combine your features (the separate columns) into a single column. For this task you can use a VectorAssembler.
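A minimal sketch of that approach, assuming Spark 2.x or later and the DataFrame-based org.apache.spark.ml API rather than the RDD-based KMeans.train from the question. The small scaledDF built here is a hypothetical stand-in for your data, and the output column name label is my own choice (the default is prediction):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kmeans-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for scaledDF: an id plus three numeric settings.
val scaledDF = Seq(
  (1, 0.0, 0.0, 0.0), (2, 0.1, 0.0, 0.1),
  (3, 1.0, 1.0, 1.0), (4, 1.1, 0.9, 1.0),
  (5, 2.0, 2.0, 2.0), (6, 2.1, 2.0, 1.9),
  (7, 3.0, 3.0, 3.0), (8, 3.0, 3.1, 2.9),
  (9, 4.0, 4.0, 4.0), (10, 4.1, 4.0, 4.0),
  (11, 5.0, 5.0, 5.0), (12, 5.0, 4.9, 5.1)
).toDF("id", "setting1", "setting2", "setting3")

// Combine the three numeric columns into one Vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("setting1", "setting2", "setting3"))
  .setOutputCol("features")
val assembled = assembler.transform(scaledDF)

// Fit k-means on the assembled vector column.
val kmeans = new KMeans()
  .setK(6)
  .setMaxIter(20)
  .setSeed(1L)
  .setFeaturesCol("features")
  .setPredictionCol("label")
val model = kmeans.fit(assembled)

// transform() appends the cluster label to each row, so id is still present.
val labeled = model.transform(assembled).select("id", "label")
labeled.groupBy("label").count().show()
```

Because transform keeps the original columns, the id stays next to its label and no separate join back on id should be needed.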
By the way, K-means doesn't work with categorical features. The post "K-means clustering for mixed numeric and categorical data" explains why.
Upvotes: 3