oluies

Reputation: 17831

Kmeans - group by

I want to assign k-means cluster labels (numClusters = 6) to my rows so that I can group by those labels later.

How do I select the columns to do kmeans on?

val clusterThis =  scaledDF.select($"id",$"setting1",$"setting2",$"setting3")

// dataset description lists six operation modes 
val operatingModes = 6 

// Cluster the data into one class per operating mode using KMeans
val numClusters = operatingModes
val numIterations = 20

import sqlContext.implicits._
val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis) 

//... join back on id 

Upvotes: 2

Views: 1169

Answers (1)

Alberto Bonsanto

Reputation: 18042

As the KMeans example shows, the estimator reads its input from a single features column. In that example the column happens to be called features, but the name is up to you. What matters is that the column holds a Vector (dense or sparse).

So you need to combine your features (the separate columns) into a single column. You can use a VectorAssembler for this.
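A minimal sketch of that approach, assuming Spark 2.x or later and the DataFrame-based org.apache.spark.ml API, pasted into spark-shell or a notebook. The toy scaledDF below is a hypothetical stand-in for the one in the question, and the column names features and prediction are just the names chosen here.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kmeans-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for scaledDF: 12 rows spread over 6 well-separated groups.
val scaledDF = (0 until 12).map { i =>
  val mode = i % 6
  (i, mode * 10.0, mode * 10.0 + 0.1 * (i / 6), mode * -5.0)
}.toDF("id", "setting1", "setting2", "setting3")

// Dataset description lists six operation modes.
val operatingModes = 6

// Combine the numeric columns into one Vector column; id is left out on purpose.
val assembler = new VectorAssembler()
  .setInputCols(Array("setting1", "setting2", "setting3"))
  .setOutputCol("features")
val assembled = assembler.transform(scaledDF)

// Configure k-means to read the assembled Vector column.
val kmeans = new KMeans()
  .setK(operatingModes)
  .setMaxIter(20)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setSeed(1L)

val model = kmeans.fit(assembled)

// transform keeps every input column (including id) and appends the cluster label.
val labeled = model.transform(assembled)

// Group by the label, as the question intends.
labeled.groupBy("prediction").count().show()
```

Because transform appends the prediction column to the rows it is given, the id column stays next to its label and no separate join back on id should be needed.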

By the way, K-means doesn't work with categorical features, because it relies on distances and means that are only meaningful for numeric values. The post "K-means clustering for mixed numeric and categorical data" explains the reasons in more detail.

Upvotes: 3
