Reputation: 972
I want to know which ID is associated with each cluster center. model.transform(dataset)
assigns a predicted cluster ID to each of my data points, and model.clusterCenters.foreach(println)
prints the cluster centers, but I cannot figure out how to associate the cluster centers with their IDs.
import org.apache.spark.ml.clustering.KMeans
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
val prediction = model.transform(dataset)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Ideally, I want an output such as:
|I.D |cluster center |
==========================
|0 |[0.0,...,0.3] |
|2 |[1.0,...,1.3] |
|1 |[2.0,...,1.3] |
|3 |[3.0,...,1.3] |
It does not seem to me that the println order is sorted by ID. I tried converting model.clusterCenters
into a DataFrame so that I could call transform()
on it, but I couldn't figure out how to convert Array[org.apache.spark.ml.linalg.Vector]
to org.apache.spark.sql.Dataset[_].
Upvotes: 0
Views: 445
Reputation: 4010
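Once you save the model, Spark writes each cluster ID together with its cluster center. You can then read the saved files back to see the desired output. Note that the spark.ml model from the question takes only a path in save(); the save(sc, path) form belongs to the older RDD-based mllib API.
// Persist the fitted model; the cluster centers are written as Parquet under <path>/data.
model.save("/user/hadoop/kmeanModel")
// Read the saved centers back; each row pairs a cluster index with its center.
val parq = spark.read.parquet("/user/hadoop/kmeanModel/data/*")
parq.collect.foreach(println)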
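If you would rather not go through disk, here is a minimal sketch of the same idea done in memory. It relies on the fact that the prediction written by transform() is the index of the nearest center in model.clusterCenters, so a center's array position is its ID. The object name ClusterCenterIds, the helper withIds, the column names "id" and "clusterCenter", and the local[*] master are my own assumptions for illustration, not part of the Spark API or the original question.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object ClusterCenterIds {
  // Hypothetical helper: pairs each center with its position in the array.
  // That position is the value transform() puts in the "prediction" column.
  def withIds(centers: Array[Vector]): Seq[(Int, Vector)] =
    centers.zipWithIndex.map { case (center, id) => (id, center) }.toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("ClusterCenterIds")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same data and model settings as in the question.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
    val model = new KMeans().setK(2).setSeed(1L).fit(dataset)

    // Turn Array[Vector] into a DataFrame with one row per cluster.
    val centers = withIds(model.clusterCenters).toDF("id", "clusterCenter")
    centers.show(false)

    spark.stop()
  }
}
```

Since the result is an ordinary DataFrame, it can also be joined to the output of model.transform(dataset) on prediction = id if you want each point next to its center.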
Upvotes: 1