Spark 2.0 - How to obtain Cluster ID associated with Cluster Center

Question

I want to know which ID is associated with each cluster center. model.transform(dataset) assigns a predicted cluster ID to each of my data points, and model.clusterCenters.foreach(println) prints the cluster centers, but I cannot figure out how to associate the cluster centers with their IDs.

    import org.apache.spark.ml.clustering.KMeans

    // Loads data.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains a k-means model.
    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(dataset)
    val prediction = model.transform(dataset)

    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

Ideally, I want an output such as:

|I.D     |cluster center |
==========================
|0       |[0.0,...,0.3]  |
|2       |[1.0,...,1.3]  |
|1       |[2.0,...,1.3]  |
|3       |[3.0,...,1.3]  |

It does not seem to me that the println order is sorted by ID. I tried converting model.clusterCenters into a DataFrame so that I could call transform() on it, but I couldn't figure out how to convert Array[org.apache.spark.ml.linalg.Vector] to org.apache.spark.sql.Dataset[_].

loneStar · Accepted Answer

Once you save the model, the saved data contains each cluster's ID together with its cluster center. You can read the saved files back to see the desired output:

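    // save(sc, path) is the spark.mllib KMeansModel signature;
    // a spark.ml model, as in the question, is saved with model.save(path).
    model.save(sc, "/user/hadoop/kmeanModel")
    // The model data is written as Parquet files under <path>/data.
    val parq = sqlContext.read.parquet("/user/hadoop/kmeanModel/data/*")
    parq.collect.foreach(println)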
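
As an alternative to round-tripping through disk, the mapping can probably be read straight off the model: to my understanding, the predicted cluster ID is simply the index of the nearest center in model.clusterCenters, so pairing each center with its array index should give the table asked for in the question. The following is a minimal sketch under that assumption, not a definitive implementation. The small in-memory dataset and the column names "id" and "cluster_center" are made up for illustration and stand in for sample_kmeans_data.txt.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kmeans-center-ids")
  .getOrCreate()

// Hypothetical toy data: two well-separated groups of points.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.2),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1), Vectors.dense(9.2, 9.2)
)
val dataset = spark.createDataFrame(points.map(Tuple1.apply)).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(dataset)

// Pair every center with its position in the array; that position is
// assumed to be the cluster ID that transform() writes to "prediction".
val idWithCenter = model.clusterCenters.zipWithIndex
  .map { case (center, id) => (id, center) }
  .toSeq

// Tuples of (Int, Vector) can be turned into a DataFrame directly.
val centers = spark.createDataFrame(idWithCenter).toDF("id", "cluster_center")
centers.show(false)
```

Because centers is an ordinary DataFrame, it can then be joined to model.transform(dataset) on prediction = id if each data point should carry its cluster center.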
