Reputation: 51
I was playing around with the Spark 2.1.0 KMeans clustering algorithm.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ClusteringTest {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("Clustering Test")
                .config("spark.master", "local")
                .getOrCreate();
        session.sparkContext().setLogLevel("ERROR");

        List<Row> rawDataTraining = Arrays.asList(
                RowFactory.create(1.0, Vectors.dense(1.0, 1.0, 1.0).toSparse()),
                RowFactory.create(1.0, Vectors.dense(2.0, 2.0, 2.0).toSparse()),
                RowFactory.create(1.0, Vectors.dense(3.0, 3.0, 3.0).toSparse()),
                RowFactory.create(2.0, Vectors.dense(6.0, 6.0, 6.0).toSparse()),
                RowFactory.create(2.0, Vectors.dense(7.0, 7.0, 7.0).toSparse()),
                RowFactory.create(2.0, Vectors.dense(8.0, 8.0, 8.0).toSparse())
                //...
        );

        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
        });

        Dataset<Row> myRawData = session.createDataFrame(rawDataTraining, schema);
        Dataset<Row>[] splits = myRawData.randomSplit(new double[]{0.75, 0.25});
        Dataset<Row> trainingData = splits[0];
        Dataset<Row> testData = splits[1];

        // Train KMeans
        KMeans kMeans = new KMeans().setK(3).setSeed(100);
        KMeansModel kMeansModel = kMeans.fit(trainingData);
        Dataset<Row> predictions = kMeansModel.transform(testData);
        predictions.show(false);

        MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
                .setLabelCol("label")
                .setPredictionCol("prediction")
                .setMetricName("accuracy");
        double accuracy = evaluator.evaluate(predictions);
        System.out.println("accuracy " + accuracy);
    }
}
The console output is:
+-----+----------------------------+----------+
|label|features |prediction|
+-----+----------------------------+----------+
|2.0 |(3,[0,1,2],[7.0,7.0,7.0]) |2 |
|3.0 |(3,[0,1,2],[11.0,11.0,11.0])|2 |
|3.0 |(3,[0,1,2],[12.0,12.0,12.0])|1 |
|3.0 |(3,[0,1,2],[13.0,13.0,13.0])|1 |
+-----+----------------------------+----------+
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually IntegerType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:75)
at ClusteringTest.main(ClusteringTest.java:84)
Process finished with exit code 1
As you can see, the prediction results are integers, but to use the MulticlassClassificationEvaluator I need these predictions converted to doubles. How can I do that?
Upvotes: 0
Views: 604
Reputation: 35229
TL;DR This is not the way to go.
KMeans is an unsupervised method, and the cluster identifiers it produces are arbitrary (cluster ids can be permuted) and are not related to the label column. As a result, using MulticlassClassificationEvaluator to compare the existing labels with the output of KMeans doesn't make any sense.
You should instead use a supervised classifier, like multinomial logistic regression or Naive Bayes; a rough sketch is shown below.
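For illustration only, here is a minimal sketch of that idea, reusing the trainingData / testData splits built in the question and assuming Spark 2.1.x with org.apache.spark.ml.classification.LogisticRegression (NaiveBayes could be swapped in the same way). Because a classifier's prediction column is a DoubleType, the evaluator call from the question works unchanged:
// Sketch only: reuses the trainingData/testData Datasets from the question.
// A classifier's "prediction" column is DoubleType, so the evaluator accepts it.
LogisticRegression lr = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setFamily("multinomial"); // multinomial logistic regression (Spark >= 2.1)

LogisticRegressionModel lrModel = lr.fit(trainingData);
Dataset<Row> lrPredictions = lrModel.transform(testData);

MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
        .setLabelCol("label")
        .setPredictionCol("prediction")
        .setMetricName("accuracy");

System.out.println("accuracy = " + evaluator.evaluate(lrPredictions));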
If you want to stick with KMeans, please use an appropriate quality metric, like the one returned by computeCost, but keep in mind that it completely ignores the label information.
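As a rough sketch of that alternative, assuming the kMeansModel and testData from the question (in Spark 2.1.x, KMeansModel.computeCost returns the within-cluster sum of squared errors):
// Sketch only: reuses kMeansModel and testData from the question.
// computeCost returns the within-cluster sum of squared errors (WSSSE);
// lower is better, and the label column plays no role.
double wssse = kMeansModel.computeCost(testData);
System.out.println("Within-cluster sum of squared errors = " + wssse);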
Upvotes: 1