Java Padawan

Reputation: 51

Spark 2.1.0 - SparkML requirement failed

I was playing around with the Spark 2.1.0 KMeans clustering algorithm.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ClusteringTest {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("Clustering Test")
                .config("spark.master", "local")
                .getOrCreate();
        session.sparkContext().setLogLevel("ERROR");

        List<Row> rawDataTraining = Arrays.asList(
                RowFactory.create(1.0, Vectors.dense(1.0, 1.0, 1.0).toSparse()),
                RowFactory.create(1.0, Vectors.dense(2.0, 2.0, 2.0).toSparse()),
                RowFactory.create(1.0, Vectors.dense(3.0, 3.0, 3.0).toSparse()),

                RowFactory.create(2.0, Vectors.dense(6.0, 6.0, 6.0).toSparse()),
                RowFactory.create(2.0, Vectors.dense(7.0, 7.0, 7.0).toSparse()),
                RowFactory.create(2.0, Vectors.dense(8.0, 8.0, 8.0).toSparse())
                // ... more rows elided ...
        );

        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
        });

        Dataset<Row> myRawData = session.createDataFrame(rawDataTraining, schema);
        Dataset<Row>[] splits = myRawData.randomSplit(new double[]{0.75, 0.25});
        Dataset<Row> trainingData = splits[0];
        Dataset<Row> testData = splits[1];

        // Train KMeans
        KMeans kMeans = new KMeans().setK(3).setSeed(100);
        KMeansModel kMeansModel = kMeans.fit(trainingData);
        Dataset<Row> predictions = kMeansModel.transform(testData);
        predictions.show(false);

        // Evaluate (this is the line that throws the exception below)
        MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
                .setLabelCol("label")
                .setPredictionCol("prediction")
                .setMetricName("accuracy");
        double accuracy = evaluator.evaluate(predictions);
        System.out.println("accuracy: " + accuracy);
    }
}

The console output is:

+-----+----------------------------+----------+
|label|features                    |prediction|
+-----+----------------------------+----------+
|2.0  |(3,[0,1,2],[7.0,7.0,7.0])   |2         |
|3.0  |(3,[0,1,2],[11.0,11.0,11.0])|2         |
|3.0  |(3,[0,1,2],[12.0,12.0,12.0])|1         |
|3.0  |(3,[0,1,2],[13.0,13.0,13.0])|1         |
+-----+----------------------------+----------+

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually IntegerType.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
    at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:75)
    at ClusteringTest.main(ClusteringTest.java:84)

Process finished with exit code 1

As you can see, the prediction results are Integers. But to use the MulticlassClassificationEvaluator I need these prediction results converted to Double. How can I do it?

Upvotes: 0

Views: 604

Answers (1)

Alper t. Turker

Reputation: 35229

TL;DR This is not the way to go.

KMeans is an unsupervised method, and the cluster identifiers it returns are arbitrary (cluster ids can be permuted between runs) and have no relationship to the label column. As a result, using MulticlassClassificationEvaluator to compare the existing labels with the KMeans output doesn't make any sense.
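For the record, the literal conversion the question asks about is a one-line cast. This is a sketch reusing the predictions and evaluator variables from the question (col comes from org.apache.spark.sql.functions); it makes the evaluator run, but the number it produces is not a meaningful accuracy for the reasons above:

    import static org.apache.spark.sql.functions.col; // at the top of the file

    Dataset<Row> castPredictions = predictions.withColumn(
            "prediction", col("prediction").cast(DataTypes.DoubleType));
    // No exception now, but this compares arbitrary cluster ids to labels.
    System.out.println("accuracy: " + evaluator.evaluate(castPredictions));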

You should use a supervised classifier instead, such as multinomial logistic regression or Naive Bayes.
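For example, a minimal sketch with multinomial logistic regression, reusing the trainingData, testData and evaluator variables from the question (setFamily("multinomial") assumes Spark 2.1.0+; also note that Spark ML expects class labels to start at 0.0):

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;

    LogisticRegression lr = new LogisticRegression()
            .setFamily("multinomial")
            .setLabelCol("label")
            .setFeaturesCol("features");
    LogisticRegressionModel lrModel = lr.fit(trainingData);
    Dataset<Row> lrPredictions = lrModel.transform(testData);
    // A classifier's prediction column is already DoubleType,
    // so the evaluator works without any cast.
    System.out.println("accuracy: " + evaluator.evaluate(lrPredictions));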

If you want to stick with KMeans, use an appropriate clustering quality metric, such as the cost returned by KMeansModel.computeCost, but keep in mind that it ignores the label information completely.
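For example, computeCost returns the within-cluster sum of squared distances (WSSSE); lower is better when comparing models with the same k:

    // Uses only the features column; labels are ignored entirely.
    double wssse = kMeansModel.computeCost(testData);
    System.out.println("Within set sum of squared errors: " + wssse);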

Upvotes: 1
