schoon

Reputation: 3324

SparkException occurs while serializing the StringBuilder in Scala

I am trying to accumulate output into a StringBuilder (for later printing) in Spark/Scala. I try:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics


// Instantiate metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabel)

// Precision by threshold
val precision = metrics.precisionByThreshold
val precisionString = new StringBuilder
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
  precisionString ++= "Threshold: " + t + "Precision: " + p
}

which gives the error:

SparkException: Task not serializable

The only possible solution I can find (creating a serializable lambda function) is:

val serializableStringBuilder = () => new StringBuilder
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
  serializableStringBuilder ++= "Threshold: " + t + "Precision: " + p
}

which gives the error:

<console>:113: error: value ++= is not a member of () => StringBuilder

How do I adapt the definition of the lambda function, or is there a better way of doing this?

Upvotes: 0

Views: 301

Answers (1)

Nikolay Vasiliev

Reputation: 6066

You should drop the StringBuilder in favor of a simple .map() to strings with a later join. As the StringBuilder docs say:

This class is designed for use as a drop-in replacement for StringBuffer in places where the string buffer was being used by a single thread (as is generally the case).

This means it is designed for composing Java strings efficiently in a single-threaded environment; it is not meant to be sent over the network (which any map-reduce task in Spark will do).

Your code might look like:

import org.apache.spark.rdd.RDD

// Build the report lines on the executors, then pull them to the driver.
val thresholdPrecisionRdd: RDD[String] = precision.map { case (t, p) =>
  s"Threshold: $t, Precision: $p"
}
val precisionReport = thresholdPrecisionRdd.collect().mkString("\n")

println(precisionReport)

Please keep in mind that .collect() will transfer the entire RDD to the driver, which might be an issue with big data sets (I assume that in this case it should not be big).
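If the RDD did turn out to be too large to collect at once, a minimal sketch of an alternative (my addition, not part of the original answer) is RDD.toLocalIterator, which pulls one partition at a time to the driver:

// Sketch only, assuming thresholdPrecisionRdd from the snippet above.
// toLocalIterator streams partitions one by one instead of materializing
// the whole RDD in driver memory at once.
thresholdPrecisionRdd.toLocalIterator.foreach(println)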

Upvotes: 1
