Maximiliano Felice

Reputation: 367

Serializing Scalaz Order for Spark

I've noticed that most Scalaz type class instances are not serializable. In this case, I'm trying to use a type class to custom-sort an array in Spark.

A reduced example might look like this:

    val ord = Order.order[T] { ... }
    sc.makeRDD[T](...).groupBy(...).map {
      case (_, grouped) => IList.fromList(grouped.toList).sorted(ord).distinct(ord)
    }

As you might expect, this implementation throws a NotSerializableException because Order[T] is not serializable.

Is there any way to make Order[T] serializable? Ideally I would like to avoid this problem while still using Scalaz, but I'm open to considering other implementations.

Whatever the approach, the custom sorting and distinct implementations have to stay maintainable and extensible.

Upvotes: 2

Views: 141

Answers (1)

evan.oman

Reputation: 5572

If you need access to a non-serializable object inside a Spark task, you can wrap it in a singleton object:

scala> class NotSerializablePrinter { def print(msg:String) = println(msg) }
defined class NotSerializablePrinter

scala> val printer = new NotSerializablePrinter
printer: NotSerializablePrinter = $iwC$$iwC$NotSerializablePrinter@3b8afdbf

scala> val rdd = sc.parallelize(Array("1","2","3"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[24] at parallelize at <console>:30

scala> rdd.foreach(msg => printer.print(msg)) // Fails
org.apache.spark.SparkException: Task not serializable
...

scala> object wrap { val printer = new NotSerializablePrinter }
defined module wrap

scala> rdd.foreach(msg => wrap.printer.print(msg))
1
3
2

In your case you would replace my NotSerializablePrinter instance with your Scalaz Order instance. This example is pulled from this useful article (item 3a).
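Applied to the question, a minimal sketch might look like the following. The object name Ord, the field descInt, and the use of the stock Order[Int] (via scalaz.std.anyVal._) are illustrative choices, not from the original post; the point is only that the Order lives inside a singleton object, so the task closure refers to it statically and never has to serialize the Order itself.

    import scalaz.{IList, Order}
    import scalaz.std.anyVal._   // brings the stock Order[Int] into scope

    // Hypothetical wrapper: because Ord is a singleton object, each executor JVM
    // creates its own Order value on first access instead of receiving a
    // serialized copy from the driver.
    object Ord {
      // Example ordering (descending Int); swap in your own Order[T] here.
      val descInt: Order[Int] = Order[Int].reverseOrder
    }

    // Usage mirroring the question's snippet (assumes an existing SparkContext `sc`):
    // sc.makeRDD(Seq(3, 1, 2, 2))
    //   .groupBy(_ % 2)
    //   .map { case (_, grouped) =>
    //     IList.fromList(grouped.toList).sorted(Ord.descInt).distinct(Ord.descInt)
    //   }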

Upvotes: 6
