Ashkan

Reputation: 1673

Using Class Methods in Spark RDD Operations Throws a Task not serializable Exception

Suppose I have the following class in Spark Scala:

import org.apache.spark.rdd.RDD

class SparkComputation(i: Int, j: Int) {
  def something(x: Int, y: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    // copy the field and the method into locals so the closure
    // passed to map does not have to capture `this`
    val j = this.j
    val something = this.something _
    data.map(something(_, j))
  }
}

I get the Task not serializable Exception when I run the following code:

val s = new SparkComputation(2, 5)
val data = sc.parallelize(0 to 100)
val res = s.processRDD(data).collect

I'm assuming that the exception occurs because Spark is trying to serialize the SparkComputation instance. To prevent this from happening, I have stored the class members I'm using in the RDD operation in local variables (j and something). However, Spark still tries to serialize the SparkComputation object because of the method. Is there any way to pass the class method to map without forcing Spark to serialize the whole SparkComputation class? I know the following code works without any problem:

def processRDD(data: RDD[Int]) = {
  val j = this.j
  val i = this.i
  data.map(x => (x + j) * i)
}

So, the class members that store values are not causing the problem. The problem is with the function. I have also tried the following approach with no luck:

class SparkComputation(i: Int, j: Int) {
  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val i = this.i
    def something(x: Int, y: Int) = (x + y) * i
    data.map(something(_, j))
  }
}

Upvotes: 0

Views: 340

Answers (1)

user6022341


Make the class serializable:

class SparkComputation(i: Int, j: Int) extends Serializable {
  def something(x: Int, y: Int) = (x + y) * i

  def processRDD(data: RDD[Int]) = {
    val j = this.j
    val something = this.something _
    // `this` is still captured via the method reference, but the
    // instance can now be serialized and shipped to the executors
    data.map(something(_, j))
  }
}
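As for why the local-variable trick in the question does not work for methods: to my understanding, `val something = this.something _` eta-expands to a function object whose apply still calls the method on `this`, so the closure shipped by `map` keeps a reference to the whole instance (and, depending on the Scala version, a local `def` is likewise compiled as a method on the enclosing class). If you would rather not make the class serializable, a minimal alternative sketch is to use a function literal, which captures only the locals it uses; the `localI`/`localJ` names here are my own:

class SparkComputation(i: Int, j: Int) {
  def processRDD(data: RDD[Int]) = {
    val localI = i
    val localJ = j
    // a function value, not a method reference: it closes over
    // localI only, so nothing refers back to `this`
    val something = (x: Int, y: Int) => (x + y) * localI
    data.map(x => something(x, localJ))
  }
}

With this version, only the function value and the two Ints end up in the serialized task, so SparkComputation itself no longer needs to extend Serializable.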

Upvotes: 1
