pikkvile

Reputation: 2521

Apache Spark and non-serializable application context

I'm new to Spark.

I want to parallelize my computations using Spark and a map-reduce approach. But these computations, which I put into a PairFunction implementation for the map stage, require some context to be initialized. This context includes several singleton objects from a 3rd-party jar, and these objects are not serializable, so I cannot ship them to the worker nodes and cannot use them in my PairFunction.

So my question is: can I somehow parallelize a job that requires a non-serializable context using Apache Spark? Are there any other solutions? Maybe I can somehow tell Spark to initialize the required context on every worker node?
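To make the problem concrete, here is a minimal sketch of the pattern I mean; ThirdPartyContext and compute are placeholders I made up for this question, not the real class names from the jar:

import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for one of the 3rd-party singletons; note it is NOT Serializable
class ThirdPartyContext {
  def compute(key: String): Int = key.length
}

object Example {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
    val ctx = new ThirdPartyContext  // created on the driver

    // The map closure captures ctx, so Spark has to serialize it for the
    // executors and fails with a NotSerializableException.
    val pairs = sc.parallelize(Seq("a", "b", "c"))
      .map(key => (key, ctx.compute(key)))
    pairs.collect().foreach(println)
  }
}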

Upvotes: 3

Views: 695

Answers (1)

Wilson Liao

Reputation: 638

You can try initializing the objects from your 3rd-party jar on the executors by using mapPartitions or foreachPartition, so the non-serializable context is created on each worker instead of being serialized from the driver.

rdd.foreachPartition { iter =>
  // initialize the non-serializable context here, once per partition,
  // so it is created on the executor instead of shipped from the driver
  val ctx = new XXX()
  iter.foreach { p =>
    // then you can use ctx here for each element
  }
}
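If you need to produce a result RDD instead of just side effects, the same idea works with mapPartitions. A rough sketch, where XXX and compute stand in for your 3rd-party class and whatever method you actually call:

val results = rdd.mapPartitions { iter =>
  // the context is constructed inside the closure, on the executor,
  // so it is never serialized or shipped from the driver
  val ctx = new XXX()
  iter.map { p =>
    (p, ctx.compute(p))  // compute is a placeholder for the real call
  }
}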

Upvotes: 2
