Reputation: 2521
I'm new to Spark.
I want to parallelize my computations using Spark and a map-reduce approach. However, the computations I put into my PairFunction implementation for the map stage require some context to be initialized. This context includes several singleton objects from a 3rd-party jar, and these objects are not serializable, so I cannot ship them to the worker nodes and cannot use them in my PairFunction.
So my question is: can I somehow parallelize a job that requires a non-serializable context using Apache Spark? Are there any other solutions? Maybe I can somehow tell Spark to initialize the required context on every worker node?
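For illustration, here is a minimal Scala sketch of the pattern I mean (ThirdPartyLibrary, getInstance and lookup are placeholders for the actual 3rd-party API, not its real names):

// placeholder for one of the non-serializable 3rd-party singletons
val context = ThirdPartyLibrary.getInstance()

// capturing `context` in the closure makes Spark try to serialize it with the task,
// which fails because the object is not Serializable
val pairs = rdd.map(key => (key, context.lookup(key)))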
Upvotes: 3
Views: 695
Reputation: 638
You can try initializing your 3rd-party objects on the executors by using mapPartitions or foreachPartition.
rdd.foreachPartition { iter =>
  // initialize the non-serializable object here, once per partition on the executor
  val context = new XXX()
  iter.foreach { p =>
    // then you can use context here
  }
}
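If you need to produce a transformed RDD rather than just perform a side effect, a similar sketch with mapPartitions could look like this (XXX and its lookup method are placeholders for your actual 3rd-party classes):

val result = rdd.mapPartitions { iter =>
  // one instance per partition, created on the executor, so nothing has to be serialized
  val context = new XXX()
  iter.map(p => (p, context.lookup(p)))
}

Because the object is created inside the partition function, it never crosses the driver/executor boundary, which is why the serializability of the 3rd-party classes stops mattering.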
Upvotes: 2