Reputation: 186
I have a rather heavy native library which is accessed via a Java/JNI connector and which I would like to use via Spark on a cluster. However, the library needs to be initialized before first use, and this takes about 30 seconds. So my question is whether Spark has some mechanism to pre-initialize such libraries at the beginning of a job, so that this init overhead is not incurred during the actual usage?
Upvotes: 3
Views: 227
Reputation: 29227
question is whether Spark has some mechanism to pre-initialize such libraries at the beginning of a job so that this init overhead is not incurred during the actual usage?
AFAIK, as of now there is no such facility provided by Spark (see SPARK-650), i.e. nothing like an sc.init hook.
However, if you want to pre-initialize via an RDD transformation, you can take an empty RDD before running the job and reset and/or initialize the library on the cluster there.
MapReduce has setup and cleanup methods to initialize and clean up; you can translate map-reduce-style code to Spark in the same way. For example:
Note: an empty RDD can be re-partitioned, so the pattern below works. If you are in a transformation, use mapPartitions; if you are using an action, use foreachPartition.
mapPartitions example:
val rdd = sc.emptyRDD[Int].repartition(sc.defaultParallelism)

rdd.mapPartitions { partition =>
  setup() // library initialization, like the MapReduce setup method
  partition.map { item =>
    val output = process(item) // your logic here
    if (!partition.hasNext) {
      cleanup() // like the MapReduce cleanup method, runs after the last item
    }
    output
  }
} // note: mapPartitions is a lazy transformation; an action must trigger it
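Note that setup(), process() and cleanup() above are placeholders for your own code, not Spark API. A minimal sketch of one way to define them, assuming a hypothetical NativeLib object and native library name; Scala's lazy val guarantees the expensive init runs at most once per executor JVM:
object NativeLib {
  // The ~30 s JNI initialization runs at most once per executor JVM:
  // Scala evaluates a lazy val only on first access, under a thread-safe guard.
  private lazy val ready: Boolean = {
    System.loadLibrary("heavylib") // assumption: your native library's name
    true
  }

  def setup(): Unit = { val _ = ready } // force the one-time initialization
  def process(item: Int): Int = item    // your JNI call would go here
  def cleanup(): Unit = ()              // release native resources if needed
}
With such an object you would call NativeLib.setup() (or import NativeLib._) inside mapPartitions.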
foreachPartition example:
if (rdd.isEmpty) {
  rdd.repartition(sc.defaultParallelism).foreachPartition(_ => yourUtils.initOnce())
}
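yourUtils.initOnce() is likewise a placeholder, not a Spark API. A minimal sketch of such a once-per-JVM initializer, using an AtomicBoolean so that concurrent partition tasks on the same executor cannot run the init twice:
import java.util.concurrent.atomic.AtomicBoolean

object yourUtils {
  private val done = new AtomicBoolean(false)

  def initOnce(): Unit =
    if (done.compareAndSet(false, true)) {
      // the ~30 s JNI initialization goes here; every later call is a no-op
    }
}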
mapPartitions (a transformation) and foreachPartition (an action) can be used in the way explained above. Also check this: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Also see my answers with a detailed explanation of mapPartitions: Apache Spark: map vs mapPartitions & spark-foreach-vs-foreachpartitions
Upvotes: 1