Klingon

Reputation: 186

Can Spark pre-initialize heavy-duty third-party libraries?

I have a rather heavyweight library which is accessed via a Java/JNI connector and which I would like to use with Spark on a cluster. However, the library needs to be initialized before first use, and this takes about 30 seconds. So my question is whether Spark has some mechanism to pre-initialize such libraries at the beginning of a job, so that this initialization overhead is not incurred during actual usage?

Upvotes: 3

Views: 227

Answers (1)

Ram Ghadiyaram

Reputation: 29227

"question is whether Spark has some mechanism to pre-initialize such libraries at the beginning of a job so that this init overhead is not necessary for the actual usage?"

AFAIK, as of now there is no such facility provided by Spark; see SPARK-650, which asks for something like sc.init...


However, if you want to pre-initialize via an RDD transformation, you can take an empty RDD before running the job and use it to reset and/or initialize the cluster...

MapReduce has setup and cleanup methods for initialization and cleanup; you can convert MapReduce-style code to Spark in the same way, for example:

Note: an empty RDD can be re-partitioned. The snippet below uses a transformation (mapPartitions); if you want an action instead, you can use foreachPartition (second example).

mapPartitions example:

val rdd = sc.emptyRDD[Int].repartition(sc.defaultParallelism)
rdd.mapPartitions { partition =>
  setup() // library initialization, like the MapReduce setup() method
  partition.map { item =>
    val output = process(item) // your logic here
    if (!partition.hasNext) {
      // cleanup here, like the MapReduce cleanup() method
    }
    output
  }
}
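
One caveat: mapPartitions is a lazy transformation, so setup() only actually runs once an action is invoked on the resulting RDD. A minimal way to force the warm-up pass (reusing the rdd and the placeholder setup() from the snippet above) could be:

    rdd.mapPartitions { partition => setup(); partition }.count() // count() is an action, so setup() runs on every partition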

foreachPartition example:

if (rdd.isEmpty) {
  rdd.repartition(sc.defaultParallelism).foreachPartition(_ => yourUtils.initOnce())
}
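
The yourUtils.initOnce() helper above is not shown; a minimal sketch of what such a guard could look like (yourUtils and nativeLib.init() are assumed names for illustration, not part of any real API) is:

object yourUtils {
  // A lazy val is initialized at most once per JVM, and Scala's lazy
  // initialization is thread-safe, so concurrent tasks on the same
  // executor block until the first initialization finishes.
  private lazy val initialized: Boolean = {
    nativeLib.init() // hypothetical JNI entry point of the heavy library (~30s)
    true
  }

  def initOnce(): Unit = { initialized; () } // forces the lazy val on first call
}

Because a Scala object lives once per JVM, each executor pays the 30-second initialization at most once, and later tasks on that executor reuse the already-initialized library.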

NOTE: mapPartitions (a transformation) and foreachPartition (an action) can both be used in the way explained above.

Upvotes: 1
