Reputation: 24561
If I have a custom library (coded in Scala but it internally calls native libraries via JNI), what is a way to deploy it to Apache Spark worker nodes so it can be used by other applications in the cluster? Basically, I want to extend Spark with my custom functionality so that any job can use it.
As far as I understand, spark-submit is for submitting jobs, so that is not what I want.
If I pack the native libraries in a jar, is SparkContext.addJar() going to do the trick? I would have to unpack the native libraries at runtime to some temp directory for that to work - is that even an option in a Spark environment?
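To make it concrete, the unpack-and-load step I have in mind would look roughly like the sketch below (the resource path and library name are made up); I just don't know whether this pattern is reliable on Spark executors:

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Sketch only: copy a native library bundled in the jar out to a temp file,
// then load it by absolute path. "/native/libmylib.so" is a placeholder.
object NativeLoader {
  def loadFromJar(resourcePath: String): Unit = {
    val in = getClass.getResourceAsStream(resourcePath)
    require(in != null, s"resource not found: $resourcePath")
    val tmp = File.createTempFile("libmylib", ".so")
    tmp.deleteOnExit()
    try Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    // System.load takes an absolute path, unlike System.loadLibrary.
    System.load(tmp.getAbsolutePath)
  }
}

// e.g. NativeLoader.loadFromJar("/native/libmylib.so") before the first native call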
Thanks in advance.
Upvotes: 2
Views: 2268
Reputation: 13927
spark-submit takes a couple of parameters that are of interest: --packages and --jars. You can add your custom .jar with --jars, and you can pass Maven coordinates to --packages. Something like:
spark-submit ... --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.apache.kafka:kafka_2.10:0.8.2.0 --jars localpath/to/your.jar
These work in the spark-shell too, so you can deploy your custom jar files and any external dependencies when you use the REPL.
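For example, an equivalent spark-shell invocation might look like this (the jar path is a placeholder):
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 --jars localpath/to/your.jar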
If you have a particularly large jar file, you can use SparkContext.addJar to add it to the context. However, this is a pain to maintain. To really do it efficiently, you would need to deploy the JAR file to HDFS and ensure that HDFS replicates it amongst all of your nodes -- if HDFS only has the JAR file on one node, you are right back where you started. And then what do you do about version control? If you change the JAR file, you most likely need to keep the old one around in case any jobs were coded against it, so you will need to have multiple versions in HDFS. Are you going to have the other jobs recompile to use a new version? The nice thing about --packages and --jars is that all of that mess is handled for you.
But assuming your custom JAR is big enough to warrant that, yes, you can include it via SparkContext.addJar; however, as I said, it's not the standard way to do it. Even semi-core extensions to Spark, like spark-streaming-kafka, are delivered via the --packages option.
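If you do decide the jar is big enough to justify addJar, the call itself is a one-liner; here is a rough sketch (the application name and HDFS path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of adding a pre-deployed jar at runtime; the path is a placeholder.
val conf = new SparkConf().setAppName("uses-custom-lib")
val sc = new SparkContext(conf)

// addJar accepts local paths as well as hdfs://, http://, ftp:// and local:/ URIs;
// executors fetch the jar before running tasks for this context.
sc.addJar("hdfs:///libs/your-custom-lib-1.0.0.jar")

Note that this only ships the jar for that particular SparkContext/application; it does not install anything cluster-wide.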
Upvotes: 3