Reputation: 24561
If I have a custom library (coded in Scala but it internally calls native libraries via JNI), what is a way to deploy it to Apache Spark worker nodes so it can be used by other applications in the cluster? Basically, I want to extend Spark with my custom functionality so that any job can use it.
As far as I understand, spark-submit is for submitting jobs, so that is not what I want.
If I pack the native libraries in a jar, is SparkContext.addJar() going to do the trick? I would have to unpack the native libraries at runtime to some temp directory for that to work - is that even an option in a Spark environment?
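To make it concrete, the unpack-and-load step I have in mind would look roughly like the sketch below (the resource path and library name are made up); I just don't know whether this pattern is reliable on Spark executors:

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Sketch only: copy a native library bundled in the jar out to a temp file,
// then load it by absolute path. "/native/libmylib.so" is a placeholder.
object NativeLoader {
  def loadFromJar(resourcePath: String): Unit = {
    val in = getClass.getResourceAsStream(resourcePath)
    require(in != null, s"resource not found: $resourcePath")
    val tmp = File.createTempFile("libmylib", ".so")
    tmp.deleteOnExit()
    try Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    // System.load takes an absolute path, unlike System.loadLibrary.
    System.load(tmp.getAbsolutePath)
  }
}

// e.g. NativeLoader.loadFromJar("/native/libmylib.so") before the first native call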
Thanks in advance.
Upvotes: 2
Views: 2268
Reputation: 13927
spark-submit takes a couple of parameters that are of interest: --packages and --jars. You can add your custom .jar with --jars, and you can pass Maven coordinates to --packages. Something like:
spark-submit ... --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.apache.kafka:kafka_2.10:0.8.2.0 --jars localpath/to/your.jar
These work in the spark-shell too, so you can deploy your custom jar files and any external dependencies when you use the REPL.
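For example, an equivalent spark-shell invocation might look like this (the jar path is a placeholder):
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0 --jars localpath/to/your.jar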
If you have a particularly large jar file, you can use SparkContext.addJar to add it to the context. However, this is a pain to maintain. To really do it efficiently, you would need to deploy the JAR file to HDFS and ensure that HDFS replicates it amongst all of your nodes -- if HDFS only has the JAR file on one node, you are right back where you started. And then what do you do about version control? If you change the JAR file, you most likely need to keep the old one around in case any jobs were coded against it, so you will need to have multiple versions in HDFS. Are you going to have the other jobs recompile to use a new version? The nice thing about --packages and --jars is that all of that mess is handled for you.
But assuming your custom JAR is big enough to warrant that, yes, you can include it via SparkContext.addJar; however, as I said, it's not the standard way to do it. Even semi-core extensions to Spark, like spark-streaming-kafka, are delivered via the --packages option.
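If you do decide the jar is big enough to justify addJar, the call itself is a one-liner; here is a rough sketch (the application name and HDFS path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of adding a pre-deployed jar at runtime; the path is a placeholder.
val conf = new SparkConf().setAppName("uses-custom-lib")
val sc = new SparkContext(conf)

// addJar accepts local paths as well as hdfs://, http://, ftp:// and local:/ URIs;
// executors fetch the jar before running tasks for this context.
sc.addJar("hdfs:///libs/your-custom-lib-1.0.0.jar")

Note that this only ships the jar for that particular SparkContext/application; it does not install anything cluster-wide.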
Upvotes: 3