kz28

Reputation: 793

How to use a modified Spark MLlib module as a dependency?

I want to build a Spark application JAR. My expectation is that when I execute the JAR via ./spark-submit, the application will use my own build of MLlib (e.g. spark-mllib_2.11-2.2.0-SNAPSHOT.jar).

This is my build.sbt:

name := "SoftmaxMNIST"
version := "1.0"
scalaVersion := "2.11.4"

unmanagedJars in Compile += file("lib/spark-mllib_2.11-2.2.0-SNAPSHOT.jar")

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0"
)

// META-INF discarding
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
  }
}

I already dropped my own-built spark-mllib_2.11-2.2.0-SNAPSHOT.jar into the /My-Project-Path/lib/ directory, but it does not work. It seems the application is still using Spark's default MLlib jar, which in my case is in the PATH/spark-2.1.0-bin-hadoop2.7/jars/ directory.

PS: The ultimate goal is that when I run my application on AWS EC2, it always uses my own-built MLlib instead of the default one, since I may modify my MLlib frequently.

Can anyone help me solve this? Thanks in advance!

Upvotes: 0

Views: 119

Answers (1)

Jacek Laskowski

Reputation: 74679

The answer depends on how you do spark-submit. You have to "convince" (i.e. modify) spark-submit to see your modified jar rather than the one in SPARK_HOME.

The quickest (though not necessarily the easiest in the long run) approach is to include the Spark jars, including the one you've modified, in your uberjar (aka fat jar). You already seem to be using the sbt-assembly plugin in your sbt project, so it's just a matter of running publishLocal for the dependency (or putting it into the lib directory), adding it to libraryDependencies in your project, and letting assembly do the rest.
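As a rough sketch of that route (the version numbers and the lib path are taken from the question; the publishLocal alternative is an assumption about how the modified MLlib build gets published):

// build.sbt sketch for the fat-jar route; versions/paths follow the question
// Option A: keep the modified jar as an unmanaged dependency in lib/
unmanagedJars in Compile += file("lib/spark-mllib_2.11-2.2.0-SNAPSHOT.jar")

// Option B: run `publishLocal` in the modified MLlib build first, then
// depend on the SNAPSHOT like any managed dependency:
// libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0-SNAPSHOT"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)

// `sbt assembly` then bundles the Spark jars, including the modified MLlib,
// into a single uberjar that spark-submit can run.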

That will, however, give you a really huge fat jar, which in a heavy development cycle with lots of compilation, testing and deployment can make the process terribly slow.

The other approach is to use your custom build of Apache Spark (with the modified Spark MLlib library included). After you mvn install it, you'll have your custom Spark ready to use. Run spark-submit from that custom version and it should work. You don't have to include the jar in your fat jar, and perhaps you won't need the sbt-assembly plugin at all (a mere sbt package should work).
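A minimal sketch of what the project side could look like in that setup, assuming the custom Spark was installed into the local Maven repository as version 2.2.0-SNAPSHOT (the version number is an assumption):

// build.sbt sketch for the custom-Spark route; the 2.2.0-SNAPSHOT version
// and the use of the local Maven repository are assumptions.
resolvers += Resolver.mavenLocal

// "provided" keeps the Spark jars out of your package; at runtime they come
// from the custom Spark installation whose spark-submit you invoke.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.2.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-sql"   % "2.2.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.2.0-SNAPSHOT" % "provided"
)

// With this, a plain `sbt package` produces a small jar; submit it with the
// custom build's bin/spark-submit so the modified MLlib is on the classpath.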

That approach has the benefit of making your deployable Spark application package smaller, and it keeps the custom Spark separate from the development process. Use an internal library repository to publish to and depend on.

Upvotes: 1
