Reputation: 9169
Still trying to get familiar with Maven and with compiling my source code into jar files for spark-submit. I know how to do this in IntelliJ, but would like to understand how it actually works. I have an EC2 server with all of the latest software, such as Spark and Scala, already installed, and I have the example SparkPi.scala source code that I would now like to compile with Maven. My silly questions are: firstly, can I just use my installed software to build the code rather than retrieving dependencies from the Maven repository, and how do I start off with a basic pom.xml template for adding the appropriate requirements? I don't fully understand what exactly Maven is doing, and how can I just test a compilation of my source code?
As I understand it, I just need to have the standard directory structure src/main/scala and then run mvn package. Also, I would like to build with Maven rather than sbt.
Upvotes: 1
Views: 3851
Reputation: 509
In addition to @Krishna's answer:
If you have a Maven project, run mvn clean package from the directory containing pom.xml. Make sure you have the following build section in your pom.xml to build a fat jar. (This is my case, how I'm making the jar.)
<build>
  <sourceDirectory>src</sourceDirectory>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.0</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>assemble-all</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
For more detail: link
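Since the question asks for a basic pom.xml starting point, here is a minimal sketch for a Scala/Spark project. The group/artifact IDs and versions are placeholders to adjust for your setup; the scala-maven-plugin does the actual Scala compilation (the maven-compiler-plugin above only compiles Java), and spark-core is marked provided because spark-submit supplies Spark at runtime. The assembly plugin from the build section above can be added alongside it to get the fat jar.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- placeholders: use your own coordinates -->
  <groupId>com.example</groupId>
  <artifactId>spark-pi</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- provided scope: the cluster supplies Spark, so it stays out of the fat jar -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <!-- compiles Scala sources under src/main/scala -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```

Maven resolves the declared dependencies from the central repository into a local cache (~/.m2) on first build, regardless of what is installed on the machine; that is why the pom, not your local Spark install, drives compilation.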
If you have an sbt project, use sbt clean assembly to build the fat jar. For that you need the following config, as an example in build.sbt:
assemblyJarName := "WordCountSimple.jar"

val meta = """META.INF(.)*""".r
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs@_*) => MergeStrategy.first
  case PathList(ps@_*) if ps.last endsWith ".html" => MergeStrategy.first
  case n if n.startsWith("reference.conf") => MergeStrategy.concat
  case n if n.endsWith(".conf") => MergeStrategy.concat
  case meta(_) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Also add a plugins.sbt in the project directory, like:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
Up to this point, the main goal is to get a fat jar with all dependencies into the target folder. Use that jar to run on the cluster like this:
hastimal@nm:/usr/local/spark$ ./bin/spark-submit --class com.hastimal.wordcount --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300 --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/wordcount.jar inputRDF/data_all.txt /output
Here inputRDF/data_all.txt and /output are the two args. Also, from a tooling point of view, I'm building in IntelliJ as my IDE.
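To tie this back to the question's standard layout: with the directory names the asker mentioned, the project would look roughly like this (project and file names are illustrative), and mvn package run from the project root leaves the fat jar under target/:

```
myproject/
├── pom.xml                          <- build descriptor
├── src/main/scala/SparkPi.scala     <- source in the standard Maven layout
└── target/                          <- created by the build
    └── myproject-1.0-jar-with-dependencies.jar
```

The jar-with-dependencies suffix comes from the assembly descriptor configured above; that is the jar to hand to spark-submit.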
Upvotes: 1
Reputation: 1702
Please follow the steps below:
# create assembly jar upon code change
sbt assembly
# transfer the jar to a cluster
scp target/scala-2.10/myproject-version-assembly.jar <some location in your cluster>
# fire spark-submit on your cluster
$SPARK_HOME/bin/spark-submit --class not.memorable.package.application.class --master yarn --num-executors 10 \
--conf some.crazy.config=xyz --executor-memory=lotsG \
myproject-version-assembly.jar \
<glorious-application-arguments...>
Upvotes: 0