Compiling Spark Scala Program into jar file using installed spark and maven

Question

I'm still getting familiar with Maven and with compiling my source code into jar files for spark-submit. I know how to do this in IntelliJ, but I would like to understand how it actually works. I have an EC2 server with the latest software, including Spark and Scala, already installed, and I have the example SparkPi.scala source code that I would now like to compile with Maven. My questions are: first, can I just use my installed software to build the code rather than retrieving dependencies from the Maven repository? Second, how do I start from a basic pom.xml template and add the appropriate requirements? I don't fully understand what Maven is actually doing, or how I can simply test a compilation of my source code. As I understand it, I just need the standard directory structure src/main/scala and can then run mvn package. Also, I would like to do this with Maven rather than sbt.

ChikuMiku · Accepted Answer

In addition to @Krishna's answer: if you have a Maven project, run mvn clean package in the directory that contains pom.xml. Make sure your pom.xml has the following build section so that a fat jar is produced. (This is how I build my jar.)

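<build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>assemble-all</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>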
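As a rough sketch of what the build produces, assuming the build section above sits inside a complete pom.xml that keeps Maven's default final name: mvn clean package writes an extra jar into target/ whose name ends in -jar-with-dependencies. The helper below only computes that expected file name. The artifactId and version values (wordcount, 1.0) are hypothetical placeholders, not taken from my project.

```shell
#!/bin/sh
# Hypothetical project coordinates; replace them with the <artifactId> and
# <version> from your own pom.xml.
ARTIFACT_ID="wordcount"
VERSION="1.0"

# With the default finalName (artifactId-version), the assembly plugin
# appends the descriptor id "jar-with-dependencies" to the jar name.
fat_jar_path() {
  printf 'target/%s-%s-jar-with-dependencies.jar\n' "$1" "$2"
}

FAT_JAR="$(fat_jar_path "$ARTIFACT_ID" "$VERSION")"
echo "Expected fat jar: $FAT_JAR"

# The build itself, run from the directory that contains pom.xml:
#   mvn clean package
# Then check that the jar was produced:
#   ls -lh "$FAT_JAR"
```

Note that maven-compiler-plugin compiles Java sources only. For .scala files the pom also needs a Scala compiler plugin (scala-maven-plugin is a common choice), otherwise the jar will not contain your compiled Scala classes.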

For more detail: link

If you have an sbt project, use sbt clean assembly to make a fat jar. For that you need the following config in build.sbt, for example:

assemblyJarName := "WordCountSimple.jar"
// Matches META-INF entries ("." matches any character, so META.INF also covers META-INF)
val meta = """META.INF(.)*""".r

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs@_*) => MergeStrategy.first
  case PathList(ps@_*) if ps.last endsWith ".html" => MergeStrategy.first
  case n if n.startsWith("reference.conf") => MergeStrategy.concat
  case n if n.endsWith(".conf") => MergeStrategy.concat
  case meta(_) => MergeStrategy.discard
  case x => MergeStrategy.first
}

Also add the plugin in project/plugin.sbt, like this:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

For more, see this and this.

Up to this point, the main goal is to get a fat jar with all dependencies in the target folder. Use that jar to run on the cluster like this:

hastimal@nm:/usr/local/spark$ ./bin/spark-submit \
    --class com.hastimal.wordcount \
    --master yarn-cluster \
    --num-executors 15 \
    --executor-memory 52g \
    --executor-cores 7 \
    --driver-memory 52g \
    --driver-cores 7 \
    --conf spark.default.parallelism=105 \
    --conf spark.driver.maxResultSize=4g \
    --conf spark.network.timeout=300 \
    --conf spark.yarn.executor.memoryOverhead=4608 \
    --conf spark.yarn.driver.memoryOverhead=4608 \
    --conf spark.akka.frameSize=1200 \
    --conf spark.io.compression.codec=lz4 \
    --conf spark.rdd.compress=true \
    --conf spark.broadcast.compress=true \
    --conf spark.shuffle.spill.compress=true \
    --conf spark.shuffle.compress=true \
    --conf spark.shuffle.manager=sort \
    /users/hastimal/wordcount.jar inputRDF/data_all.txt /output

Here inputRDF/data_all.txt and /output are the two arguments passed to the program. As for tooling, I build in IntelliJ as my IDE.
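The ordering in that command matters: spark-submit options come first, then the application jar, then the program's own arguments. A minimal sketch of a wrapper that makes this explicit, reusing the class name, jar path and arguments from the command above with only a subset of its options. The function names and the DRY_RUN variable are hypothetical additions for illustration; with DRY_RUN=1 (the default here) the command is printed instead of executed.

```shell
#!/bin/sh
# Sketch only: a subset of the options from the spark-submit command above.
SPARK_HOME="${SPARK_HOME:-/usr/local/spark}"

# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 = fat jar, $2 = input path, $3 = output path
submit_wordcount() {
  run_or_print "$SPARK_HOME/bin/spark-submit" \
    --class com.hastimal.wordcount \
    --master yarn-cluster \
    --num-executors 15 \
    --executor-memory 52g \
    --executor-cores 7 \
    --conf spark.default.parallelism=105 \
    "$1" "$2" "$3"
}

# Everything after the jar is handed to the program's main() as its args.
submit_wordcount /users/hastimal/wordcount.jar inputRDF/data_all.txt /output
```

Set DRY_RUN=0 to actually submit. On Spark 2.0 and later, --master yarn-cluster is deprecated in favor of --master yarn --deploy-mode cluster.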
