Jonathan Garvey

Reputation: 151

Trouble passing application argument to spark-submit with scala

I'm pretty new to functional programming and don't have an imperative programming background. I'm running through some basic Scala/Spark tutorials online and having some difficulty submitting a Scala application through spark-submit.

In particular I'm getting a java.lang.ArrayIndexOutOfBoundsException: 0, which I researched and found means the array access at index 0 is the culprit. Looking into it further, I saw that some basic debugging could tell me whether the main application was actually picking up the argument at runtime - which it was not. Here is the code:

import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]) {

    try {
      //program works fine if path to file is hardcoded
      //val logfile = "C:\\Users\\garveyj\\Desktop\\NetSetup.log"
      val logfile = args(0)
      val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[*]")
      val sc = new SparkContext(conf)
      val logdata = sc.textFile(logfile, 2).cache()
      val numFound = logdata.filter(line => line.contains("found")).count()
      val numData = logdata.filter(line => line.contains("data")).count()
      println("")
      println("Lines with found: %s, Lines with data: %s".format(numFound, numData))
      println("")
    }
    catch {
      case aoub: ArrayIndexOutOfBoundsException => println(args.length)
    }
  }
}

To submit the application using spark-submit I use:

spark-submit --class SparkMeApp --master "local[*]" --jars target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log

...where NetSetup.log is in the same directory from which I'm submitting the application. The output of the application is simply: 0. If I remove the try/catch, the output is:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at SparkMeApp$.main(SparkMeApp.scala:12)
        at SparkMeApp.main(SparkMeApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It's worth pointing out that the application runs fine if I remove the argument and hard-code the path to the log file. I don't really know what I'm missing here. Any direction would be appreciated. Thanks in advance!

Upvotes: 0

Views: 4611

Answers (3)

Jonathan Garvey

Reputation: 151

--Problem solved-- I was using the spark-submit command incorrectly. Once I removed '--jars' from the command, spark-submit picked up the Scala application argument.
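
The corrected command, keeping the same class, master, and jar path as in the question, would look roughly like this (the jar is now passed as the application jar rather than through --jars):

spark-submit --class SparkMeApp --master "local[*]" target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log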

Upvotes: 0

Knight71

Reputation: 2959

You are invoking spark-submit incorrectly. The correct command is

./spark-submit --class SparkMeApp --master "local[*]" \
example.jar examplefile.txt

You need to pass --jars only if there is an external dependency and you want to distribute that jar to all of the executors.
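
For illustration only, a submit with an external dependency might look something like the following, where some-dependency.jar is a hypothetical library jar to be shipped to the executors:

spark-submit --class SparkMeApp --master "local[*]" --jars some-dependency.jar firstsparkapplication_2.10-1.0.jar NetSetup.log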

If you had set the log level to INFO/WARN in log4j.properties, you could have easily caught it:

Warning: Local jar /home/user/Downloads/spark-1.4.0/bin/NetSetup.log does not exist, skipping.
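
As a reference, a minimal sketch of how that log level is usually set, assuming the stock Spark 1.x layout where conf/log4j.properties.template (which already defines the console appender) is copied to conf/log4j.properties:

# conf/log4j.properties -- log everything at INFO and above to the console
log4j.rootCategory=INFO, console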

Upvotes: 1

andriosr

Reputation: 491

The text file should be in HDFS (if you are using Hadoop) or whatever other DFS you are using to support Spark if you want to pass relative paths for the application to read the data. So you should put the file into the DFS for your application to work; otherwise, pass the absolute path from your OS file system.

Look here for instructions on how to add files to HDFS, and see this related discussion, which might help you.
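
As a rough sketch of that approach, assuming HDFS is configured as the default filesystem and using a hypothetical /user/garveyj home directory, you would copy the file in and then point the application at the HDFS path:

hdfs dfs -put NetSetup.log /user/garveyj/NetSetup.log
spark-submit --class SparkMeApp --master "local[*]" firstsparkapplication_2.10-1.0.jar hdfs:///user/garveyj/NetSetup.log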

Also, you are setting the master to be used by the application twice: in the Spark conf (setMaster("local[*]")):

val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[*]")

and in the submit (--master "local[*]"):

spark-submit --class SparkMeApp --master "local[*]" --jars target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log

You only need to set it once; choose one of them.
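
For example, a minimal sketch of the conf with the hard-coded master removed, so that --master on the spark-submit command line decides where the application runs:

import org.apache.spark.{SparkConf, SparkContext}

// Master is not set here; it is supplied via --master at submit time
val conf = new SparkConf().setAppName("SparkMe Application")
val sc = new SparkContext(conf)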

Upvotes: 0
