aironman

Reputation: 869

How to pass arguments to spark-submit using docker

I have a Docker Spark cluster running on my laptop with a master and three workers. I can launch the typical wordcount example by pointing spark-submit at the master's address, with a command like this:

bash-4.3# spark/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount --master spark://spark-master:7077 /opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar /opt/spark-data/README.md /opt/spark-data/output-5

I can see that the output files have been generated inside output-5,

but when I try to launch the process from outside the cluster, using this command:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5" spark-submit:2.4.0

Where

echo $SPARK_APPLICATION_JAR_LOCATION
/opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar

echo $SPARK_APPLICATION_MAIN_CLASS
com.oreilly.learningsparkexamples.mini.scala.WordCount

When I open the page of the worker where the task is attempted, I can see that line 11, the very first line where the path of the first argument is read, fails with an error like this:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at com.oreilly.learningsparkexamples.mini.scala.WordCount$.main(WordCount.scala:11)

Clearly, position zero of the args array does not contain the path of the first parameter, the input file on which I want to run the wordcount.

The question is: why is Docker not passing the arguments supplied through -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5"?
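A quick way to check which variables the image actually reads is to look at its launcher script. This is only a diagnostic sketch; the script path /submit.sh is a guess and should be replaced with whatever docker inspect reports:

# Show the image's entrypoint/command to locate the launcher script
docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' spark-submit:2.4.0

# Print the launcher script and look for the SPARK_APPLICATION_* variables it expands
docker run --rm --entrypoint cat spark-submit:2.4.0 /submit.sh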

I have already run the job the traditional way, logging into the spark-master driver and running the spark-submit command there, but when I try to run the task with Docker, it doesn't work.

It must be something trivial, but I still don't have a clue. Can anybody help me?

SOLVED

I have to use a command like this:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

To sum up, I had to change -e APP_ARGS to --env SPARK_APPLICATION_ARGS.

-e APP_ARGS looked like the usual Docker way of passing them, but since -e and --env are synonyms, what actually mattered was the variable name: this image only reads SPARK_APPLICATION_ARGS.

Upvotes: 1

Views: 1183

Answers (1)

aironman

Reputation: 869

This is the command that solves my problem:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

I had to use --env SPARK_APPLICATION_ARGS="args1 args2 argsN" instead of -e APP_ARGS="args1 args2 argsN".
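The reason the variable name matters is that the image's launcher script builds the spark-submit command only from the SPARK_APPLICATION_* variables, so anything passed as APP_ARGS is silently ignored. Here is a minimal sketch of what such a script might plausibly look like (this is an assumption about the image, not its actual source; the spark-submit path and the SPARK_MASTER_URL default are guesses):

#!/bin/bash
# Hypothetical launcher baked into the spark-submit:2.4.0 image:
# it only reads the SPARK_APPLICATION_* variables, so an env var named
# APP_ARGS never reaches spark-submit and the jar gets an empty args array.
/spark/bin/spark-submit \
  --class "${SPARK_APPLICATION_MAIN_CLASS}" \
  --master "${SPARK_MASTER_URL:-spark://spark-master:7077}" \
  "${SPARK_APPLICATION_JAR_LOCATION}" \
  ${SPARK_APPLICATION_ARGS}

In this sketch SPARK_APPLICATION_ARGS is expanded unquoted on purpose, so that "args1 args2 argsN" splits into separate arguments for the main class.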

Upvotes: 1
