Reputation: 869
I have a Docker setup running on my laptop with a Spark master and three workers. I can launch the typical word count example by pointing spark-submit at the master, using a command like this:
bash-4.3# spark/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount --master spark://spark-master:7077 /opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar /opt/spark-data/README.md /opt/spark-data/output-5
I can see that the output files are generated inside output-5.
But when I try to launch the job from outside the cluster, using this command:
docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5" spark-submit:2.4.0
Where
echo $SPARK_APPLICATION_JAR_LOCATION
/opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar
echo $SPARK_APPLICATION_MAIN_CLASS
com.oreilly.learningsparkexamples.mini.scala.WordCount
And when I open the page of the worker where the task was attempted, I can see that it fails at line 11, the very first statement, where the path for the first argument is read, with an error like this:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at com.oreilly.learningsparkexamples.mini.scala.WordCount$.main(WordCount.scala:11)
Clearly, position zero of the arguments array does not contain the path of the first parameter, the input file I want to run the word count on.
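That exception means main() received an empty args array: the jar was submitted without any application arguments at all, roughly as if the command inside the container had been (a sketch for illustration only):
spark/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount --master spark://spark-master:7077 /opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar
# nothing follows the jar, so args has no element 0 and args(0) throws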
The question is: why is Docker not using the arguments passed through -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5"?
I have already run the job the traditional way, logging into the driver (spark-master) and running the spark-submit command there, but when I try to run the task with Docker it doesn't work.
It must be something trivial, but I still don't have a clue. Can anybody help me?
SOLVED
I have to use a command like this:
docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0
In short, I had to change -e APP_ARGS to --env SPARK_APPLICATION_ARGS, even though -e APP_ARGS is the suggested Docker way...
Upvotes: 1
Views: 1183
Reputation: 869
This is the command that solves my problem:
docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0
I have to use --env SPARK_APPLICATION_ARGS="args1 args2 argsN" instead of -e APP_ARGS="args1 args2 argsN".
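Presumably the image's entrypoint script assembles the spark-submit call from a fixed set of environment variable names, so a variable named APP_ARGS is never read at all. A minimal sketch of what such an entrypoint might look like (the actual script inside spark-submit:2.4.0 may differ, and SPARK_MASTER_URL is only a placeholder name here):
#!/bin/bash
# Hypothetical entrypoint sketch: only the variable names the script knows
# about end up in the spark-submit command; anything else (e.g. APP_ARGS) is ignored.
/spark/bin/spark-submit \
  --class "$SPARK_APPLICATION_MAIN_CLASS" \
  --master "$SPARK_MASTER_URL" \
  "$SPARK_APPLICATION_JAR_LOCATION" \
  $SPARK_APPLICATION_ARGS
In this sketch SPARK_APPLICATION_ARGS is deliberately left unquoted so that several space-separated arguments are passed to the application as separate parameters.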
Upvotes: 1