Reputation: 438
I have a Spark job that I usually submit to a Hadoop cluster from a local machine. When I submit it with Spark 2.2.0 it works fine, but it fails to start when I submit it with version 2.4.0.
Only the SPARK_HOME makes the difference:
drwxr-xr-x 18 me 576 Jan 23 14:15 spark-2.4.0-bin-hadoop2.6
drwxr-xr-x 17 me 544 Jan 23 14:15 spark-2.2.0-bin-hadoop2.6
I submit the job like this:
spark-submit \
--master yarn \
--num-executors 20 \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 8g \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}
Why does the new Spark version refuse to run my uberjar? I did not find any relevant changes in the Spark 2.4 documentation. By the way, the jar was built with Spark 2.1 as a dependency. Any ideas?
EDIT:
I think my problem is NOT related to Spark failing to find things in my uberjar. Rather, I might have a problem with the new built-in Avro functionality. As before, I read Avro files using the implicit function spark.read.avro from com.databricks.spark.avro._. Spark 2.4.0 ships new built-in Avro support (most of it found in org.apache.spark:spark-avro_2.11:2.4.0). The failure might have something to do with this; the actual error I get is:
java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at myproject.io.TrainingFileIO.readVectorAvro(TrainingFileIO.scala:59)
at myproject.training.MainTraining$.train(MainTraining.scala:37)
So I think the problem lies deeper.
Upvotes: 1
Views: 3895
Reputation: 438
It seems Spark 2.4.0 needs --packages org.apache.spark:spark-avro_2.11:2.4.0
in order to run the old com.databricks.spark.avro code. The new data source is described here: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
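As a sketch, the submit command from the question would then look like this — unchanged except for the added --packages flag (the coordinates assume Spark 2.4.0 on Scala 2.11; adjust them to your build):

```shell
spark-submit \
--master yarn \
--num-executors 20 \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 8g \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}
```

With --packages, spark-submit resolves the artifact (and its transitive dependencies) from Maven Central and puts it on both the driver and executor classpaths, so org.apache.spark.sql.avro.AvroFileFormat can be found at runtime.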
So my problem did not have anything to do with a missing class in my jar; it was caused by the new built-in Avro support in the new Spark version.
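If you control the application code, you can also drop the databricks implicit entirely and use the built-in reader. A minimal sketch, assuming the spark-avro package is on the classpath as above; the path and object name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object AvroReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-read").getOrCreate()

    // Old style, via the external databricks package and its implicit:
    //   import com.databricks.spark.avro._
    //   val df = spark.read.avro("/path/to/data.avro")

    // Spark 2.4+ built-in style; "avro" resolves to
    // org.apache.spark.sql.avro.AvroFileFormat when spark-avro is present:
    val df = spark.read.format("avro").load("/path/to/data.avro")
    df.show()

    spark.stop()
  }
}
```

Note this only runs on a cluster (or local Spark install) where the spark-avro package has been provided via --packages or an equivalent dependency.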
Upvotes: 6