Antalagor

Reputation: 438

Spark 2.4 com.databricks.spark.avro troubleshooting

I have a Spark job that I usually submit to a Hadoop cluster from a local machine. When I submit it with Spark 2.2.0 it works fine, but it fails to start when I submit it with version 2.4.0. Only the SPARK_HOME differs:

drwxr-xr-x  18 me  576 Jan 23 14:15 spark-2.4.0-bin-hadoop2.6
drwxr-xr-x  17 me  544 Jan 23 14:15 spark-2.2.0-bin-hadoop2.6

I submit the job like this:

spark-submit \
--master yarn \
--num-executors 20 \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 8g \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}

Why does the new Spark version refuse to take my uberjar? I did not find any relevant changes in the Spark 2.4 docs. Btw: the jar was built with Spark 2.1 as a dependency. Any ideas?

EDIT: I think my problem is NOT related to Spark failing to find things in my uberjar. Rather, I might have a problem with the new built-in Avro functionality. As before, I read Avro files using the implicit function spark.read.avro from com.databricks.spark.avro._. Spark 2.4.0 has some new built-in Avro support (most of it in org.apache.spark:spark-avro_2.11:2.4.0). The failure might have something to do with this.

java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at myproject.io.TrainingFileIO.readVectorAvro(TrainingFileIO.scala:59)
at myproject.training.MainTraining$.train(MainTraining.scala:37)

So I think the problem lies deeper. The actual error I get is the ClassNotFoundException shown above.
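For context, the failing read is the classic databricks-avro pattern. A minimal sketch of what that code looks like (the path and variable names here are illustrative, not from my actual project):

```scala
// Old-style Avro read via the databricks-avro implicit (worked on Spark 2.2):
import com.databricks.spark.avro._

// spark is an existing SparkSession; the path is a placeholder
val df = spark.read.avro("hdfs:///path/to/vectors.avro")
```

On Spark 2.4.0 this call is where the ClassNotFoundException for org.apache.spark.sql.avro.AvroFileFormat is thrown.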

Upvotes: 1

Views: 3895

Answers (1)

Antalagor

Reputation: 438

It seems Spark 2.4.0 needs --packages org.apache.spark:spark-avro_2.11:2.4.0 in order to run the old com.databricks.spark.avro code. This is described here: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
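Concretely, the original spark-submit invocation from the question only needs the extra --packages flag (everything else, including the placeholder params, stays as before):

```shell
# Same submit command as before, plus --packages so Spark 2.4 can
# resolve the Avro data source at runtime
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 20 \
--executor-memory 8g \
--driver-memory 8g \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}
```

Note that spark-avro is an external module in Spark 2.4: it is not bundled in the main distribution, which is why the jar that worked on 2.2 (where the databricks classes were inside the uberjar and self-sufficient) suddenly needs an extra package.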

So my problem didn't have anything to do with a missing class in my jar; rather, it was caused by the new built-in Avro support in the new Spark version.
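Longer term, the code can be migrated off the databricks implicit entirely and use the built-in source directly. A sketch of the new-style read, assuming an existing SparkSession and an illustrative path:

```scala
// Built-in Avro source in Spark 2.4+ (requires the spark-avro module
// on the classpath, e.g. via --packages as above):
val df = spark.read.format("avro").load("hdfs:///path/to/vectors.avro")
```

With this form there is no dependency on com.databricks.spark.avro at all, only on the org.apache.spark:spark-avro artifact matching your Spark and Scala versions.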

Upvotes: 6
