Reputation: 299
I have created an external table pointing to Azure ADLS with Parquet storage, and while inserting data into that table I am getting the error below. I am using Databricks for the execution.
org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
This was working perfectly fine yesterday, and I only started getting this error today.
I couldn't find any answer on the internet as to why this is happening.
Upvotes: 3
Views: 7913
Reputation: 51
For me, this issue was occurring on EMR. This error suggests that Spark is finding more than one class that can be used to read Parquet files and is unable to decide which one to use automatically.
My issue was resolved after changing the scope of the spark-sql dependency to provided in pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
The issue can also be solved by specifying the fully qualified class name on read:
spark.read.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").load("<path_to_parquet_file>")
However, this requires changes in multiple places in the codebase if you have many such read statements.
Upvotes: 0
Reputation: 413
You may be running Spark 2 while the dependencies in your jar are built against Spark 3.
Ensure that the Spark version is consistent between the dependencies in your jar and the Spark version you are running.
You can check the Spark version you are running in Scala with:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder().appName("Spark Version Checker").getOrCreate()
val version: String = spark.version
println(s"Spark version: $version")
Then, depending on your build tool, you can investigate the jars pulled in by your dependency tree. If you are using Buck or Bazel, you can list the dependencies of your build target with:
buck query "deps(target)" > dependencies.list
And then view dependencies.list to identify the versions of spark that exist in your jar.
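If you build with Maven instead (adjust for your own build tool), the standard dependency plugin can show which Spark artifacts end up on the classpath:
mvn dependency:tree -Dincludes=org.apache.spark
Any mismatch between the spark-sql version shown there and the version printed by spark.version above points to the conflict causing this error.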
Upvotes: 0
Reputation: 1
I had a similar issue, caused by a jar dependency conflict. I package my Spark jar with Maven using the maven-shade-plugin, and configuring the plugin to exclude the conflicting jars worked for me.
This is the relevant part of pom.xml:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.0</version>
    <configuration>
        <artifactSet>
            <excludes>
                <exclude>org.scala-lang:*:*</exclude>
                <exclude>org.apache.spark:*:*</exclude>
            </excludes>
        </artifactSet>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>spark_anticheat_shaded</finalName>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
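With this configuration in place, an ordinary package build (assuming a standard Maven project layout) produces the shaded jar without the excluded Spark and Scala classes:
mvn clean package
# the shaded artifact is written to target/spark_anticheat_shaded.jar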
Upvotes: 0
Reputation: 21
You may have more than one spark-sql jar in the spark/jars/ directory, for example spark-sql_2.12-2.4.4.jar and spark-sql_2.12-3.0.3.jar, which can lead to this multiple-sources issue.
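A quick way to check (a sketch, assuming SPARK_HOME points at your Spark installation):
ls $SPARK_HOME/jars/ | grep spark-sql
# more than one spark-sql_*.jar listed here means duplicate Parquet sources on the classpath
Remove the version you do not need and restart Spark.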
Upvotes: 2
Reputation: 1391
If you want a workaround without cleaning up dependencies, you can pick one of the sources explicitly (exemplified here with "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"):
Replace:
spark.read.parquet("<path_to_parquet_file>")
With
spark.read.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").load("<path_to_parquet_file>")
Upvotes: 2
Reputation: 299
This issue has been fixed. The cause of the error was that we had installed the Spark SQL DB connector provided by Azure as an uber jar, which also bundled its own Parquet data source dependencies.
Upvotes: 1