Reputation: 131
I am trying to run a remote Spark job through IntelliJ against a Spark HDInsight cluster (HDI 4.0). In my Spark application, I am trying to read an input stream from a folder of parquet files in Azure Blob Storage using Spark Structured Streaming's built-in readStream function.
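For context, the read looks roughly like this (a minimal sketch rather than my exact code; the wasbs:// container path and the schema are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder.appName("BlobStreamReader").getOrCreate()

// File-based streaming sources require the schema to be declared up front
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

// Placeholder container/account; the real path points at the parquet folder
val input = spark.readStream
  .schema(schema)
  .parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/input/")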
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4 prebuilt for Hadoop 2.7.
To fix this, I SSH into each head and worker node of the cluster and manually downgrade the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I am able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster type that can support Spark 2.4.
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict than manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
Upvotes: 5
Views: 11273
Reputation: 1425
I faced the same issue because I had missed the % "provided" declarations in my build.sbt file. When I ran sbt clean assembly, it created a fat JAR bundling the Spark versions from my local environment.
Previously
libraryDependencies ++= {
  val sparkVersion = "2.3.0"
  Seq(
    "com.typesafe" % "config" % "1.3.1",
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "org.apache.spark" %% "spark-hive" % sparkVersion
  )
}
Now
libraryDependencies ++= {
  val sparkVersion = "2.3.0"
  Seq(
    "com.typesafe" % "config" % "1.3.1",
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
  )
}
provided - This indicates that the specified dependency is expected to be provided by the Spark cluster or another environment at runtime. It means that you don't want to include this dependency in your project's JAR file because it's already available externally.
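One side effect of marking these provided is that sbt run will no longer find the Spark classes locally. A common workaround (the pattern documented in the sbt-assembly README, assuming you build with sbt-assembly) is to add the provided dependencies back to the run task's classpath:

// Re-include "provided" dependencies when running locally with `sbt run`
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated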
Upvotes: 0
Reputation: 131
After troubleshooting the methods I had previously attempted, I came across the following fix:
In my pom.xml, I excluded the hadoop-client dependency transitively imported by the spark-core jar. This dependency was version 2.6.5, which conflicted with the cluster's version of Hadoop. Instead, I import the version I require.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.version.major}</artifactId>
  <version>${spark.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
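For reference, the ${...} placeholders above resolve from the pom's properties block. The values below are hypothetical, for illustration only, roughly matching an HDI 4.0 / Spark 2.4 setup; use the versions matching your own cluster:

<properties>
  <!-- hypothetical versions for illustration; match these to your cluster -->
  <scala.version.major>2.11</scala.version.major>
  <spark.version>2.4.0</spark.version>
  <hadoop.version>3.1.1</hadoop.version>
</properties>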
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I changed the winutils.exe version I had under C://winutils/bin to the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to successfully read data from blob storage as expected.
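For anyone hitting the same UnsatisfiedLinkError: Hadoop's shell utilities locate winutils.exe through the hadoop.home.dir system property (or the HADOOP_HOME environment variable), so the folder holding the matching binaries has to be discoverable. A minimal sketch, assuming the same C://winutils layout as above:

// Point Hadoop at the directory whose bin/ contains winutils.exe and
// hadoop.dll matching the cluster's Hadoop version. hadoop.dll must also
// be loadable, e.g. by having C:\winutils\bin on PATH.
System.setProperty("hadoop.home.dir", "C:\\winutils")

val spark = org.apache.spark.sql.SparkSession.builder
  .appName("LocalTest")
  .master("local[*]")
  .getOrCreate()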
TLDR
The issue was the automatically imported hadoop-client dependency, which was fixed by excluding it and adding the new winutils.exe and hadoop.dll under C://winutils/bin. Fixing it this way no longer required downgrading the Hadoop versions within the HDInsight cluster or changing my downloaded Spark version.
Upvotes: 5
Reputation: 133
Problem: I was facing the same issue while running a fat JAR built with Hadoop 2.7 and Spark 2.4 on a cluster with Hadoop 3.x. I was using the Maven Shade plugin.
Observation: While building the fat JAR, it was including org.apache.hadoop:hadoop-hdfs:jar:2.6.5, which contains the class org.apache.hadoop.hdfs.web.HftpFileSystem. This was causing the problem on Hadoop 3.
Solution: I excluded this jar while building the fat JAR, as shown below, and the issue was resolved.
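A sketch of what that exclusion can look like via the Shade plugin's artifactSet configuration (exact plugin setup and placement assumed, not copied from my build):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <excludes>
        <!-- keep the Hadoop 2.x HDFS classes (incl. HftpFileSystem) out of the fat jar -->
        <exclude>org.apache.hadoop:hadoop-hdfs</exclude>
      </excludes>
    </artifactSet>
  </configuration>
</plugin>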
Upvotes: 2