Reputation: 9169
I'm trying to run a simple Spark-to-S3 app from a server, but I keep getting the error below because the server has Hadoop 2.7.3 installed, and it looks like that version doesn't include the GlobalStorageStatistics class. I have Hadoop 2.8.x defined in my pom.xml file, but I'm trying to test the app by running it locally.
How can I make it skip looking for that class, or what workarounds are there to include it if I have to stay on Hadoop 2.7.3?
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
at com.ibm.cos.jdbc2DF$.main(jdbc2DF.scala:153)
at com.ibm.cos.jdbc2DF.main(jdbc2DF.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 28 more
Upvotes: 12
Views: 41149
Reputation: 13490
You can't mix bits of Hadoop and expect things to work. It's not just the close coupling between internal classes in hadoop-common and hadoop-aws; it's things like the specific version of the Amazon SDK the hadoop-aws module was built with.
If you get ClassNotFoundException or MethodNotFoundException stack traces when trying to work with s3a:// URLs, a JAR version mismatch is the likely cause.
Using the RFC 2119 MUST/SHOULD/MAY terminology, here are the rules to avoid this situation:
The hadoop-aws JAR MUST be exactly the same version as the hadoop-common JAR. If it isn't, you will see a ClassNotFoundException or MethodNotFoundException indicating a mismatch in hadoop-common and hadoop-aws; the missing symbol is simply the first thing referenced by org.apache.hadoop.fs.s3a.S3AFileSystem which the classloader can't find - the exact class depends on the mismatch of JARs.
The AWS SDK MUST be the aws-sdk-bundle JAR that your version of hadoop-aws was built against; swapping in a different bundle.jar is another route to a ClassNotFoundException.
You SHOULD NOT add the spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem declaration most Spark+S3A examples on SO do. That is a superstition which is passed on from one SO post to the next. That binding information is set in the XML file core-default.xml in hadoop-common.jar; manually setting it can only break things.
Finally: the ASF documentation: The S3A Connector.
Note: that link is to the latest release. If you are using an older release it will lack features. Upgrade before complaining that the s3a connector doesn't do what the documentation says it does.
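A quick way to see which hadoop-common your Spark runtime has actually loaded (and therefore which hadoop-aws version it needs to be paired with) is to ask the JVM for its Hadoop build. Below is a minimal PySpark sketch of that check; in a Scala or Java job the equivalent call is org.apache.hadoop.util.VersionInfo.getVersion(). Note that _jvm is a private PySpark attribute, used here purely as a diagnostic:
import pyspark.sql

# Start a local session and ask the JVM which Hadoop build it loaded.
spark = pyspark.sql.SparkSession.builder.master('local[*]').getOrCreate()
print('Spark version: ', spark.version)
print('Hadoop version:', spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())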
Upvotes: 35
Reputation: 179
I found stevel's answer above to be extremely helpful. His information inspired my write-up here. I will copy the relevant parts below. My answer is tailored to a Python/Windows context, but I suspect most points are still relevant in a JVM/Linux context.
This answer is intended for Python developers, so it assumes we will install Apache Spark indirectly via pip. When pip installs PySpark, it collects most dependencies automatically, as seen in .venv/Lib/site-packages/pyspark/jars. However, to enable the S3A Connector, we must track down the following dependencies manually:
hadoop-aws
aws-java-sdk-bundle
winutils.exe (and hadoop.dll) <-- Only needed on Windows

Assuming we're installing Spark via pip, we can't pick the Hadoop version directly. We can only pick the PySpark version, e.g. pip install pyspark==3.1.3, which will indirectly determine the Hadoop version. For example, PySpark 3.1.3 maps to Hadoop 3.2.0.
All Hadoop JARs must have the exact same version, e.g. 3.2.0. Verify this with cd pyspark/jars && ls -l | grep hadoop. Notice that pip install pyspark automatically included some Hadoop JARs. Thus, if these Hadoop JARs are 3.2.0, then we should download hadoop-aws:3.2.0 to match.
winutils.exe must have the exact same version as Hadoop, e.g. 3.2.0. Beware, winutils releases are scarce. Thus, we must carefully pick our PySpark/Hadoop version such that a matching winutils version exists. Some PySpark/Hadoop versions do not have a corresponding winutils release, thus they cannot be used on Windows.
aws-java-sdk-bundle must be compatible with our hadoop-aws choice above. For example, hadoop-aws:3.2.0 depends on aws-java-sdk-bundle:1.11.375, which can be verified here.
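As a programmatic alternative to the ls -l | grep hadoop check above, this minimal sketch lists the Hadoop JARs bundled with the installed PySpark; the path is derived from the pyspark package itself, so it works from any venv location:
import glob
import os
import pyspark

# Locate the jars directory inside the installed pyspark package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), 'jars')

# Every hadoop-*.jar listed here should carry the same version suffix, e.g. 3.2.0;
# that is the hadoop-aws version (and matching aws-java-sdk-bundle) to download.
for jar in sorted(glob.glob(os.path.join(jars_dir, 'hadoop-*.jar'))):
    print(os.path.basename(jar))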
With the above constraints in mind, here is a reliable algorithm for installing PySpark with S3A support on Windows:
1. Find the latest available version of winutils.exe here. At time of writing, it is 3.2.0. Place it at C:/hadoop/bin. Set environment variable HADOOP_HOME to C:/hadoop and (important!) add %HADOOP_HOME%/bin to PATH.
2. Find the latest available version of PySpark that uses a Hadoop version equal to the above, e.g. 3.2.0. This can be determined by browsing PySpark's pom.xml file across each release tag. At time of writing, it is 3.1.3.
3. Find the version of aws-java-sdk-bundle that hadoop-aws requires. For example, if we're using hadoop-aws:3.2.0, then we can use this page. At time of writing, it is 1.11.375.
4. Create a venv, install the PySpark version from step 2, and download the JARs and winutils files from steps 1 and 3:
python -m venv .venv
source .venv/Scripts/activate
pip install pyspark==3.1.3
cd .venv/Lib/site-packages/pyspark/jars
ls -l | grep hadoop
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
cd C:/hadoop/bin
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/winutils.exe
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/hadoop.dll
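Before firing up Spark, a quick sanity check of the pieces above can save a debugging round trip. This is a hedged sketch that assumes the paths used in this answer (HADOOP_HOME pointing at C:/hadoop and the JARs dropped into pyspark/jars); adjust the names if yours differ:
import glob
import os
import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), 'jars')

# HADOOP_HOME should be set, and on Windows bin/winutils.exe should exist under it.
hadoop_home = os.environ.get('HADOOP_HOME')
print('HADOOP_HOME =', hadoop_home)
if hadoop_home:
    print('winutils.exe found:', os.path.exists(os.path.join(hadoop_home, 'bin', 'winutils.exe')))

# The manually downloaded JARs should sit next to the bundled Hadoop JARs,
# and all hadoop-*.jar versions should match.
for pattern in ('hadoop-*.jar', 'aws-java-sdk-bundle-*.jar'):
    print(pattern, '->', [os.path.basename(p) for p in glob.glob(os.path.join(jars_dir, pattern))])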
To verify your setup, try running the following script.
import pyspark.sql

spark = (pyspark.sql.SparkSession.builder
         .appName('my_app')
         .master('local[*]')
         .config('spark.hadoop.fs.s3a.access.key', 'secret')
         .config('spark.hadoop.fs.s3a.secret.key', 'secret')
         .getOrCreate())
# Test reading from S3.
df = spark.read.csv('s3a://my-bucket/path/to/input/file.csv')
print(df.head(3))
# Test writing to S3.
df.write.csv('s3a://my-bucket/path/to/output')
You'll need to substitute your AWS keys and S3 paths accordingly.
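If you'd rather not hardcode the keys in the script, one variant (my own suggestion, not part of the original write-up) is to read them from environment variables; the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY names below are just the conventional ones and can be whatever you export in your shell:
import os
import pyspark.sql

# Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment.
spark = (pyspark.sql.SparkSession.builder
         .appName('my_app')
         .master('local[*]')
         .config('spark.hadoop.fs.s3a.access.key', os.environ['AWS_ACCESS_KEY_ID'])
         .config('spark.hadoop.fs.s3a.secret.key', os.environ['AWS_SECRET_ACCESS_KEY'])
         .getOrCreate())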
If you recently updated your OS environment variables, e.g. HADOOP_HOME and PATH, you might need to close and re-open VSCode to reflect that.
Upvotes: 14