Justin Naldzin

Reputation: 21

Dataproc image version 1.4-debian9 (preview) missing AWS S3 jars (org.apache.hadoop.fs.s3a.S3AFileSystem)

Using image version 1.3-debian9 shows the jars are available (attached screenshot).

Using image version preview (1.4-debian9) gives the following error message (attached screenshot):

Py4JJavaError: An error occurred while calling o60.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Command to create Dataproc cluster:

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --bucket ${BUCKET} \
    --zone us-east1-d \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 1TB \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 1TB \
    --image-version=preview \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project ${PROJECT} \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
    --metadata 'gcs-connector-version=1.9.16' \
    --metadata 'bigquery-connector-version=0.13.16' \
    --optional-components=ANACONDA,JUPYTER

Screenshots: 1.3-debian9 1.4-debian9

Upvotes: 2

Views: 216

Answers (1)

Dennis Huo

Reputation: 10687

This is related to the same root cause described in "Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket".

In particular, in Dataproc 1.4 using Spark 2.4, the Hadoop dependencies are now bundled under /usr/lib/spark/lib, and these dependency versions may differ from the versions of the Hadoop classes found under /usr/lib/hadoop/lib, /usr/lib/hadoop-mapreduce/lib, etc. The issue here is that some of the auxiliary dependencies, such as the AWS connectors (and probably the Azure connectors, etc.), are not automatically included in the Spark-provided lib dir by default.
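You can see this by listing the directories involved (a quick sketch; the paths are the ones described above, and the AWS SDK jar name is globbed since its version may vary):

# On a Dataproc 1.4 node, the AWS connector jars live with the Hadoop packages...
ls /usr/lib/hadoop-mapreduce/hadoop-aws*.jar /usr/lib/hadoop-mapreduce/aws-java-sdk-bundle-*.jar
# ...but do not show up among the Hadoop jars bundled for Spark.
ls /usr/lib/spark/lib | grep -i aws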

However, neither of the answers in that question is ideal. Downloading your own copy of the AWS jarfile can be a hassle and can introduce version incompatibilities. Alternatively, adding the full Hadoop classpath to SPARK_DIST_CLASSPATH pollutes the Spark classpath with a complete duplicate set of Hadoop dependencies, which can also cause version incompatibilities (and defeats the point of packaging Spark's own copy of the Hadoop dependencies).

Instead, you can use a Dataproc initialization action with the following:

#!/bin/bash
# Name this something like add-aws-classpath.sh

# Pick up the AWS SDK bundle jar shipped with the Hadoop MapReduce packages
# (the filename includes a version number, so glob for it).
AWS_SDK_JAR=$(ls /usr/lib/hadoop-mapreduce/aws-java-sdk-bundle-*.jar | head -n 1)

# Append the AWS connector jars to Spark's distribution classpath.
cat << EOF >> /etc/spark/conf/spark-env.sh
SPARK_DIST_CLASSPATH="\${SPARK_DIST_CLASSPATH}:/usr/lib/hadoop-mapreduce/hadoop-aws.jar"
SPARK_DIST_CLASSPATH="\${SPARK_DIST_CLASSPATH}:${AWS_SDK_JAR}"
EOF

Then pass it at cluster creation time with --initialization-actions=gs://mybucket/add-aws-classpath.sh.
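For example (a sketch reusing the create command from the question, and assuming the script has been uploaded to gs://mybucket/add-aws-classpath.sh), the only change is appending the script to the existing --initialization-actions list:

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --bucket ${BUCKET} \
    --zone us-east1-d \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 1TB \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 1TB \
    --image-version=preview \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --project ${PROJECT} \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh,gs://dataproc-initialization-actions/connectors/connectors.sh,gs://mybucket/add-aws-classpath.sh \
    --metadata 'gcs-connector-version=1.9.16' \
    --metadata 'bigquery-connector-version=0.13.16' \
    --optional-components=ANACONDA,JUPYTER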

Then it should work again.

In general, you can diff the contents of /usr/lib/hadoop-mapreduce/lib against the contents of /usr/lib/spark/lib; you should see that things like the hadoop-mapreduce jar are present in Spark in Dataproc 1.4 but not in 1.3. So if you run into any other missing jars, you can take the same approach to supplement SPARK_DIST_CLASSPATH.
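A rough way to do that comparison on a cluster node (a sketch; the directories are the ones mentioned above):

# Compare the jar listings of the two directories to spot anything Spark is missing.
diff <(ls /usr/lib/hadoop-mapreduce/lib) <(ls /usr/lib/spark/lib)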

Dataproc may patch this up by default in a future 1.4 patch version, but using the init action should be innocuous whether or not the underlying image also adds those classpaths.

Upvotes: 3
