Reputation: 21
Using image version 1.3-debian9 shows the jars are available (attached screenshot).
Using image version preview (1.4-debian9) gives the following error message (attached screenshot):
Py4JJavaError: An error occurred while calling o60.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Command to create Dataproc cluster:
gcloud dataproc clusters create ${CLUSTER_NAME} --bucket ${BUCKET} --zone us-east1-d --master-machine-type n1-standard-4 --master-boot-disk-size 1TB --num-workers 3 --worker-machine-type n1-standard-4 --worker-boot-disk-size 1TB --image-version=preview --scopes 'https://www.googleapis.com/auth/cloud-platform' --project ${PROJECT} --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh,gs://dataproc-initialization-actions/connectors/connectors.sh --metadata 'gcs-connector-version=1.9.16' --metadata 'bigquery-connector-version=0.13.16' --optional-components=ANACONDA,JUPYTER
Screenshots: 1.3-debian9 1.4-debian9
Upvotes: 2
Views: 216
Reputation: 10687
This is related to the same root cause described in Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket.
In particular, in Dataproc 1.4 using Spark 2.4, the Hadoop dependencies are now bundled under /usr/lib/spark/lib, and in general these dependency versions may differ from the versions of the Hadoop classes found under /usr/lib/hadoop/lib, /usr/lib/hadoop-mapreduce/lib, etc. The issue here is that some of the auxiliary dependencies, like the AWS connectors (and probably the Azure connectors, etc.), are not automatically included in the Spark-provided lib dir by default.
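You can see the gap for yourself with a quick listing on a 1.4 node (a sketch, assuming the directory layout described above); expect matches in the Hadoop directory but not in the Spark-provided one:
#!/bin/bash
# Expect hadoop-aws and aws-java-sdk-bundle jars in the first listing on Dataproc 1.4,
# and no matches in the second.
ls /usr/lib/hadoop-mapreduce/ | grep -iE 'hadoop-aws|aws-java-sdk'
ls /usr/lib/spark/lib/ | grep -iE 'hadoop-aws|aws-java-sdk'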
However, neither of the answers in that question is ideal. Downloading your own version of the AWS jarfile might be a hassle, and you might also run into version incompatibility issues. Alternatively, adding the full hadoop classpath to the SPARK_DIST_CLASSPATH pollutes your Spark classpath with a full duplicate set of Hadoop dependencies, which might also cause version incompatibility issues (and defeats the purpose of the change toward packaging Spark's own copy of the Hadoop dependencies).
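For contrast, that classpath-wide workaround boils down to roughly the following line in /etc/spark/conf/spark-env.sh (a rough sketch of the discouraged approach, not what I'd recommend here):
# Discouraged: drags every Hadoop jar, duplicates and all, onto Spark's classpath.
SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:$(hadoop classpath)"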
Instead, you can use a Dataproc initialization action with the following:
#!/bin/bash
# Name this something like add-aws-classpath.sh
AWS_SDK_JAR=$(ls /usr/lib/hadoop-mapreduce/aws-java-sdk-bundle-*.jar | head -n 1)
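# Append the hadoop-aws connector and the matching AWS SDK bundle to Spark's classpath by extending spark-env.sh: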
cat << EOF >> /etc/spark/conf/spark-env.sh
SPARK_DIST_CLASSPATH="\${SPARK_DIST_CLASSPATH}:/usr/lib/hadoop-mapreduce/hadoop-aws.jar"
SPARK_DIST_CLASSPATH="\${SPARK_DIST_CLASSPATH}:${AWS_SDK_JAR}"
EOF
Then use it at cluster creation time with --initialization-actions=gs://mybucket/add-aws-classpath.sh, and the s3a read should work again.
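If you want to confirm the init action took effect, a quick sanity check (a sketch, run after SSHing into the master node) is to look at what got appended and verify the jars it points at exist:
# Should show the two SPARK_DIST_CLASSPATH lines appended by the init action.
grep SPARK_DIST_CLASSPATH /etc/spark/conf/spark-env.sh
# Should list the hadoop-aws connector and the AWS SDK bundle jar.
ls -l /usr/lib/hadoop-mapreduce/hadoop-aws.jar /usr/lib/hadoop-mapreduce/aws-java-sdk-bundle-*.jar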
In general, you can diff the contents of /usr/lib/hadoop-mapreduce/lib against the contents of /usr/lib/spark/lib; you should see that things like the hadoop-mapreduce jar are present in Spark in Dataproc 1.4 but not in 1.3. So, if you run into any other missing jars, you can take the same approach to supplement the SPARK_DIST_CLASSPATH.
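For example, a quick way to spot those differences by jar filename (a sketch, assuming both directories exist on the node) is:
# Lines prefixed with < are only under hadoop-mapreduce; lines prefixed with > are only under Spark.
diff <(ls /usr/lib/hadoop-mapreduce/lib | sort) <(ls /usr/lib/spark/lib | sort)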
Dataproc may decide to patch this up by default in a future 1.4 patch version, but using the init action should be innocuous whether or not the underlying system also adds those classpaths.
Upvotes: 3