Reputation: 369
Spark can use the Hadoop S3A file system org.apache.hadoop.fs.s3a.S3AFileSystem. By adding the following to conf/spark-defaults.conf, I can get spark-shell to log to the S3 bucket:
spark.jars.packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.eventLog.enabled true
spark.eventLog.dir s3a://spark-logs-test/
spark.history.fs.logDirectory s3a://spark-logs-test/
spark.history.provider org.apache.hadoop.fs.s3a.S3AFileSystem
Spark History Server also loads configuration from conf/spark-defaults.conf, but it seems not to load the spark.jars.packages configuration, and throws ClassNotFoundException:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:256)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
The Spark source code for loading configuration differs between SparkSubmitArguments.scala and HistoryServerArguments.scala; in particular, HistoryServerArguments does not seem to load packages at all.
Is there a way to add the org.apache.hadoop.fs.s3a.S3AFileSystem dependency to the History Server?
Upvotes: 7
Views: 11433
Reputation: 856
The OP's question and answer are useful and still work fine.
I'm leaving some notes here for anyone working with a Spark 3.x version.
Let's take a look at Bitnami's Spark 3.5.1 image.
It already ships the JAR files needed for s3a:
$ docker run --name spark-test -it --rm bitnami/spark:3.5.1 ls -al jars/
...
aws-java-sdk-bundle-1.12.262.jar
guava-14.0.1.jar
hadoop-aws-3.3.4.jar
...
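To confirm the same JARs are present in your own installation, a quick check (assuming a standard Spark layout with $SPARK_HOME set) is:
ls $SPARK_HOME/jars | grep -E 'hadoop-aws|aws-java-sdk|guava'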
Here is a good reference for the OP's question.
This is not directly related to the OP's problem, but keep in mind: Spark uses Hadoop by default, and the three JARs above are the dependencies needed to access s3a:// paths.
You also need to supply access credentials (e.g. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
Don't forget to add these lines to spark-defaults.conf, or pass them as VM options:
# I'm using minio for s3a
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.hadoop.fs.s3a.access.key xxx
spark.hadoop.fs.s3a.secret.key xxx
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.fast.upload true
# for the driver
spark.eventLog.dir s3a://spark-logs-test/
# for the Spark History Server
spark.history.fs.logDirectory s3a://spark-logs-test/
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true \
 -Dspark.eventLog.dir=s3a://spark-logs-test/ \
 -Dspark.history.fs.logDirectory=s3a://spark-logs-test/ \
 -Dspark.hadoop.fs.s3a.endpoint=http://minio:9000 \
 -Dspark.hadoop.fs.s3a.access.key=xxx \
 -Dspark.hadoop.fs.s3a.secret.key=xxx \
 -Dspark.hadoop.fs.s3a.path.style.access=true \
 -Dspark.hadoop.fs.s3a.connection.ssl.enabled=false \
 -Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
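This variable is read by Spark's own launch script, so one way to apply it (a sketch, assuming a standard Spark layout; the same assignment can also go in conf/spark-env.sh) is:
# assumes SPARK_HISTORY_OPTS was exported as above, e.g. from conf/spark-env.sh
$SPARK_HOME/sbin/start-history-server.sh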
If you use a path like s3a://spark-history/logs/, create the bucket and directory and upload at least one file to that path first; MinIO ignores empty directories.
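As a sketch of that preparation step, assuming the AWS CLI is pointed at the same MinIO endpoint and credentials (the placeholder file name is purely illustrative):
# create the bucket and put one object under the prefix so the path is not empty
aws --endpoint-url http://minio:9000 s3 mb s3://spark-history
aws --endpoint-url http://minio:9000 s3 cp ./placeholder.txt s3://spark-history/logs/placeholder.txt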
Upvotes: 0
Reputation: 53
I added the following jars into my SPARK_HOME/jars directory and it works great:
Edit :
And my spark-defaults.conf has the following 3 parameters set:
spark.eventLog.enabled : true
spark.eventLog.dir : s3a://bucket_name/folder_name
spark.history.fs.logDirectory : s3a://bucket_name/folder_name
Upvotes: 1
Reputation: 1272
On EMR emr-5.16.0:
I've added the following to my cluster bootstrap:
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/
Then in the config of the cluster:
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://some/path',
        'spark.history.fs.logDirectory': 's3a://some/path',
        'spark.eventLog.enabled': 'true'
    }
}
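For reference, a hedged sketch of passing that classification when creating the cluster with the AWS CLI; the spark-defaults.json file name, bootstrap script path and instance settings are illustrative, and the block above would need to be saved as a double-quoted JSON list for the CLI:
aws emr create-cluster \
  --name spark-history-s3a \
  --release-label emr-5.16.0 \
  --applications Name=Spark \
  --configurations file://spark-defaults.json \
  --bootstrap-actions Path=s3://my-bucket/copy-s3a-jars.sh \
  --use-default-roles \
  --instance-type m4.large \
  --instance-count 3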
If you're going to test this, first stop the spark history server:
sudo stop spark-history-server
Make the config changes:
sudo vim /etc/spark/conf.dist/spark-defaults.conf
Then run the JAR-copying commands from above.
Then restart the spark history server:
sudo /usr/lib/spark/sbin/start-history-server.sh
Thanks for the answers above!
Upvotes: 2
Reputation: 369
Did some more digging and figured it out. Here's what was wrong:
The JARs necessary for S3A can be added to $SPARK_HOME/jars (as described in SPARK-15965); a sketch of fetching them is shown below.
The line
spark.history.provider org.apache.hadoop.fs.s3a.S3AFileSystem
in $SPARK_HOME/conf/spark-defaults.conf will cause the following exception:
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(org.apache.spark.SparkConf)
That line can be safely removed, as suggested in this answer.
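A minimal sketch of fetching those JARs from Maven Central, assuming the same versions as the question's spark.jars.packages line (match the hadoop-aws version to the Hadoop build your Spark ships with):
cd $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar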
To summarize:
I added the following JARs to $SPARK_HOME/jars:
and added this line to $SPARK_HOME/conf/spark-defaults.conf:
spark.history.fs.logDirectory s3a://spark-logs-test/
You'll need some other configuration to enable logging in the first place, but once the S3 bucket has the logs, this is the only configuration that is needed for the History Server.
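With the JARs in place and that line set, a quick sanity check (assuming a standard distribution; 18080 is the History Server's default port) is:
$SPARK_HOME/sbin/start-history-server.sh
# then open http://localhost:18080 and the applications logged to the bucket should be listed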
Upvotes: 10