Reputation: 2302
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a Yarn (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application writes about 21 GB of data with 30 tasks in the corresponding Spark stage (see the image below).
I have a server that hosts the Spark client; it submits the application to the Yarn cluster in client mode using PySpark.
I am using the following configs (set via the PySpark SparkConf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar into the jars/ directory of the Spark home on the NodeManagers and on my Spark client server.
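As an alternative to copying the jar onto every node, Spark can resolve the module from Maven at startup via spark.jars.packages; a sketch, assuming the Scala 2.12 build of Spark 3.3.0 (the exact coordinates are my assumption, adjust to your build):

    from pyspark.sql import SparkSession

    # Alternative to placing spark-hadoop-cloud in jars/ by hand: resolve it
    # at startup. Coordinates assume Spark 3.3.0 / Scala 2.12; adjust as needed.
    spark = (
        SparkSession.builder
        .config("spark.jars.packages",
                "org.apache.spark:spark-hadoop-cloud_2.12:3.3.0")
        .getOrCreate()
    )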
With these settings in place, I can see signs that the committer is active:
- I see a PRE __magic/ entry if I run aws s3 ls <write-path> while the job is running.
- I no longer see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
- A _SUCCESS file gets created with (JSON) content; one of the key-value pairs I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
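As an extra check, the _SUCCESS manifest can be read directly to confirm which committer produced the output; a minimal sketch using boto3 (bucket and key are placeholders):

    import json
    import boto3

    # Read the JSON _SUCCESS manifest written by the S3A committers and
    # print the committer name; "magic" confirms the magic committer ran.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket",          # placeholder bucket
                        Key="output/path/_SUCCESS")  # placeholder key
    manifest = json.loads(obj["Body"].read())
    print(manifest.get("committer"))  # expect "magic"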
I have read in multiple articles that this committer is expected to give a performance boost (e.g. this article claims a 57-77% time reduction). Hence, I expected to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when using the above configs, but I don't observe one.
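For a rough end-to-end number rather than reading the Spark UI, the write can be timed directly; a simple sketch with a synthetic DataFrame (the output path is a placeholder):

    import time

    # Time the full write (task output + job commit) to compare committer
    # configurations; the S3 path is a placeholder.
    df = spark.range(10_000_000)  # synthetic data, for illustration only
    start = time.monotonic()
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/path")
    print(f"write took {time.monotonic() - start:.1f} s")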
"spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"
, my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
.Upvotes: 2
Views: 1573
Reputation: 13430
Grab the latest Spark+Hadoop build you can get; there are always ongoing improvements, with Hadoop 3.3.5 bringing a big enhancement there.
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).
Upvotes: 3