lucy

Reputation: 4506

Spark s3 write (s3 vs s3a connectors)

I am working on a job that runs on EMR and saves thousands of partitions to S3. The partitions are year/month/day.

I have data from the last 50 years. When Spark writes 10,000 partitions using the s3a connector, it takes around one hour, which is extremely slow.

df.repartition($"year", $"month", $"day").write.mode("append").partitionBy("year", "month", "day").parquet("s3a://mybucket/data")

Then I tried the s3 prefix instead, and it took only a few minutes to save all the partitions to S3.

df.repartition($"year", $"month", $"day").write.mode("append").partitionBy("year", "month", "day").parquet("s3://mybucket/data")

When I overwrote 1000 partitions, s3 was also very fast compared to s3a:

 df
   .repartition($"year", $"month", $"day")
   .write
   .option("partitionOverwriteMode", "dynamic")
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("s3://mybucket/data")

As per my understanding, s3a is the more mature connector and the one currently in use, while s3/s3n are old, deprecated connectors. So I am wondering which one to use. Should I use s3? What is the best S3 connector or S3 URI to use with EMR jobs that save data to S3?

Upvotes: 4

Views: 5589

Answers (2)

Zach

Reputation: 958

In case someone is in the same boat: when s3a:// is used, EMR Spark writes to any SSE-KMS-enabled S3 bucket with the default AWS-managed KMS key, regardless of the following (a sketch of the s3a-side settings in question follows the list):

  1. the bucket's default KMS key settings
  2. the KMS key specified in the EMR configuration (e.g. here)
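For reference, here is a minimal sketch of the standard Hadoop S3A encryption properties one would normally set to pick the KMS key; per the behaviour described above, EMR ignores them when writing via s3a://. The key ARN is a hypothetical placeholder.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3a-sse-kms").getOrCreate()
    val hconf = spark.sparkContext.hadoopConfiguration

    // Ask S3A to encrypt new objects with SSE-KMS under a specific key.
    hconf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    // Hypothetical key ARN, for illustration only.
    hconf.set("fs.s3a.server-side-encryption.key",
      "arn:aws:kms:us-east-1:111122223333:key/example-key-id")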

Upvotes: 1

jvian

Reputation: 83

As Stevel pointed out, the s3:// connector used in Amazon EMR is built by Amazon for EMR to interact with S3, and it is the recommended way to do so according to the Amazon EMR documentation, "Work with storage and file systems":

Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

Some more interesting background: the Apache Hadoop community also developed its own S3 connectors, of which s3a:// is the actively maintained one. Confusingly, the Hadoop community once shipped a connector that was also named s3://. From the Hadoop docs (a sketch for checking which connector your own cluster uses follows the quoted list):

There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.

  1. Apache’s Hadoop’s original s3:// client. This is no longer included in Hadoop.
  2. Amazon EMR’s s3:// client. This is from the Amazon EMR team, who actively maintain it.
  3. Apache’s Hadoop’s s3n: filesystem client. This connector is no longer available: users must migrate to the newer s3a: client.
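You can inspect the Hadoop configuration from a spark-shell to see which FileSystem implementation each scheme is wired to; a minimal sketch, where the expected values on EMR are an assumption based on the docs above:

    // Print the FileSystem implementation registered for each URI scheme.
    // On EMR, fs.s3.impl typically points at EMRFS
    // (com.amazon.ws.emr.hadoop.fs.EmrFileSystem), while fs.s3a.impl
    // defaults to org.apache.hadoop.fs.s3a.S3AFileSystem.
    val conf = spark.sparkContext.hadoopConfiguration
    println(s"s3  -> ${conf.get("fs.s3.impl")}")
    println(s"s3a -> ${conf.get("fs.s3a.impl")}")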

Upvotes: 7
