tusher

Reputation: 63

Spark Write to S3 Storage Option

I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved files is STANDARD, but I need it to be STANDARD_IA. What is the option to achieve this? I have looked through the Spark source code and found no such option on DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Below is the code I am using to write to S3:

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)

Edit: As a workaround, I am now using CopyObjectRequest to change the storage class of the created parquet files after the write:

val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
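
For completeness, a fuller sketch of that workaround, assuming the AWS SDK for Java v1 (the same SDK the snippet above uses); bucket and prefix are hypothetical placeholders for the path Spark wrote to:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}
import scala.collection.JavaConverters._

val bucket = "my-bucket"        // placeholder
val prefix = "path/to/output/"  // placeholder: the prefix Spark wrote to

val s3Client = AmazonS3ClientBuilder.defaultClient()

// Spark writes several objects under the output path (part files, _SUCCESS),
// so copy every key under the prefix in place, changing only its storage class.
// Note: listObjectsV2 returns at most 1000 keys per call; a real job would
// follow the continuation token for larger outputs.
s3Client.listObjectsV2(bucket, prefix).getObjectSummaries.asScala.foreach { summary =>
  val request = new CopyObjectRequest(bucket, summary.getKey, bucket, summary.getKey)
    .withStorageClass(StorageClass.StandardInfrequentAccess)
  s3Client.copyObject(request)
}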

Upvotes: 2

Views: 2012

Answers (2)

Rajath B

Reputation: 1

Adding the configuration below worked for me on EMR (emr-6.15.0, with Hadoop 3.3.6 and Spark 3.4.1).

The general S3A client configuration, from the Hadoop documentation:

<property>
  <name>fs.s3a.create.storage.class</name>
  <value></value>
  <description>
      Storage class: standard, reduced_redundancy, intelligent_tiering, etc.
      Specify the storage class for S3A PUT object requests.
      If not set the storage class will be null
      and mapped to default standard class on S3.
  </description>
</property>

https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

The corresponding EMR configuration in JSON format:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3a.create.storage.class": "intelligent_tiering"
    }
  }
]

And in PySpark, write using the s3a scheme:

df.repartition(1).write.format("parquet").mode("append").save("s3a://bucket/path/")
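
If editing the cluster configuration is not an option, the same key can also be set when building the session. A minimal sketch in Scala, relying on Spark's standard spark.hadoop. prefix, which forwards properties into the Hadoop configuration that the S3A filesystem reads:

import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are copied into the Hadoop configuration,
// so the S3A connector picks up the storage class for its PUT requests.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.create.storage.class", "intelligent_tiering")
  .getOrCreate()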

Upvotes: 0

stevel

Reputation: 13430

As of July 2022, this has been implemented in the Hadoop source tree in HADOOP-12020 by AWS S3 engineers.

It is still stabilising and should be out in the next feature release of Hadoop 3.3.x, due late 2022.

  • Anyone reading this before it ships: the code is there to build yourself.
  • Anyone reading this in 2023+: upgrade to Hadoop 3.3.5 or later and set the option as in the sketch below.
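
A minimal sketch of what that looks like with hadoop-aws 3.3.5+ on the classpath; spark and df are as in the question, the bucket path is a placeholder, and standard_ia is one of the values the Hadoop documentation lists for this option:

// Set the storage class on the Hadoop configuration before writing.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.create.storage.class", "standard_ia")

df.coalesce(1).write.mode("overwrite").parquet("s3a://bucket/path/")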

Upvotes: 2
