Reputation: 63
I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved files is STANDARD, but I need it to be STANDARD_IA. What is the option to achieve this? I have looked through the Spark source code and found no such option for Spark's DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
Below is the code I am using to write to S3:
val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)
Edit: I am now using CopyObjectRequest to change the storage class of the Parquet files after they are created:
val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
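For completeness, a minimal sketch of this copy-based workaround using the AWS SDK for Java v1. The bucket and prefix names are placeholders, and pagination of the listing is omitted for brevity:
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}

// Placeholder bucket and prefix; point these at the actual Parquet output location.
val bucket = "my-bucket"
val prefix = "output/parquet/"

val s3Client = AmazonS3ClientBuilder.defaultClient()

// A Parquet write produces several objects (part files, _SUCCESS), so copy each one
// onto itself with the desired storage class. Listing pagination is not handled here.
s3Client.listObjectsV2(bucket, prefix).getObjectSummaries.asScala.foreach { summary =>
  val request = new CopyObjectRequest(bucket, summary.getKey, bucket, summary.getKey)
    .withStorageClass(StorageClass.StandardInfrequentAccess)
  s3Client.copyObject(request)
}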
Upvotes: 2
Views: 2012
Reputation: 1
Adding the configuration below worked for me on EMR (emr-6.15.0 with Hadoop 3.3.6 and Spark 3.4.1).
General S3A client configuration, from the Hadoop documentation:
<property>
<name>fs.s3a.create.storage.class</name>
<value></value>
<description>
Storage class: standard, reduced_redundancy, intelligent_tiering, etc.
Specify the storage class for S3A PUT object requests.
If not set the storage class will be null
and mapped to default standard class on S3.
</description>
</property>
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
The EMR configuration, in JSON classification format:
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3a.create.storage.class": "intelligent_tiering"
    }
  }
]
And in PySpark, write using the s3a:// scheme:
df.repartition(1).write.format("parquet").mode("append").save("s3a://bucket/path/")
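Outside of EMR classifications, the same S3A property can in principle be set directly on the Hadoop configuration that Spark hands to the S3A connector, since Spark forwards any spark.hadoop.* setting to it. A sketch in Scala, assuming a Hadoop build whose S3A connector supports fs.s3a.create.storage.class (the bucket path and query are placeholders):
import org.apache.spark.sql.SparkSession

// Assumes a Hadoop release whose S3A connector recognises fs.s3a.create.storage.class.
val spark = SparkSession.builder()
  .appName("write-with-storage-class")
  // Spark forwards "spark.hadoop.*" settings to the underlying Hadoop configuration.
  .config("spark.hadoop.fs.s3a.create.storage.class", "intelligent_tiering")
  .getOrCreate()

val df = spark.sql("SELECT 1 AS id")  // placeholder query

// Objects created through s3a:// should now be PUT with the configured storage class.
df.repartition(1).write.format("parquet").mode("append").save("s3a://bucket/path/")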
Upvotes: 0
Reputation: 13430
As of July 2022, this has been implemented in the Hadoop source tree in HADOOP-12020 by AWS S3 engineers.
It is still stabilising and should be out in the next feature release of Hadoop 3.3.x, due late 2022.
Upvotes: 2