Ajay Kumar

Reputation: 618

Saving the pyspark dataframe to multiple AWS S3 buckets

I am currently working on the use case, where

  1. I want to write each partition to a different S3 bucket (a rough sketch of what I have in mind is below).
  2. I want to know: if I write the whole dataframe (around 50 GB) to a single S3 bucket in JSON format, what would the saved data look like in the bucket? More specifically, what would the file names be in S3 when the dataframe is saved?
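
For context, the per-bucket write I have in mind looks roughly like this (the bucket names, the source path, and the "region" partition column are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source path and partition column
df = spark.read.json("s3a://source-bucket/input/")

# Write each distinct value of the partition column to its own bucket
regions = [row["region"] for row in df.select("region").distinct().collect()]
for region in regions:
    (df.filter(df["region"] == region)
       .write
       .mode("overwrite")
       .json(f"s3a://my-bucket-{region}/data/"))
```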

Upvotes: 1

Views: 255

Answers (1)

Robert Kossendey

Reputation: 7028

First of all, why do you want to write each partition to a separate bucket?

To your second question: what gets saved depends on the number of partitions of the dataframe at the time you write it to S3. You can always change that by calling .repartition() on your dataframe. Spark writes one output file per partition using the Hadoop naming convention, so each file name starts with a part- prefix followed by the partition number, similar to: part-00000-<uuid>-c000.json, plus a _SUCCESS marker file.
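
As a minimal sketch (the bucket name and partition count are placeholders, not taken from your setup), controlling the number of output files would look like this:

```python
# df is your dataframe; each partition becomes one part-XXXXX-...json file
# under the target prefix, plus a _SUCCESS marker written by the committer.
(df.repartition(10)
   .write
   .mode("overwrite")
   .json("s3a://my-output-bucket/events/"))
```

With .repartition(10) you would end up with 10 JSON part files under s3a://my-output-bucket/events/.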

Upvotes: 1
