Reputation: 695
I have a dataframe with an account_id column. I want to group the rows by each distinct account_id and write each group to a different S3 bucket. Writing to a separate folder per account_id within a single S3 bucket would work too.
Upvotes: 0
Views: 784
Reputation: 5124
If you want all rows with the same account_id to end up in one folder, you can achieve this with the partitionBy function. Below is an example that groups the rows by account_id and writes them in Parquet format to separate folders. You can change the write mode depending on your use case.
df.write.mode("overwrite").partitionBy('account_id').parquet('s3://mybucket/')
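For context, here is a minimal self-contained sketch of that write. The SparkSession setup and the sample columns (txn_id, account_id, amount) are hypothetical stand-ins; only the bucket path s3://mybucket/ comes from the snippet above, and writing to s3:// also assumes the cluster has the S3 connector and credentials configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-account").getOrCreate()

# Hypothetical sample data standing in for the real account table.
df = spark.createDataFrame(
    [(1, "a1", 100.0), (2, "a1", 250.0), (3, "a2", 75.0)],
    ["txn_id", "account_id", "amount"],
)

# Creates one folder per distinct account_id under the bucket,
# e.g. s3://mybucket/account_id=a1/ and s3://mybucket/account_id=a2/.
df.write.mode("overwrite").partitionBy("account_id").parquet("s3://mybucket/")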
If you want to partition by multiple columns, add them to the partitionBy call. For example, if you have a date column with values in the format yyyy/mm/dd, the snippet below will create date folders nested inside each account_id folder.
df.write.mode("overwrite").partitionBy('account_id','date').parquet('s3://mybucket/')
This will write files to S3 in the following layout:
s3://mybucket/account_id=somevalue/date=2020/11/01
s3://mybucket/account_id=somevalue/date=2020/11/02
s3://mybucket/account_id=somevalue/date=2020/11/03
...
s3://mybucket/account_id=somevalue/date=2020/11/30
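As a rough follow-up sketch (reusing the session and bucket from the example above, with a hypothetical account_id value), this layout also lets Spark prune partitions on read, so a filter on account_id only scans the matching folders:

# Partition pruning: only folders under account_id=somevalue are scanned.
subset = spark.read.parquet("s3://mybucket/").where("account_id = 'somevalue'")
subset.show()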
Upvotes: 2