pnhegde

Reputation: 695

How to write to multiple S3 buckets based on distinct values of a dataframe in an AWS Glue job?

I have a dataframe with an account_id column. I want to group rows by distinct account_id and write each group to a different S3 bucket. Writing to a separate folder for each account_id within a single S3 bucket would also work.

Upvotes: 0

Views: 784

Answers (1)

Prabhakar Reddy

Reputation: 5124

If you want all rows with the same account_id to end up in one folder, you can achieve this with the partitionBy function. The example below groups rows by account_id and writes them in Parquet format to separate folders. You can change the mode depending on your use case.

df.write.mode("overwrite").partitionBy('account_id').parquet('s3://mybucket/')

If you want to partition on multiple columns, add them to the partitionBy function. For example, if you have a date column with values in yyyy/mm/dd format, the snippet below will create date folders inside each account_id folder.

df.write.mode("overwrite").partitionBy('account_id','date').parquet('s3://mybucket/')

This will write files to S3 in the following layout:

s3://mybucket/account_id=somevalue/date=2020/11/01
s3://mybucket/account_id=somevalue/date=2020/11/02
s3://mybucket/account_id=somevalue/date=2020/11/03
......
s3://mybucket/account_id=somevalue/date=2020/11/30
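
For completeness, here is a rough sketch of how this could look inside a Glue job script, assuming the data is read from the Glue Data Catalog; the database name my_database and table name my_table are placeholders. The same partitionBy write applies once the DynamicFrame is converted to a Spark DataFrame.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog (placeholder names)
# and convert the DynamicFrame to a Spark DataFrame to use partitionBy.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table"     # placeholder
)
df = dyf.toDF()

# One output folder per distinct account_id under the bucket.
df.write.mode("overwrite").partitionBy("account_id").parquet("s3://mybucket/")

job.commit()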

Upvotes: 2
