Reputation: 543
Hi, I have the following RDD:
Header :
id|category|date|name|age
contents of rdd
1|b|12-10-2015|David|20
2|c|12-10-2015|Moses|40
3|b|18-12-2016|Tom|30
4|c|18-12-2016|Bill|60
I want to partition the data by category and date and save the files as follows:
12102015_b
1|b|12-10-2015|David|20
12102015_c
2|c|12-10-2015|Moses|40
18122016_b
3|b|18-12-2016|Tom|30
18122016_c
4|c|18-12-2016|Bill|60
Any suggestions would be appreciated. Thanks in advance!
Upvotes: 0
Views: 259
Reputation: 3100
Suppose you have all of the above data in a PySpark DataFrame df.
Then you can use the statement below to partition the data by date and then category (you can choose the order based on your business logic) and save the DataFrame in any of several formats; the example below uses CSV.
df.write.partitionBy("date", "category").csv("location_of_path")
You can find reference documentation for csv, parquet, and partitionBy.
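One caveat: partitionBy writes Hive-style directories such as date=12-10-2015/category=b/part-*.csv rather than single files named 12102015_b. If you need file names in exactly that format, here is a plain-Python sketch of the same grouping idea (the helper name partition_key and the local-file writing are illustrative, not part of Spark; on a real cluster you would collect each partition or use foreachPartition instead):

```python
import csv
from collections import defaultdict

# Sample rows matching the question's data: id|category|date|name|age
rows = [
    ["1", "b", "12-10-2015", "David", "20"],
    ["2", "c", "12-10-2015", "Moses", "40"],
    ["3", "b", "18-12-2016", "Tom", "30"],
    ["4", "c", "18-12-2016", "Bill", "60"],
]

def partition_key(row):
    """Build the ddMMyyyy_category key, e.g. '12102015_b'."""
    _, category, date, _, _ = row
    return date.replace("-", "") + "_" + category

# Group rows by their computed key.
groups = defaultdict(list)
for row in rows:
    groups[partition_key(row)].append(row)

# Write one pipe-delimited file per group, named after the key.
for key, group in groups.items():
    with open(key, "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(group)
```

This produces four local files named 12102015_b, 12102015_c, 18122016_b, and 18122016_c, each containing its matching row.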
Hope this helps.
Regards,
Neeraj
Upvotes: 2