Avinash Clinton

Reputation: 543

Partitioning data in an RDD and saving the partitioned chunks

Hi, I have the following RDD.

Header:

id|category|date|name|age

Contents of the RDD:

1|b|12-10-2015|David|20
2|c|12-10-2015|Moses|40
3|b|18-12-2016|Tom|30
4|c|18-12-2016|Bill|60

I want to partition the data by category and date and save each partition as its own file, named as follows:

12102015_b

1|b|12-10-2015|David|20

12102015_c

2|c|12-10-2015|Moses|40

18122016_b

3|b|18-12-2016|Tom|30

18122016_c

4|c|18-12-2016|Bill|60

Can I get any suggestions for this? Thanks in advance!

Upvotes: 0

Views: 259

Answers (1)

Neeraj Bhadani

Reputation: 3100

Suppose all of the above data is in a PySpark DataFrame df.

You can then use the statement below to partition the data by date and then by category (choose the column order that suits your business logic) and save the DataFrame in any of several formats. This example uses CSV.

df.write.partitionBy("date", "category").csv("location_of_path")

See the Spark documentation for csv, parquet, and partitionBy for reference.
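Note that partitionBy("date", "category") writes nested directories named like date=12-10-2015/category=b, not names such as 12102015_b. If names closer to the ones in the question are needed, one option is to derive a single combined column and partition on that. The following is only a minimal sketch under that assumption: the helper names chunk_name and write_chunks and the column name chunk are hypothetical, and the DataFrame is assumed to have string columns date and category as in the question.

```python
def chunk_name(date: str, category: str) -> str:
    """Plain-Python version of the naming rule: '12-10-2015', 'b' -> '12102015_b'."""
    return date.replace("-", "") + "_" + category


def write_chunks(df, output_path: str) -> None:
    """Write one sub-directory per (date, category) pair of a PySpark DataFrame."""
    # Imported here so chunk_name can be used without Spark installed.
    from pyspark.sql import functions as F

    # Same rule as chunk_name, expressed with Spark column functions.
    chunk = F.concat_ws("_", F.regexp_replace("date", "-", ""), F.col("category"))

    (df.withColumn("chunk", chunk)   # hypothetical helper column
       .write
       .partitionBy("chunk")         # one directory per distinct chunk value
       .option("sep", "|")           # keep the pipe delimiter from the question
       .csv(output_path))
```

Calling write_chunks(df, "location_of_path") should produce directories such as chunk=12102015_b and chunk=18122016_c. Spark drops the partition column from the data files, so each part file should still hold only the original five columns.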

Hope this helps.

Regards,

Neeraj

Upvotes: 2
