Learnis

Reputation: 566

spark partition strategy comparison between date=dd-mm-yyyy vs yyyy={xxxx}/mm={mm}/dd={xx}

How should I choose a partition strategy for dates in Spark? I have a DataFrame column containing dates in 2020-02-19 format. Should I specify the date column directly in the partition columns when writing, or derive separate yyyy, mm, and dd columns in the table and partition by yyyy, mm, dd?
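To make the two layouts concrete, here is a minimal sketch (plain Python, not Spark itself) of the Hive-style directory path each strategy produces for the same date. In Spark you would get the first layout with `df.write.partitionBy("date")`, and the second by deriving year/month/day columns first and calling `partitionBy("year", "month", "day")`; the helper names below are illustrative, not a Spark API.

```python
from datetime import date

def partition_path_single(d: date) -> str:
    # Strategy 1: one partition column holding the full ISO date.
    return f"date={d.isoformat()}"

def partition_path_multi(d: date) -> str:
    # Strategy 2: three nested partition columns (zero-padded month/day).
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}"

d = date(2020, 2, 19)
print(partition_path_single(d))  # date=2020-02-19
print(partition_path_multi(d))   # year=2020/month=02/day=19
```

Both layouts hold exactly one day of data per leaf directory; only the path structure differs.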

What kinds of issues will I run into with each partition strategy?

Upvotes: 0

Views: 1439

Answers (1)

Thiago Baldim

Reputation: 7742

There is no real gain in choosing one partition date=yyyy-mm-dd over multiple partitions year=yyyy/month=mm/day=dd: if you have to process the last 10 days, both layouts read the same amount of data in the same time. The biggest difference is in how you query and how you maintain your data.

With one single partition column it is easy to write queries for a specific day ("I need to run for 3 days ago") or for a date range ("from 1st of Jan to 1st of May"). Having one partition column with the full date makes your life much easier for that.
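The range case works because ISO-formatted dates sort correctly as strings, so a single BETWEEN-style predicate on the partition column prunes exactly the right partitions. A hypothetical sketch of that pruning over a list of partition values:

```python
# Partition values from a date=yyyy-mm-dd layout; ISO dates compare
# lexicographically in the same order as chronologically, so one range
# predicate selects the partitions directly.
partitions = ["2019-12-31", "2020-01-01", "2020-02-19", "2020-04-30", "2020-05-02"]
selected = [p for p in partitions if "2020-01-01" <= p <= "2020-05-01"]
print(selected)  # ['2020-01-01', '2020-02-19', '2020-04-30']
```

In Spark SQL this is just `WHERE date BETWEEN '2020-01-01' AND '2020-05-01'`, and the optimizer prunes partitions from the predicate.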

Having multiple partition columns makes monthly analysis easy: querying a whole month or a whole year is straightforward. But you lose the ability to query the data with a simple range predicate.
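A small illustration of that trade-off, with partitions modeled as (year, month, day) tuples (hypothetical values, not from the question). A whole-month query is one equality filter; a date range across month or year boundaries cannot be expressed as a single simple predicate per column in SQL; you end up OR-ing together boundary conditions:

```python
parts = [(2019, 12, 31), (2020, 1, 15), (2020, 2, 19), (2020, 5, 2)]

# Whole-month query: trivial with year/month/day partition columns.
feb_2020 = [p for p in parts if p[:2] == (2020, 2)]
print(feb_2020)  # [(2020, 2, 19)]

# A date range needs tuple-wise comparison. Python tuples make this look
# easy, but in SQL over three separate columns the same logic expands into
# an awkward OR of per-boundary conditions instead of one BETWEEN.
lo, hi = (2020, 1, 1), (2020, 5, 1)
in_range = [p for p in parts if lo <= p <= hi]
print(in_range)  # [(2020, 1, 15), (2020, 2, 19)]
```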

Beyond those usability differences, from a performance perspective neither choice creates any overhead: both solutions read the data at the same speed, because neither breaks the data into smaller files. I prefer a single partition column with the full date, since it is easier to maintain from my point of view.

Upvotes: 3
