Reputation: 663
I am using Scala and Spark; my Spark version is 2.4.3.
My DataFrame looks like this; there are other columns which I have not included as they are not relevant.
+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|       2019|       01|       20|
|       2019|       01|       13|
|       2019|       01|       12|
|       2019|       01|       19|
|       2019|       01|       19|
+-----------+---------+---------+
Basically I want to store the data in a partitioned directory layout like
2019/01/12/data
2019/01/13/data
2019/01/19/data
2019/01/20/data
I am using the following code snippet:
df.write
  .partitionBy("ts_utc_yyyy", "ts_utc_MM", "ts_utc_dd")
  .format("csv")
  .save(outputPath)
But the problem is that the data is getting stored with the column name included in the folder name, like below.
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data
ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data
How do I save it without the column name in the folder name?
Thanks.
Upvotes: 1
Views: 1300
Reputation: 339
This is the expected behaviour. Spark uses Hive-style partitioning, so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
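For example, the column=value layout lets Spark rediscover the partition columns when you read the base path back, and skip directories when you filter on them. A minimal sketch, assuming outputPath and spark are the same as in the question (note the partition values may be inferred as integers by default):

import org.apache.spark.sql.functions.col

// Partition discovery: Spark parses ts_utc_yyyy/MM/dd back out of the directory names.
val readBack = spark.read.format("csv").load(outputPath)

// Partition pruning: only the ts_utc_dd=12 directories are scanned for this query.
readBack.filter(col("ts_utc_dd") === 12).show()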
Spark isn't really designed for the output you need. The easiest way to solve this is to have a downstream task that simply renames the directories by splitting on the equals sign.
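Here is a minimal sketch of such a rename task using the Hadoop FileSystem API. It assumes outputPath and spark are the same as in the question; stripPartitionNames is a hypothetical helper, and you may need to adapt it to your filesystem:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Recursively walk the partition directories and strip the "column=" prefix,
// e.g. ".../ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12" -> ".../2019/01/12".
def stripPartitionNames(dir: Path): Unit = {
  fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
    val src = status.getPath
    stripPartitionNames(src) // rename children before moving the parent
    val newName = src.getName.split("=").last
    if (newName != src.getName) {
      fs.rename(src, new Path(src.getParent, newName))
    }
  }
}

stripPartitionNames(new Path(outputPath))

Keep in mind that once the directories are renamed, Spark will no longer recognise the partition columns on read, so do this only if downstream consumers really need the bare 2019/01/12 layout.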
Upvotes: 3