Shivakumar ss

Reputation: 663

Spark partitionBy | save by column value rather than columnName={value}

I am using Scala and Spark; my Spark version is 2.4.3.

My dataframe looks like this (there are other columns which I have not shown and which are not relevant):

+-----------+---------+---------+
|ts_utc_yyyy|ts_utc_MM|ts_utc_dd|
+-----------+---------+---------+
|2019       |01       |20       |
|2019       |01       |13       |
|2019       |01       |12       |
|2019       |01       |19       |
|2019       |01       |19       |
+-----------+---------+---------+

Basically I want to store the data in a partitioned folder layout like

2019/01/12/data

2019/01/13/data

2019/01/19/data

2019/01/20/data

I am using the following code snippet:

    df.write
      .partitionBy("ts_utc_yyyy", "ts_utc_MM", "ts_utc_dd")
      .format("csv")
      .save(outputPath)

But the problem is that the data is getting stored along with the column names in the folder names, like below:

ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=12/data

ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=13/data

ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=19/data

ts_utc_yyyy=2019/ts_utc_MM=01/ts_utc_dd=20/data

How do I save without the column name in the folder name?

Thanks.

Upvotes: 1

Views: 1300

Answers (1)

stereosky

Reputation: 339

This is the expected behaviour. Spark uses Hive partitioning so it writes using this convention, which enables partition discovery, filtering and pruning. In short, it optimises your queries by ensuring that the minimum amount of data is read.
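As a rough illustration of what that buys you (a minimal sketch, assuming the data is read back from the same outputPath used in the question, and that spark is your SparkSession): Spark discovers ts_utc_yyyy, ts_utc_MM and ts_utc_dd as partition columns from the directory names, and a filter on them only scans the matching directories.

    import org.apache.spark.sql.functions.col

    // Partition discovery: the ts_utc_* columns come back from the directory
    // names, not from the CSV files themselves (partitionBy dropped them there).
    val readBack = spark.read.csv(outputPath)

    // Only the ts_utc_dd=19 directories are scanned, thanks to partition pruning.
    // (With default partition type inference the column is read as an integer.)
    readBack.filter(col("ts_utc_dd") === 19).show()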

Spark isn't really designed for the output you need. The easiest way for you to solve this is to have a downstream task that will simply rename the directories by splitting on the equals sign.
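A minimal sketch of such a rename task, assuming the output is on a Hadoop-compatible filesystem and outputPath is the same path passed to save (the helper name and the recursive traversal are illustrative, not part of any Spark API):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Walk the partition directories and strip the "column=" prefix from each
    // name, e.g. "ts_utc_yyyy=2019" -> "2019". Children are renamed before
    // their parent so the paths being renamed are never stale.
    def stripColumnNames(dir: Path): Unit = {
      fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
        stripColumnNames(status.getPath)
        val name    = status.getPath.getName
        val newName = name.split("=").last
        if (name != newName) fs.rename(status.getPath, new Path(dir, newName))
      }
    }

    stripColumnNames(new Path(outputPath))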

Upvotes: 3
