Reputation: 101
I'm porting a streaming job (Kafka topic -> AWS S3 Parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
.withColumn("month", functions.date_format(col("createdAt"), "MM"))
.withColumn("day", functions.date_format(col("createdAt"), "dd"))
.writeStream()
.trigger(Trigger.ProcessingTime("15 seconds"))
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "/some/checkpoint/directory/")
.option("path", "/some/directory/")
.option("truncate", "false")
.partitionBy("year", "month", "day")
.start()
.awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
Question:
Is there a way to customize the output directory name? I need it to be
/s3-bucket/some/directory/2021/01/02/
for backward-compatibility reasons.
Upvotes: 1
Views: 463
Reputation: 18475
No, there is no way to customize the output directory names into that format from within your Spark Structured Streaming application.
Partition directories are named after the columns they were derived from (column=value); without the column names in the path it would be ambiguous which column each value belongs to. You would need a separate application (or post-processing step) that renames those directories into the desired format.
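For example, a small cleanup job could walk the year=YYYY/month=MM/day=DD directories and rename them to YYYY/MM/DD. Below is a minimal sketch using Hadoop's FileSystem API (which ships with Spark) over the s3a connector; the bucket, prefix, and class name are placeholders, and it assumes the three-level layout shown in your question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionDirRenamer {

    public static void main(String[] args) throws Exception {
        // Placeholder bucket/prefix -- adjust to your actual output path.
        String root = "s3a://s3-bucket/some/directory";
        FileSystem fs = FileSystem.get(new URI(root), new Configuration());

        // Walk year=YYYY/month=MM/day=DD and rename each leaf to YYYY/MM/DD.
        for (FileStatus year : fs.listStatus(new Path(root))) {
            if (!year.getPath().getName().startsWith("year=")) {
                continue; // skip _spark_metadata and anything else at the root
            }
            for (FileStatus month : fs.listStatus(year.getPath())) {
                for (FileStatus day : fs.listStatus(month.getPath())) {
                    Path target = new Path(root + "/"
                            + stripColumnName(year.getPath().getName()) + "/"
                            + stripColumnName(month.getPath().getName()) + "/"
                            + stripColumnName(day.getPath().getName()));
                    fs.mkdirs(target.getParent());
                    fs.rename(day.getPath(), target);
                }
            }
        }
    }

    // "year=2021" -> "2021"; names without '=' are returned unchanged.
    private static String stripColumnName(String dirName) {
        int i = dirName.indexOf('=');
        return i >= 0 ? dirName.substring(i + 1) : dirName;
    }
}

Keep in mind that on S3 a rename is implemented as copy-and-delete, so this is neither atomic nor cheap for large partitions, and downstream readers will no longer get automatic partition discovery once the column=value naming is gone.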
Upvotes: 0