Netrunner

Reputation: 61

Scala Spark Iceberg writeStream. How to set bucket?

I'm trying to write data to an Iceberg table with Spark Structured Streaming (written in Scala).

Writer code:

    val streamResult = joined.writeStream
      .format("iceberg")
      .partitionBy("column1", "column2")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
      .option("path", outputTable)
      .option("checkpointLocation", "s3://checkpointLocation")
      .option("fanout-enabled", "true")
      .start()
      .awaitTermination()

and I'm getting this error:

org.apache.spark.sql.AnalysisException: bucket(20, columnX) is not currently supported

I know that on a DataFrame write you can use the 'bucketBy' method, but how can I achieve the same with writeStream?

Versions:
Iceberg: 1.3.0
Spark: 3.3.2
Scala: 2.12

Data is read from Iceberg tables, and I'm expecting it to appear in the output Iceberg table.

Edited:
Iceberg output table schema - partitioning and bucketing

    "partition-specs" : [ {
      "spec-id" : 0,
      "fields" : [ {
        "name" : "columnX_bucket",
        "transform" : "bucket[20]",
        "source-id" : 242,
        "field-id" : 1000
      }, {
        "name" : "column4_day",
        "transform" : "day",
        "source-id" : 10,
        "field-id" : 1001
      } ]
    } ]

Edited2:
Check my answer below, but it comes with another question.

Upvotes: 1

Views: 643

Answers (1)

Netrunner

Reputation: 61

I've tried to set the bucket by using foreachBatch, and it seems the bucket problem is solved. However, the 'write' method doesn't allow setting a days(column4) partition (I can only pass column names as strings); partitionBy sets identity(column4).

    val streamResult = joined.writeStream
      .foreachBatch((batchDF: DataFrame, batchId: Long) =>
        batchDF.write
          .bucketBy(20, "columnX")
          .format("iceberg")
          .mode(SaveMode.Append)
          .partitionBy("column4")
          .option("txnVersion", batchId)
          .option("txnAppId", "appId")
          .saveAsTable(table))
      .start()
      .awaitTermination()

I see that the writeTo method allows passing Columns instead of strings, so

.partitionedBy(bucket(20, col("columnX")), days(col("column4")))

could help, but writeTo only offers createOrReplace, and I cannot append the data without further changes to the table itself. There is append(), but it doesn't allow setting partitionedBy.
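One avenue worth trying (a sketch, not verified against the question's setup): since the target table already carries the bucket[20] and day partition spec in its metadata, Iceberg should distribute appended rows according to that spec without the writer re-declaring it, so foreachBatch combined with writeTo(...).append() may be enough:

    val streamResult = joined.writeStream
      .foreachBatch((batchDF: DataFrame, batchId: Long) =>
        // The partition spec (bucket(20, columnX), day(column4)) lives in the
        // table metadata, so append() should not need partitionedBy at all.
        batchDF.writeTo(table)
          .option("txnVersion", batchId)
          .option("txnAppId", "appId")
          .append())
      .start()
      .awaitTermination()

The txnVersion/txnAppId options are carried over from the code above to keep the per-batch writes idempotent on retries.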

  • writeStream doesn't set any bucket
  • write can't set days()
  • writeTo doesn't allow appending data with a defined bucket

How can I solve this so that the full partitioning is applied, with both the bucket and the days column?

Logs:

  • provided: identity(column4), bucket(20, columnX)
  • table: bucket(20, columnX), days(column4)

Upvotes: 0
