Reputation: 1
Here is my Spark and Iceberg configuration in the POM (I am sure these dependencies are correct, because I can insert data into an Iceberg table):
<!-- spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<!-- iceberg -->
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-runtime-3.1_2.12</artifactId>
    <version>1.2.0</version>
</dependency>
I want to test the expireSnapshots and deleteOrphanFiles APIs. Before running them, I had already inserted data into an Iceberg table successfully. The following is the directory structure:
$ tree db
db
└── table_name
    ├── data
    │   └── year=2024
    │       └── month=04
    │           └── day=07
    │               └── dataType=PD
    │                   └── 00000-0-04211945-446e-4d17-83c4-7e8447f5e4e4-00001.parquet
    └── metadata
        ├── d2d68629-eb2f-4582-b2ed-ebb17f923f72-m0.avro
        ├── snap-8662509695078634239-1-d2d68629-eb2f-4582-b2ed-ebb17f923f72.avro
        ├── v1.metadata.json
        ├── v2.metadata.json
        └── version-hint.text
There are two metadata files because I first created the table and then inserted data into the partitions. The following is my program. My reasoning is: first I list all snapshot IDs, and there should be only one, since I inserted data only once. Then I call expireSnapshots, which should mark all snapshots as orphan and remove them from the metadata files. After that, I call deleteOrphanFiles, which should actually delete the data files. If I then refresh the table and list the snapshot IDs, I expect an empty list, and all files under the data folder should be deleted. But I still see these data files and can still query the snapshot ID. Can you help explain why? (I can make sure that no other Spark jobs, branches, or tags are using this table, and I inserted the data 7 hours ago.)
import scala.collection.JavaConverters.iterableAsScalaIterableConverter
import scala.concurrent.duration.DurationInt
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession
object ExpireSnapshotDemo {
  def main(args: Array[String]): Unit = {
    @transient implicit val spark = SparkSession
      .builder()
      .master("local[*]")
      .config("spark.driver.bindAddress", "127.0.0.1")
      .appName("IcebergTableCreationExample")
      .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      )
      .config(
        "spark.sql.catalog.local",
        "org.apache.iceberg.spark.SparkCatalog"
      )
      .config("spark.sql.catalog.local.type", "hadoop")
      .config(
        "spark.sql.catalog.local.warehouse",
        "/Users/Code/spark/iceberg"
      )
      .getOrCreate()

    val catalog = new HadoopCatalog(
      new Configuration(),
      "/Users/Code/spark/iceberg"
    )
    // with a Hadoop catalog, only the namespace and table name are needed
    val tableIdentifier = TableIdentifier.of("db", "table_name")
    // val tableFullName = "local.db.table_name"
    val table = catalog.loadTable(tableIdentifier)
    println(table.snapshots().asScala.map(_.snapshotId()).toList)

    val expireTimeMilliseconds = System.currentTimeMillis() - 2.hours.toMillis
    println(expireTimeMilliseconds)

    // remove expired snapshots from the table metadata
    SparkActions
      .get(spark)
      .expireSnapshots(table)
      .expireOlderThan(expireTimeMilliseconds)
      .execute()
    table.refresh()
    println(table.snapshots().asScala.map(_.snapshotId()).toList)

    // delete files that are no longer referenced by any table metadata
    SparkActions
      .get(spark)
      .deleteOrphanFiles(table)
      .olderThan(expireTimeMilliseconds)
      .execute()
    table.refresh()
    println(table.snapshots().asScala.map(_.snapshotId()).toList)
  }
}
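One way to see which snapshots expireOlderThan will actually consider is to print each snapshot's commit time next to the computed cutoff. A minimal sketch, reusing table and expireTimeMilliseconds from the program above:
// Print every snapshot's commit timestamp so it can be compared with the expiry cutoff.
table.snapshots().asScala.foreach { snapshot =>
  println(s"snapshot ${snapshot.snapshotId()} committed at ${snapshot.timestampMillis()}, cutoff $expireTimeMilliseconds")
}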
Upvotes: 0
Views: 1270
Reputation: 1
With the suggestion from manuzhang, I solved this problem. In Iceberg, at least one snapshot must always be kept.
I inserted 2024-04-08 data into the table, so it now has two snapshots. Then I expired yesterday's snapshot, and the related metadata and data files were deleted.
# before the expire & delete operation
$ tree db
db
└── table_name
    ├── data
    │   └── year=2024
    │       └── month=04
    │           ├── day=07
    │           │   └── dataType=PD
    │           │       └── 00000-0-04211945-446e-4d17-83c4-7e8447f5e4e4-00001.parquet
    │           └── day=08
    │               └── dataType=SFA
    │                   └── 00000-0-3aee6238-bd77-4e12-8112-d524613ee43b-00001.parquet
    └── metadata
        ├── bd91dae7-82ef-4ca6-86b8-525d49ecf6c6-m0.avro
        ├── bd91dae7-82ef-4ca6-86b8-525d49ecf6c6-m1.avro
        ├── d2d68629-eb2f-4582-b2ed-ebb17f923f72-m0.avro
        ├── snap-8158653185919657152-1-bd91dae7-82ef-4ca6-86b8-525d49ecf6c6.avro
        ├── snap-8662509695078634239-1-d2d68629-eb2f-4582-b2ed-ebb17f923f72.avro
        ├── v1.metadata.json
        ├── v2.metadata.json
        ├── v3.metadata.json
        └── version-hint.text
# after the expire and delete operation
db
└── table_name
    ├── data
    │   └── year=2024
    │       └── month=04
    │           ├── day=07
    │           │   └── dataType=PD
    │           └── day=08
    │               └── dataType=SFA
    │                   └── 00000-0-3aee6238-bd77-4e12-8112-d524613ee43b-00001.parquet
    └── metadata
        ├── bd91dae7-82ef-4ca6-86b8-525d49ecf6c6-m0.avro
        ├── bd91dae7-82ef-4ca6-86b8-525d49ecf6c6-m1.avro
        ├── snap-8158653185919657152-1-bd91dae7-82ef-4ca6-86b8-525d49ecf6c6.avro
        ├── v1.metadata.json
        ├── v2.metadata.json
        ├── v3.metadata.json
        ├── v4.metadata.json
        └── version-hint.text
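A minimal sketch of how the older snapshot can be expired while the newest one is kept (reusing the spark session and table from the program in the question; the helper name expireAllButLatest and the retainLast(1) call are my own additions, not part of the original program):
import scala.collection.JavaConverters.iterableAsScalaIterableConverter
import org.apache.iceberg.Table
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession

// Expire every snapshot committed before the newest one, but explicitly retain the
// current snapshot so the table always keeps a valid state to read from.
def expireAllButLatest(spark: SparkSession, table: Table): Unit = {
  val latestCommitMs = table.currentSnapshot().timestampMillis()
  SparkActions
    .get(spark)
    .expireSnapshots(table)
    .expireOlderThan(latestCommitMs) // cutoff right at the newest commit
    .retainLast(1)                   // keep at least the current snapshot
    .execute()
  table.refresh()
  println(table.snapshots().asScala.map(_.snapshotId()).toList)
}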
Then I asked ChatGPT why at least one snapshot must be kept. Here is the answer from ChatGPT:
In Apache Iceberg, the expireSnapshots operation is used to clean up old snapshots that are no longer needed. This is important for maintaining a manageable metadata size and for removing data files that are no longer referenced by any valid snapshot, thus freeing up storage space.
The reason why expireSnapshots must keep at least one snapshot is related to the core design of Iceberg, which relies on snapshots to provide a consistent view of the table at different points in time. Each snapshot represents the state of the table at the time it was taken and includes metadata pointing to the data files that make up the table at that moment.
Here are a few reasons why keeping at least one snapshot is important:
Consistency and Recoverability: The most recent snapshot is the one that represents the current state of the table. If all snapshots were deleted, there would be no reference to the current data files, and the table would effectively lose its current state. Keeping at least one snapshot ensures that there is always a consistent and recoverable state of the table.
Incremental Processing: Some data processing jobs rely on the ability to read the differences between snapshots to process data incrementally. If the last snapshot were removed, these jobs would not have a base snapshot to compare against for identifying new changes.
Auditing and Time Travel: Iceberg supports time travel queries, which allow users to query the table as it was at a specific point in time. This is only possible if there is a snapshot to reference that historical state. Without at least one snapshot, time travel to the most recent state would not be possible.
Referential Integrity: In some cases, downstream systems or processes may hold references to a specific snapshot for consistency guarantees. Removing all snapshots could break these references and potentially lead to data integrity issues.
Therefore, expireSnapshots is designed to prevent the deletion of the current (most recent) snapshot. Users can still remove older snapshots to clean up the history and associated data files, but the current state of the table is preserved.
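As an illustration of the time-travel point: a query pinned to a snapshot only works while that snapshot still exists. A small sketch, using the snapshot ID and warehouse path from the listings above purely as example values:
// Read the table as of a specific snapshot; this fails once that snapshot has been expired.
val asOfSnapshot = spark.read
  .format("iceberg")
  .option("snapshot-id", "8158653185919657152")
  .load("/Users/Code/spark/iceberg/db/table_name")
asOfSnapshot.show()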
Upvotes: 0