Adrian David Smith

Reputation: 598

Delta Standalone - Scan for Specific Data

I'm using the Delta Standalone library to read data from a Delta Table. My goal is to assert that specific data was processed and persisted by an upstream function/service.

I am able to do this by naively retrieving all files as shown by the function below:

import io.delta.standalone.DeltaLog
import io.delta.standalone.data.RowRecord

fun retrieveRecords(): List<RowRecord> {
    val log = DeltaLog.forTable(configuration, DELTA_TABLE_LOCATION)
    val snapshot = log.snapshot()
    // open() streams every RowRecord in the snapshot, with no pruning at all
    val iter = snapshot.open()
    val rows = mutableListOf<RowRecord>()
    while (iter.hasNext()) {
        rows.add(iter.next())
    }
    iter.close()
    return rows
}

This, however, does not scale. I've seen in the Delta Standalone documentation that the scan() operation looks promising, stating that it can be used to do the following:

Access the files that match the partition filter portion of the readPredicate with DeltaScan::getFiles. This returns a memory-optimized iterator over the metadata files in the table.

To further filter the returned files on non-partition columns, get the portion of input predicate not applied with DeltaScan::getResidualPredicate.
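
If I'm reading that right, scan() only prunes on the partition-filter portion of the predicate, and whatever could not be pushed down comes back to be applied per row. A rough sketch of my reading (untested; residualPredicate is the Kotlin accessor for the documented DeltaScan::getResidualPredicate, and the column name is made up):

import io.delta.standalone.expressions.Column
import io.delta.standalone.expressions.EqualTo
import io.delta.standalone.expressions.Literal
import io.delta.standalone.types.StringType

val scan = log.snapshot().scan(
    EqualTo(Column("partitioned_col_1", StringType()), Literal.of("partition_val_1"))
)
val files = scan.files                    // partition-pruned AddFile metadata
val residual = scan.residualPredicate     // Optional<Expression> I still have to apply myself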

From my understanding, I should be able to use this scan() operation, pass a predicate over the partitioned columns, access the returned files (essentially their path attribute), and do some conversion from there.

I'm struggling to:

  1. Pass the partitions as the required Expression type in the scan() operation; and
  2. Find a way to then convert the retrieved file paths into RowRecords to do the assertions.
val iter = log.snapshot().scan(
    EqualTo(
        Column("partitioned_col_1", StringType()),
        Literal.of("partition_val_1"),
    )
    // Scan expects only a single Expression
    // Multiple Expressions are not allowed
    // ,
    // EqualTo(
    //    Column("partitioned_col_2", StringType()),
    //    Literal.of("partition_val_2"),
    // ),
    // EqualTo(
    //    Column("partitioned_col_3", StringType()),
    //    Literal.of("partition_val_3"),
    // )
).files

val paths = mutableListOf<String>()
while (iter.hasNext()) {
    paths.add(iter.next().path)
}

// for path in paths:
//     read and convert into RowRecord

I seem to be missing a few key pieces of information here. Any advice would be greatly appreciated.

Upvotes: 1

Views: 634

Answers (1)

deo

Reputation: 936

You have to use CloseableParquetDataIterator to iterate over the scanned files as RowRecords. It lives in an internal package, so you can't access it from outside directly. Here is a hack to access it via reflection, tested.

import io.delta.standalone.DeltaLog
import io.delta.standalone.data.{CloseableIterator, RowRecord}
import io.delta.standalone.expressions.{EqualTo, Literal}

import scala.collection.JavaConverters._

val deltaLog = DeltaLog.forTable(configuration, DELTA_TABLE_LOCATION)
val snapshot = deltaLog.snapshot()

// Partition-pruned listing of the files matching partitioned_col_1 = "partition_val_1"
val allFiles = snapshot.scan(
  new EqualTo(
    snapshot.getMetadata.getSchema.column("partitioned_col_1"),
    Literal.of("partition_val_1"))).getFiles

// CloseableParquetDataIterator wants a (path, partitionValues) pair per file
val filesAndPartitions = allFiles.asScala.toSeq.map { add =>
  (DELTA_TABLE_LOCATION + "/" + add.getPath, add.getPartitionValues.asScala.toMap)
}

// The class is private[internal], so instantiate it reflectively
val rowIterator = Class
  .forName("io.delta.standalone.internal.data.CloseableParquetDataIterator")
  .getConstructor(
    classOf[Seq[(String, Map[String, String])]],
    snapshot.getMetadata.getSchema.getClass,
    classOf[java.util.TimeZone],
    classOf[org.apache.hadoop.conf.Configuration])
  .newInstance(
    filesAndPartitions,
    snapshot.getMetadata.getSchema,
    null,
    new org.apache.hadoop.conf.Configuration())
  .asInstanceOf[CloseableIterator[RowRecord]]

// your result here
while (rowIterator.hasNext) {
  println(rowIterator.next)
}
rowIterator.close()

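On the question's first point (passing several partition filters): scan() takes a single Expression, but Delta Standalone's expressions package includes And, so multiple filters can presumably be nested into one conjunction. A sketch in the question's Kotlin (same column names as the question; untested):

import io.delta.standalone.expressions.And
import io.delta.standalone.expressions.EqualTo
import io.delta.standalone.expressions.Literal

val schema = snapshot.metadata.schema
// Nest binary Ands to express col1 = v1 AND col2 = v2 AND col3 = v3
val combined = And(
    And(
        EqualTo(schema.column("partitioned_col_1"), Literal.of("partition_val_1")),
        EqualTo(schema.column("partitioned_col_2"), Literal.of("partition_val_2"))
    ),
    EqualTo(schema.column("partitioned_col_3"), Literal.of("partition_val_3"))
)
val prunedFiles = snapshot.scan(combined).files
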
Upvotes: 0
