Matt Kocak
Matt Kocak

Reputation: 808

Use custom merge logic for incremental table in Dataform

I'm creating an incremental table in Dataform, but want to change how duplicate rows are merged. The default behavior is for duplicate rows to overwrite existing rows in the table (Merge rows in an incremental table). Instead, I would like to sum values in the new duplicate rows to the existing rows' values.

For example, consider the source table below. Here, sales can may be added to the table after the date that they occur (insert_date = date the sales were added to table, transaction_date = date the sales occurred).

insert_date transaction_date sales_quantity
2023-12-01 2023-12-01 10
2023-12-02 2023-12-02 20
2023-12-03 2023-12-01 5

I would like to incrementally (based on insert_date) generate a table that shows the total sales_quantity on a given transaction_date, where transaction_date is the unique key. The result would be as follows.

transaction_date sales_quantity
2023-12-01 15
2023-12-02 20

So I would like to sum the sales_quantity for duplicate rows, rather than overwrite it. But with the default behavior running on the sample code below, if I ran the incremental execution on 2023-12-01, 2023-12-02, and 2023-12-03 (where future insert_date rows don't exist in the source table on the date of the execution), I would end up with the following results at the end. During the last execution, the new row with a transaction_date of 2023-12-01 overwrote the previously added row, instead of adding to it.

SQLX file:

config {
  type: "incremental",
  uniqueKey: ["transaction_date"]
}

pre_operations {
  DECLARE insert_date_checkpoint DEFAULT (
    ${when(incremental(),
    `SELECT MAX(insert_date) FROM ${self()}`,
    `SELECT DATE("2000-01-01")`)}
  )
}

SELECT
  transaction_date,
  SUM(sales_quantity) AS sales_quantity
FROM
  ${ref("table")}
WHERE
  insert_date > insert_date_checkpoint
GROUP BY
  transaction_date

Results:

transaction_date sales_quantity
2023-12-01 5
2023-12-02 20

Is there a way to update the merge logic associated with an incremental table to use something other than overwrite? Or would I need to create a specific operations SQLX file to accomplish this?

Doing a full refresh of the data every time wouldn't be a good solution since the sales table can be very large/costly and I want to take advantage of its partitioning on the insert_date column.

Upvotes: 1

Views: 935

Answers (1)

Mel
Mel

Reputation: 307

You are right about the behavior of dataform overwriting duplicate rows but what you are trying to achieve is tricky. I haven't done it myself to be honest but the answer from this post might interest you.

The code snippets from this post have similarities to yours and I believe you can pickup from here.

Kindly Look into this document as well as this was mentioned in the same post.

Upvotes: 0

Related Questions