Reputation: 808
I'm creating an incremental table in Dataform, but want to change how duplicate rows are merged. The default behavior is for duplicate rows to overwrite existing rows in the table (Merge rows in an incremental table). Instead, I would like to sum values in the new duplicate rows to the existing rows' values.
For example, consider the source table below. Here, sales can may be added to the table after the date that they occur (insert_date
= date the sales were added to table, transaction_date
= date the sales occurred).
insert_date | transaction_date | sales_quantity |
---|---|---|
2023-12-01 | 2023-12-01 | 10 |
2023-12-02 | 2023-12-02 | 20 |
2023-12-03 | 2023-12-01 | 5 |
I would like to incrementally (based on insert_date
) generate a table that shows the total sales_quantity
on a given transaction_date
, where transaction_date
is the unique key. The result would be as follows.
transaction_date | sales_quantity |
---|---|
2023-12-01 | 15 |
2023-12-02 | 20 |
So I would like to sum the sales_quantity
for duplicate rows, rather than overwrite it. But with the default behavior running on the sample code below, if I ran the incremental execution on 2023-12-01, 2023-12-02, and 2023-12-03 (where future insert_date
rows don't exist in the source table on the date of the execution), I would end up with the following results at the end. During the last execution, the new row with a transaction_date
of 2023-12-01 overwrote the previously added row, instead of adding to it.
SQLX file:
config {
type: "incremental",
uniqueKey: ["transaction_date"]
}
pre_operations {
DECLARE insert_date_checkpoint DEFAULT (
${when(incremental(),
`SELECT MAX(insert_date) FROM ${self()}`,
`SELECT DATE("2000-01-01")`)}
)
}
SELECT
transaction_date,
SUM(sales_quantity) AS sales_quantity
FROM
${ref("table")}
WHERE
insert_date > insert_date_checkpoint
GROUP BY
transaction_date
Results:
transaction_date | sales_quantity |
---|---|
2023-12-01 | 5 |
2023-12-02 | 20 |
Is there a way to update the merge logic associated with an incremental table to use something other than overwrite? Or would I need to create a specific operations SQLX file to accomplish this?
Doing a full refresh of the data every time wouldn't be a good solution since the sales table can be very large/costly and I want to take advantage of its partitioning on the insert_date
column.
Upvotes: 1
Views: 935
Reputation: 307
You are right about the behavior of dataform overwriting duplicate rows but what you are trying to achieve is tricky. I haven't done it myself to be honest but the answer from this post might interest you.
The code snippets from this post have similarities to yours and I believe you can pickup from here.
Kindly Look into this document as well as this was mentioned in the same post.
Upvotes: 0