alpacafondue
alpacafondue

Reputation: 391

Recursive CTE to remove duplicates

I'm looking to clean up event data that happens to have "duplicate" rows for a given day. I want to remove rows for a day that have more than one status based on the context of the next day's status value. Currently, I am using BigQuery and multiple CTE steps with self joins to iterate through days with multiple events to eventually "true up" every day to have a single status value.

I have tried using recursive CTEs with self joins, various window functions, etc without much luck. BigQuery doesn't allow analytic functions in recursive CTEs, including GROUP BYs :(

See below for an example of 2 iterations:

# data has multiple instances of days with more than one status (* = duplicate)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active   |

# first iteration with removed rows (**)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active   |

# second iteration with removed rows (***)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active   |

# final desired set of deduplicated rows
| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |

I can imagine having to iterate N-times given the size of the data. Is there a recursive approach to this problem in SQL? Thanks!

Upvotes: 1

Views: 62

Answers (1)

keithwalsh
keithwalsh

Reputation: 814

  • CTE "a" sets status to NULL for dates with multiple statuses.
  • CTE "b" uses FIRST_VALUE to find next known status for dates with NULL status.
WITH a AS (
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
),
b AS (
  SELECT
    date,
    COALESCE(
      status,
      FIRST_VALUE(status IGNORE NULLS) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      )
    ) AS final_status
  FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;

Output:

date status
2024-11-01 inactive
2024-11-02 inactive
2024-11-03 active
2024-11-04 active
2024-11-05 active

Upvotes: 1

Related Questions