Sean
Sean

Reputation: 573

Understanding window functions to deduplicate records while retaining true changes

I asked this in dba stack exchange but haven't had any luck. Cross-posting.

SQLFIDDLE

I am close to figuring this out but I'm just stuck at a wall. I'm attempting to understand a post by Aaron Betrand and apply it to a situation I've encountered where I have a changes table that's heavily duplicated due to prior design error I'm inheriting. The sample data set is identical in concept to my real data set, except SortOrder would usually be a datetime value and not an integer. The code I've tried is here:

; with main as (
   select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID,
      SortOrder) as "Rank",
      row_number() over (partition by ID, val order by ID, sortorder) as "s_rank" 
   from 
      (values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
              (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), 
              (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
      ) as x("ID", "VAL", "SortOrder")
   group by id, val, SortOrder
   --order by ID, "SortOrder"
),
cte_rest as (
   select *
   from main
   where "s_rank" > 1
)

select *
from main
left join cte_rest rest
   on main.id = rest.id
   and main.s_rank > 1
   and main.SortOrder = rest.SortOrder
--where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder

The results are almost valid; however, the last row highlights a situation that I haven't been able to account for: the date changes, the value doesn't. I want the last record to be excluded because it's not a true value change.

╔════╦═════╦═══════════╦══════╦════════╦══════╦══════╦═══════════╦══════╦════════╗
║ ID ║ VAL ║ SortOrder ║ Rank ║ s_rank ║  ID  ║ VAL  ║ SortOrder ║ Rank ║ s_rank ║
╠════╬═════╬═══════════╬══════╬════════╬══════╬══════╬═══════════╬══════╬════════╣
║  1 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ C   ║         3 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  1 ║ B   ║         4 ║    1 ║      2 ║ 1    ║ B    ║ 4         ║ 1    ║ 2      ║
║  1 ║ A   ║         5 ║    1 ║      2 ║ 1    ║ A    ║ 5         ║ 1    ║ 2      ║
║  2 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ B   ║         2 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  2 ║ A   ║         3 ║    1 ║      2 ║ 2    ║ A    ║ 3         ║ 1    ║ 2      ║
║  3 ║ A   ║         1 ║    1 ║      1 ║ NULL ║ NULL ║ NULL      ║ NULL ║ NULL   ║
║  3 ║ A   ║         2 ║    1 ║      2 ║ 3    ║ A    ║ 2         ║ 1    ║ 2      ║
╚════╩═════╩═══════════╩══════╩════════╩══════╩══════╩═══════════╩══════╩════════╝

A colleague of mine suggested this code, and while I can follow how it arrives, I don't understand why the first code sample doesn't work. It feels to me like this would require a lot of extra parsing, and with a large data set I'd be worried about performance impacts.


WITH cte1
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   (VALUES
                        ( 1, 'A', 1 ),
                        ( 1, 'A', 1 ),
                        ( 1, 'B', 2 ),
                        ( 1, 'C', 3 ),
                        ( 1, 'B', 4 ),
                        ( 1, 'A', 5 ),
                        ( 1, 'A', 5 ),
                        ( 2, 'A', 1 ),
                        ( 2, 'B', 2 ),
                        ( 2, 'A', 3 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 1 ),
                        ( 3, 'A', 2 )) AS x("id", "val", "sortorder")),
     ctedropped
     AS (SELECT [id]
              , [val]
              , [sortorder]
              , ROW_NUMBER() OVER(PARTITION BY [id]
                                             , [val]
                                             , [sortorder]
                ORDER BY [id]
                       , [sortorder]) AS "rankall"
         FROM   cte1
         WHERE  [cte1].[rankall] > 1)
     SELECT [cte1].[id]
          , [cte1].[val]
          , [cte1].[sortorder]
     FROM   cte1
     WHERE  NOT EXISTS
     (
         SELECT *
         FROM   [ctedropped]
         WHERE  [cte1].[id] = [ctedropped].[id] AND 
                [cte1].[val] = [ctedropped].[val] AND 
                [cte1].[rankall] = [ctedropped].[rankall]
     )
     ORDER BY [cte1].[id]
            , [cte1].[sortorder];

Upvotes: 1

Views: 205

Answers (2)

dnoeth
dnoeth

Reputation: 60482

If you just want to remove rows where the value doesn't change you can apply this logic:

WITH cte1 AS
 (
   SELECT [id]
        , [val]
        , [sortorder]
        , Lag(val) Over(PARTITION BY [id]
                        ORDER BY [sortorder]) AS prevval
   FROM    demo
 )
SELECT * 
FROM cte1
WHERE prevval IS NULL  -- first row
   OR prevval <> val   -- value changed

See fiddle

Upvotes: 2

avery_larry
avery_larry

Reputation: 2135

I think what you want is the results you're getting but only if the VAL has changed. Your query doesn't work because you have nothing in your query that accomplishes the but. I think somewhere you can use lag() to make sure that the rest.VAL does not equal the previous row's main.VAL. One idea -- you can add the previous value to the main CTE using lag and then use the previous VAL in your join criteria. There are probably better/more elegant ways to do this.

; with main as (
   select *, ROW_NUMBER() over (partition by ID, Val, sortorder order by ID,
      SortOrder) as "Rank",
      row_number() over (partition by ID, val order by ID, sortorder) as "s_rank",
      lag(VAL,1) over (order by ID,sortorder) as prevVAL -- Here is the lag that populates the previous VAL
   from 
      (values (1, 'A', 1), (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'B', 4),
              (1, 'A', 5), (1, 'A', 5), (2, 'A', 1), (2, 'B', 2), (2, 'A', 3), 
              (3, 'A', 1), (3, 'A', 1), (3, 'A', 2)
      ) as x("ID", "VAL", "SortOrder")
   group by id, val, SortOrder
   --order by ID, "SortOrder"
),
cte_rest as (
   select *
   from main
   where "s_rank" > 1
)
select *
from main

left join cte_rest rest
   on main.id = rest.id
   and main.s_rank > 1
   and main.SortOrder = rest.SortOrder
   and rest.VAL <> main.prevVAL -- Here is where we make sure there is a change in VAL
 --where not exists (select 1 from cte_rest r where r.id = main.id and r.val <> main.VAL and main.s_rank < s_rank)
order by main.ID, main.SortOrder

Upvotes: 1

Related Questions