Alin I
Alin I

Reputation: 590

RANK() OVER PARTITION with RANK resetting

How can I get a RANK that restarts at partition change? I have this table:

ID    Date        Value  
1     2015-01-01  1  
2     2015-01-02  1 <redundant  
3     2015-01-03  2  
4     2015-01-05  2 <redundant  
5     2015-01-06  1  
6     2015-01-08  1 <redundant  
7     2015-01-09  1 <redundant  
8     2015-01-10  2  
9     2015-01-11  3  
10    2015-01-12  3 <redundant  

and I'm trying to delete all the rows where the Value is not changed from the previous entry (marked with < redundant). I've tried using cursors but it takes too long, as the table has ~50 million rows.

I've also tried using RANK:

SELECT ID, Date, Value,
RANK() over(partition by Value order by Date ASC) Rank,
FROM DataLogging 
ORDER BY Date ASC 

but I get:

ID    Date        Value  Rank   (Rank)
1     2015-01-01  1      1      (1)
2     2015-01-02  1      2      (2)
3     2015-01-03  2      1      (1)
4     2015-01-05  2      2      (2)
5     2015-01-06  1      3      (1)
6     2015-01-08  1      4      (2)
7     2015-01-09  1      5      (3)
8     2015-01-10  2      3      (1)
9     2015-01-11  3      1      (1)
10    2015-01-12  3      2      (2)

in parantheses is the Rank I would want, so that I can filter out rows with Rank = 1 and delete the rest of the rows.

EDIT: I've accepted the answer that seemed the easiest to write, but unfortunately none of the answers runs fast enough for deleting the rows. In the end I've decided to use the CURSOR afterall. I've split the data in chuncks of about 250k rows and the cursor runs through and deletes the rows in ~11 mins per batch of 250k rows, and the answers below, with DELETE, take ~35 mins per batch of 250k rows.

Upvotes: 5

Views: 14940

Answers (5)

Ruchika Gujarathi
Ruchika Gujarathi

Reputation: 1

Worked for my case! thanks I had to fetch the report_to change for an employee with respect to the previous reports_to valueand effdt. In other words, fetcth min effective date row for each reports_to change for an employee.

with tocheck as ( select T.emplid,T.reports_to,T.effdt, lag(reports_to) over (order by effdt) as prev_value from PS_JOB t ) select * from tocheck where reports_to <> prev_value or prev_value is null;

added changes further as p

Upvotes: 0

Balah
Balah

Reputation: 2540

This is interesting so I'd thought I'd jump in. Unfortunately, creating a solution with RANK() (or rather, ROW_NUMBER()) without first transforming the data looks to be unobtainable. In an attempt to transform the data, I came up with this solution that uses 1 ROW_NUMBER():

;WITH Ordered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY [Date]) AS [Row], *
    FROM DataLogging
),
Final AS
(
    SELECT
        o1.*, NULLIF(o1.Value - ISNULL(o2.Value, o1.Value - 1), 0) [Change]
    FROM
        Ordered o1
        LEFT JOIN Ordered o2 ON
            o1.[Row] = o2.[Row] + 1
)
SELECT * FROM Final

In the last Change column, the value will be NULL if there is no change in value (but will have the difference if there is a change).

So to do the delete, change the select to

DELETE FROM DataLogging where Change IS NULL

Edit: Lag would work here too but I was visualizing the solution as I went along and completely forgot about that.

Upvotes: 0

paparazzo
paparazzo

Reputation: 45096

select * 
from  ( select ID, Date, Value, lag(Value, 1, 0) over (order by ID) as ValueLag 
        from table ) tt
where ValueLag is null or ValueLag <> Value  

if the order is Date then over (order by Date)

this should show you good and bad - it is based on ID - it you need date then revise
it may look like a long way around but it should be pretty efficient

declare @tt table  (id tinyint, val tinyint);
insert into @tt values 
( 1, 1),
( 2, 1),
( 3, 2),
( 4, 2),
( 5, 1),
( 6, 1),
( 7, 1),
( 8, 2),
( 9, 3),
(10, 3);

select id, val, LAG(val) over (order by id) as lagVal
from @tt;

-- find the good
select id, val 
from ( select id, val, LAG(val) over (order by id) as lagVal
       from @tt 
     ) tt
where  lagVal is null or lagVal <> val 

-- select the bad 
select tt.id, tt.val 
  from @tt tt
  left join ( select id, val 
                from ( select id, val, LAG(val) over (order by id) as lagVal
                         from @tt 
                     ) ttt
               where   ttt.lagVal is null or ttt.lagVal <> ttt.val 
            ) tttt 
    on tttt.id = tt.id 
 where tttt.id is null

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269733

If you want to delete the rows, I would suggest you use lag():

with todelete as (
      select t.*, lag(value) over (order by date) as prev_value
      from t
     )
delete from todelete
    where value = prev_value;

I'm not quite sure what rank() has to do with the problem.

EDIT:

To see the rows not deleted with the same logic:

with todelete as (
      select t.*, lag(value) over (order by date) as prev_value
      from t
     )
select *
from todelete
where value <> prev_value or prev_value is null;

The where clause is just the inverse of the where clause in the first query, taking NULL values into account.

Upvotes: 2

Lamak
Lamak

Reputation: 70638

Here is a somewhat convoluted way to do it:

WITH CTE AS
(
    SELECT  *, 
            ROW_NUMBER() OVER(ORDER BY [Date]) RN1,
            ROW_NUMBER() OVER(PARTITION BY Value ORDER BY [Date]) RN2
    FROM dbo.YourTable
), CTE2 AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Value, RN1 - RN2 ORDER BY [Date]) N
    FROM CTE
)
SELECT *
FROM CTE2
ORDER BY ID;

The results are:

╔════╦════════════╦═══════╦═════╦═════╦═══╗
║ ID ║    Date    ║ Value ║ RN1 ║ RN2 ║ N ║
╠════╬════════════╬═══════╬═════╬═════╬═══╣
║  1 ║ 2015-01-01 ║     1 ║   1 ║   1 ║ 1 ║
║  2 ║ 2015-01-02 ║     1 ║   2 ║   2 ║ 2 ║
║  3 ║ 2015-01-03 ║     2 ║   3 ║   1 ║ 1 ║
║  4 ║ 2015-01-05 ║     2 ║   4 ║   2 ║ 2 ║
║  5 ║ 2015-01-06 ║     1 ║   5 ║   3 ║ 1 ║
║  6 ║ 2015-01-08 ║     1 ║   6 ║   4 ║ 2 ║
║  7 ║ 2015-01-09 ║     1 ║   7 ║   5 ║ 3 ║
║  8 ║ 2015-01-10 ║     2 ║   8 ║   3 ║ 1 ║
║  9 ║ 2015-01-11 ║     3 ║   9 ║   1 ║ 1 ║
║ 10 ║ 2015-01-12 ║     3 ║  10 ║   2 ║ 2 ║
╚════╩════════════╩═══════╩═════╩═════╩═══╝

To delete the rows you don't want, you just need to do:

DELETE FROM CTE2
WHERE N > 1;

Upvotes: 5

Related Questions