Alex Zhukovskiy
Alex Zhukovskiy

Reputation: 10015

Aggregate data from multiple rows into single row

In my table each row has some data columns Priority column (for example, timestamp or just an integer). I want to group my data by ID and then in each group take latest not-null column. For example I have following table:

id  A       B       C       Priority
1   NULL    3       4       1
1   5       6       NULL    2
1   8       NULL    NULL    3
2   634     346     359     1
2   34      NULL    734     2

Desired result is :

id  A   B   C   
1   8   6   4   
2   34  346 734 

In this example table is small and has only 5 columns, but in real table it will be much larger. I really want this script to work fast. I tried do it myself, but my script works for SQLSERVER2012+ so I deleted it as not applicable.

Numbers: table could have 150k of rows, 20 columns, 20-80k of unique ids and average SELECT COUNT(id) FROM T GROUP BY ID is 2..5

Now I have a working code (thanks to @ypercubeᵀᴹ), but it runs very slowly on big tables, in my case script can take one minute or even more (with indices and so on).

How can it be speeded up?

SELECT 
    d.id,
    d1.A,
    d2.B,
    d3.C
FROM 
    ( SELECT id
      FROM T
      GROUP BY id
    ) AS d
  OUTER APPLY
    ( SELECT TOP (1) A
      FROM T 
      WHERE id = d.id
        AND A IS NOT NULL
      ORDER BY priority DESC
    ) AS d1 
  OUTER APPLY
    ( SELECT TOP (1) B
      FROM T 
      WHERE id = d.id
        AND B IS NOT NULL
      ORDER BY priority DESC
    ) AS d2 
  OUTER APPLY
    ( SELECT TOP (1) C
      FROM T 
      WHERE id = d.id
        AND C IS NOT NULL
      ORDER BY priority DESC
    ) AS d3 ;

In my test database with real amount of data I get following execution plan: enter image description here

Upvotes: 3

Views: 1773

Answers (4)

Gordon Linoff
Gordon Linoff

Reputation: 1269763

One alternative that might be faster is a multiple join approach. Get the priority for each column and then join back to the original table. For the first part:

select id,
       max(case when a is not null then priority end) as pa,
       max(case when b is not null then priority end) as pb,
       max(case when c is not null then priority end) as pc
from t
group by id;

Then join back to this table:

with pabc as (
      select id,
             max(case when a is not null then priority end) as pa,
             max(case when b is not null then priority end) as pb,
             max(case when c is not null then priority end) as pc
      from t
      group by id
     )
select pabc.id, ta.a, tb.b, tc.c
from pabc left join
     t ta
     on pabc.id = ta.id and pabc.pa = ta.priority left join
     t tb
     on pabc.id = tb.id and pabc.pb = tb.priority left join
     t tc
     on pabc.id = tc.id and pabc.pc = tc.priority ;

This can also take advantage of an index on t(id, priority).

Upvotes: 2

JassyJov
JassyJov

Reputation: 204

previous code will work with following syntax:

 with pabc as (
          select id,
                 max(case when a is not null then priority end) as pa,
                 max(case when b is not null then priority end) as pb,
                 max(case when c is not null then priority end) as pc
          from t
          group by id
         )
    select pabc.Id,ta.a, tb.b, tc.c
    from pabc 
         left join t ta on pabc.id = ta.id and  pabc.pa = ta.priority 
         left join t tb on pabc.id = tb.id and pabc.pb = tb.priority 
         left join t tc on pabc.id = tc.id and pabc.pc = tc.priority ;

Upvotes: 0

t-clausen.dk
t-clausen.dk

Reputation: 44326

This should do the trick, everything raised to the power 0 will return 1 except null:

DECLARE @t table(id int,A int,B  int,C int,Priority int)
INSERT @t
VALUES (1,NULL,3   ,4   ,1),
(1,5   ,6   ,NULL,2),(1,8   ,NULL,NULL,3),
(2,634 ,346 ,359 ,1),(2,34  ,NULL,734 ,2)

;WITH CTE as
(
  SELECT id, 
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(A,0) desc) = 1 THEN A END A,
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(B,0) desc) = 1 THEN B END B,
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(C,0) desc) = 1 THEN C END C
  FROM @t
)
SELECT id, max(a) a, max(b) b, max(c) c
FROM CTE
GROUP BY id

Result:

id  a   b   c
1   8   6   4
2   34  346 734

Upvotes: 4

Thorsten Kettner
Thorsten Kettner

Reputation: 94914

This looks rather strange. You have a log table for all column changes, but no associated table with current data. Now you are looking for a query to collect your current values from the log table, which is a laborious task naturally.

The solution is simple: have an additional table with the current data. You can even link the tables with a trigger (so either every time a record gets inserted in your log table you update the current table or everytime a change is written to the current table you write a log entry).

Then just query your current table:

select id, a, b, c from currenttable order by id;

Upvotes: -1

Related Questions