Kieran Benton
Kieran Benton

Reputation: 8890

SQL Server cross-row compression

I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.

I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetative rows down into a single row and a 'count' column. e.g.

prop1    prop2    prop3    prop4
--------------------------------
0        0        1        53
0        0        2        55
1        1        1        8
1        1        1        8
1        1        1        8
1        1        1        8
0        0        2        55
0        0        2        55
0        0        1        53

Into:

prop1    prop2    prop3    prop4    count
-----------------------------------------
0        0        1        53       1
0        0        2        55       1
1        1        1        8        4
0        0        2        55       2
0        0        1        53       1

I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.

Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?

Is there a way to do this (SQL Server 2005)?

Is there a reason I shouldn't do this?

Thanks.

Upvotes: 1

Views: 434

Answers (3)

Tim
Tim

Reputation: 5421

70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently; and on a WAN with more restricted bandwidth you could use DISTINCT to eliminate duplicate rows, an approach which would be frugal with your bandwidth but consume some server CPU. Again, however, unless you have a really overloaded server that is always performing at or near peak loads, this additional consumption would be a mere blip. 70,000 rows is next to nothing.

Upvotes: 0

David Andres
David Andres

Reputation: 31781

This will work, though it is painful to look at:

;WITH Ordering
AS
(
  SELECT Prop1,        
  Prop2,        
  Prop3,        
  Prop4,
  ROW_NUMBER() OVER (ORDER BY Y, X) RN
  FROM Props
)
SELECT 
  CurrentRow.Prop1, 
  CurrentRow.Prop2, 
  CurrentRow.Prop3, 
  CurrentRow.Prop4, 
  CurrentRow.RN - 
    ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4) 
ORDER BY CurrentRow.RN

The gist is the following:

  1. Enumerate each row using ROW_NUMBER OVER to get the correct order.
  2. Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
  3. Figure out the count of repetitions is by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists.

Upvotes: 1

Eric
Eric

Reputation: 95133

You can use the count function! This will require you to use the group by clause, where you tell count how to break up, or group, itself. Gropu by is used for any aggregate function in SQL.

select
    prop1,
    prop2,
    prop3,
    prop4,
    count(*) as count
from
    tbl
group by
    prop1,
    prop2,
    prop3,
    prop4,
    y,
    x
order by y, x

Update: The OP mentioned these are ordered by y and x, not part of the result set. In this case, you can still use y and x as part of the group by.

Keep in mind that order means nothing if it doesn't have ordering columns, so in this case, we have to respect that with y and x in the group by.

Upvotes: 2

Related Questions