Harry

Reputation: 184

Finding Islands With Group and Interval Data

I'm using SQL Server 2008R2 in this problem. Here's an example dataset:

WIRE_ID     FROM    TO      CLASS
05485       0.000   1.520   PL
05485       1.520   3.050   PL
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  33.530  QT
05485       33.530  43.580  QT
05485       43.580  52.580  PL
05485       52.580  57.910  QT
114161      0.000   3.000   SW
114161      3.000   5.000   SL
114161      5.000   6.000   SL
114161      6.000   9.412   YN
114161      9.412   10.549  YN
114161      10.549  12.375  CM
114161      12.375  14.438  SL
114161      14.438  15.126  SL

So: a non-sequential ID associated with ranged values and a group/classification. As you can see, you can sometimes have duplicate intervals, as different classes may be applied. Ultimately, the result I'd like to achieve would look like the following:

WIRE_ID     FROM    TO      CLASS
05485       0.000   3.050   PL
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  43.580  QT
05485       43.580  52.580  PL
05485       52.580  57.910  QT
114161      0.000   3.000   SW
114161      3.000   6.000   SL
114161      6.000   10.549  YN
114161      10.549  12.375  CM
114161      12.375  15.126  SL
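To make the intended merge concrete, here is a rough sketch of the logic outside SQL (illustrative only; the function and variable names are mine, not part of the question). It chains rows whose intervals touch end-to-end within the same WIRE_ID and CLASS, shown on the 114161 rows:

```python
def merge_islands(rows):
    """Chain touching intervals per (wire_id, class) into single rows.

    rows must be ordered so each (wire_id, class) chain is contiguous,
    e.g. sorted by (wire_id, class, frm).
    """
    merged = []
    for wire_id, frm, to, cls in rows:
        last = merged[-1] if merged else None
        if last and last[0] == wire_id and last[3] == cls and last[2] == frm:
            merged[-1] = (wire_id, last[1], to, cls)  # extend current island
        else:
            merged.append((wire_id, frm, to, cls))    # start a new island
    return merged

sample = [
    ("114161", 0.000, 3.000, "SW"),
    ("114161", 3.000, 5.000, "SL"),
    ("114161", 5.000, 6.000, "SL"),
    ("114161", 6.000, 9.412, "YN"),
    ("114161", 9.412, 10.549, "YN"),
    ("114161", 10.549, 12.375, "CM"),
    ("114161", 12.375, 14.438, "SL"),
    ("114161", 14.438, 15.126, "SL"),
]
for row in merge_islands(sorted(sample, key=lambda r: (r[0], r[3], r[1]))):
    print(row)
```

Sorting by class before FROM keeps each chain contiguous; the SQL approaches in the answers achieve the same grouping with joins or window functions.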

It seems easy at first, and I've constructed a solution that works, but once I apply it to the entire dataset it grinds to a halt. Ideally I need a solution that can handle a million rows of this style of data reasonably efficiently. Here's my solution:

Declare @WIRE_CLASS Table(WIRE_ID varchar(25), [FROM] float, [TO] float, CLASS varchar(15));
Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS) Values
('05485',0.000,1.520,'PL'),
('05485',1.520,3.050,'PL'),
('05485',3.050,22.250,'SL'),
('05485',3.050,22.250,'SP'),
('05485',22.250,33.530,'SL'),
('05485',22.250,33.530,'QT'),
('05485',33.530,43.580,'QT'),
('05485',43.580,52.580,'PL'),
('05485',52.580,57.910,'QT'),
('114161',0.000,3.000,'SW'),
('114161',3.000,5.000,'SL'),
('114161',5.000,6.000,'SL'),
('114161',6.000,9.412,'YN'),
('114161',9.412,10.549,'YN'),
('114161',10.549,12.375,'CM'),
('114161',12.375,14.438,'SL'),
('114161',14.438,15.126,'SL');

;with WIRE AS (
SELECT
    WIRE_ID, 
    [FROM],
    [TO],
    CLASS
FROM 
    @WIRE_CLASS
), ISLANDS AS (
SELECT
    ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.[FROM]) ID,
    WI.WIRE_ID, 
    WI.[FROM],
    WI.[TO],
    WI.CLASS,
    CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END BREAKER
FROM 
    WIRE WI 
    LEFT JOIN WIRE WI2 ON
        WI2.WIRE_ID = WI.WIRE_ID
        AND WI2.[TO] = WI.[FROM]
        AND WI2.CLASS = WI.CLASS
), DATA AS (
SELECT 
    IS1.WIRE_ID, IS1.[FROM], IS1.[TO], IS1.CLASS,
    (SELECT SUM(BREAKER) FROM ISLANDS IS2 WHERE IS1.ID >= IS2.ID) BREAKER
FROM ISLANDS IS1
)
SELECT 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO]),
    MIN(DA.CLASS)
FROM DATA DA
GROUP BY 
    DA.WIRE_ID, 
    BREAKER,
    DA.CLASS
ORDER BY 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO])
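For reference, the logic of the query above (flag rows that have no same-class predecessor as BREAKER = 1, take a running sum of the flags as a group id, then aggregate MIN(FROM)/MAX(TO) per wire, group, and class) can be mirrored in plain Python. This is just an illustrative sketch with my own names; note that ROW_NUMBER's ORDER BY does not pin down rows with equal FROM values, so the tie-break (by TO, then CLASS) is made explicit here:

```python
def op_islands(rows):
    """Mirror of the question's query: BREAKER = 1 when no row of the same
    wire and class has TO equal to this row's FROM; a running sum of BREAKER
    numbers the islands; MIN(FROM)/MAX(TO) are taken per (wire, group, class).
    """
    rows = sorted(rows)  # ORDER BY wire, from (ties broken by to, class)
    ends = {(w, t, c) for w, f, t, c in rows}   # candidate predecessors
    groups = {}
    grp = 0
    for w, f, t, c in rows:
        grp += 0 if (w, f, c) in ends else 1    # running SUM(BREAKER)
        key = (w, grp, c)
        lo, hi = groups.get(key, (f, t))
        groups[key] = (min(lo, f), max(hi, t))
    return sorted((w, lo, hi, c) for (w, g, c), (lo, hi) in groups.items())

data = [
    ("05485", 0.000, 1.520, "PL"), ("05485", 1.520, 3.050, "PL"),
    ("05485", 3.050, 22.250, "SL"), ("05485", 3.050, 22.250, "SP"),
    ("05485", 22.250, 33.530, "SL"), ("05485", 22.250, 33.530, "QT"),
    ("05485", 33.530, 43.580, "QT"), ("05485", 43.580, 52.580, "PL"),
    ("05485", 52.580, 57.910, "QT"),
]
for row in op_islands(data):
    print(row)
```

Under this tie-break the sketch reproduces the expected output above for wire 05485, including the split of the SL island at 22.250.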

Can you suggest a better way to do this? Thanks a bunch, SQL gurus!

Upvotes: 4

Views: 596

Answers (6)

Scott C

Reputation: 1660

As others have said, using a temp table instead of a table variable will make a world of difference to the performance. I assume, however, that you've only used the table variable to pose your example, and that in reality you have a proper user table that contains all this data.

My query will find the islands, but will produce slightly different output from your sample results. My query will not split islands as it appears yours does in some instances. I see a discrepancy in your results in the way the SL/SP/QT rows are handled for wire 05485. I think

WIRE_ID     FROM    TO      CLASS
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  33.530  QT
05485       33.530  43.580  QT

should result in

WIRE_ID     FROM    TO      CLASS
05485       3.050   33.530  SL
05485       3.050   22.250  SP
05485       22.250  43.580  QT

not

WIRE_ID     FROM    TO      CLASS
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  43.580  QT

Your results have split the SL island, but not the QT, even though there are matching from/to pairs between SL & SP (3.050/22.250) and then SL & QT (22.250/33.530).

I've run the query below on a VM on my laptop (read: a server should be quicker) against 1M records, with ~2000 wire/class combinations and ~20% of rows needing their from/to combined. I've randomly generated the data a few times and it typically takes between 80 and 100 seconds.

I've created the table as:

IF OBJECT_ID('tempdb.dbo.#WIRE_CLASS', 'U') IS NOT NULL
    DROP TABLE #WIRE_CLASS
GO

CREATE TABLE #WIRE_CLASS (
    WIRE_ID varchar(25),
    [FROM] float,
    [TO] float,
    CLASS varchar(15),
    PRIMARY KEY (WIRE_ID, [FROM], CLASS)
)

Here's the query, with explanations in comments

-- The cross join ensures we always have a pair of first and last from/to pairs
-- The left join matches all from=to combinations,
-- allowing the where clause to restrict to just the first and last
-- These first/last pairs are then grouped in the CTE
-- The final select is then quite simple
;WITH GroupedData AS (
    SELECT
        (ROW_NUMBER() OVER (ORDER BY W1.WIRE_ID, W1.CLASS, W1.[FROM]) - 1) / 2 Grp,
        W1.WIRE_ID,
        W1.[FROM],
        W1.[TO],
        W1.CLASS
    FROM #WIRE_CLASS W1
        CROSS JOIN (SELECT 0 AS [First] UNION SELECT 1) SetOrder
        LEFT OUTER JOIN #WIRE_CLASS W2
            ON W1.WIRE_ID = W2.WIRE_ID
            AND W1.CLASS = W2.CLASS
            AND ((W1.[TO] = W2.[FROM] AND [First] = 0)
              OR (W2.[TO] = W1.[FROM] AND [First] = 1))
    WHERE W2.WIRE_ID IS NULL
)
SELECT
    WIRE_ID,
    MIN([FROM]) AS [FROM],
    MAX([TO]) AS [TO],
    CLASS
FROM GroupedData
GROUP BY Grp, WIRE_ID, CLASS
ORDER BY WIRE_ID, [FROM], CLASS
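The boundary-pairing idea is easier to see outside SQL. Here is a hedged Python sketch of the same logic (the names are mine); it reproduces the merged SL and QT islands described above rather than the question's split version:

```python
def pair_endpoints(rows):
    """A row whose FROM matches no same-wire, same-class TO starts an island;
    a row whose TO matches no same-wire, same-class FROM ends one. Sorted by
    (wire, class, from), the boundary rows interleave start, end, start,
    end..., so consecutive pairs give each island's extent (the SQL query does
    this pairing with (ROW_NUMBER() - 1) / 2)."""
    froms = {(w, f, c) for w, f, t, c in rows}
    tos = {(w, t, c) for w, f, t, c in rows}
    bounds = []
    for w, f, t, c in sorted(rows):
        if (w, f, c) not in tos:              # no predecessor: island start
            bounds.append((w, c, f, t))
        if (w, t, c) not in froms:            # no successor: island end
            bounds.append((w, c, f, t))
    bounds.sort(key=lambda b: (b[0], b[1], b[2]))
    return [(s[0], s[2], e[3], s[1])          # (wire, start FROM, end TO, class)
            for s, e in zip(bounds[0::2], bounds[1::2])]

data = [
    ("05485", 0.000, 1.520, "PL"), ("05485", 1.520, 3.050, "PL"),
    ("05485", 3.050, 22.250, "SL"), ("05485", 3.050, 22.250, "SP"),
    ("05485", 22.250, 33.530, "SL"), ("05485", 22.250, 33.530, "QT"),
    ("05485", 33.530, 43.580, "QT"), ("05485", 43.580, 52.580, "PL"),
    ("05485", 52.580, 57.910, "QT"),
]
for row in pair_endpoints(data):
    print(row)
```

A single-row island appears in `bounds` twice (once as start, once as end), and Python's stable sort keeps the pair adjacent, so the pairing still works.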

Upvotes: 4

Tanner

Reputation: 22753

Using temporary tables instead of CTEs with large datasets will improve the performance, but without your dataset I'm unable to test the performance of this.

Here are your queries using temp tables:

Declare @WIRE_CLASS Table(WIRE_ID varchar(25), 
                          [FROM] float, 
                          [TO] float, 
                          CLASS varchar(15));

Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS) 
Values ('05485',0.000,1.520,'PL'),
       ('05485',1.520,3.050,'PL'),
       ('05485',3.050,22.250,'SL'),
       ('05485',3.050,22.250,'SP'),
       ('05485',22.250,33.530,'SL'),
       ('05485',22.250,33.530,'QT'),
       ('05485',33.530,43.580,'QT'),
       ('05485',43.580,52.580,'PL'),
       ('05485',52.580,57.910,'QT'),
       ('114161',0.000,3.000,'SW'),
       ('114161',3.000,5.000,'SL'),
       ('114161',5.000,6.000,'SL'),
       ('114161',6.000,9.412,'YN'),
       ('114161',9.412,10.549,'YN'),
       ('114161',10.549,12.375,'CM'),
       ('114161',12.375,14.438,'SL'),
       ('114161',14.438,15.126,'SL');

SELECT
    WIRE_ID, 
    [FROM],
    [TO],
    CLASS
INTO #tmp_WIRE
FROM 
    @WIRE_CLASS

SELECT
    ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.[FROM]) ID,
    WI.WIRE_ID, 
    WI.[FROM],
    WI.[TO],
    WI.CLASS,
    CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END  as BREAKER
INTO #tmp_ISLANDS
FROM 
    #tmp_WIRE WI 
    LEFT JOIN #tmp_WIRE WI2 ON
        WI2.WIRE_ID = WI.WIRE_ID
        AND (WI2.[TO] = WI.[FROM]) 
        AND WI2.CLASS = WI.CLASS

SELECT 
    IS1.WIRE_ID, IS1.[FROM], IS1.[TO], IS1.CLASS,
    (SELECT sum(BREAKER) FROM #tmp_ISLANDS IS2 WHERE IS1.ID >=  IS2.ID) BREAKER
INTO #tmp_DATA
FROM #tmp_ISLANDS IS1

SELECT 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO]),
    MIN(DA.CLASS)
FROM #tmp_DATA DA
GROUP BY 
    DA.WIRE_ID, 
    BREAKER,
    DA.CLASS
ORDER BY 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO])

Related Reading:

Which are more performant, CTE or temporary tables?

What's the difference between a CTE and a Temp Table?

Upvotes: 1

SouravA

Reputation: 5253

Try this:

select wire_id, class, [from], [to], 
ROW_NUMBER() over (order by wire_id, class,[from], [to]) gg,
1 as is_continous
into #temp1
from #WIRE_CLASS
group by wire_id, class, [from], [to]

--Including index to improve performance of selects
create index IX_temp1 on #temp1(gg) include (is_continous)
create index IX_temp2 on #temp1(wire_id, class) include (is_continous)

update this
set is_continous = 0 
from #temp1 prev join #temp1 this on prev.gg = this.gg -1
where prev.[to] <> this.[from]



update this
set this.[from] = prev.[from] 
--select prev.*, this.* 
from #temp1 prev join #temp1 this 
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID 
and prev.CLASS = this.CLASS and this.is_continous = 1

update prev
set prev.[to] = this.[to]
--select prev.*, this.* 
from #temp1 prev join #temp1 this 
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID 
and prev.CLASS = this.CLASS and this.is_continous = 1 and this.gg <> 1

select distinct wire_id, [from], [to], class
from #temp1

Upvotes: 0

souplex

Reputation: 981

You can use the OVER() clause with the MIN and MAX functions.

Below you will find a working sample (including the generation of a decent test data set).

-- create a temp table to hold a decent amount of testdata
IF OBJECT_ID('tempdb..#testdata') IS NOT NULL 
    DROP TABLE #testdata;

CREATE TABLE #testdata
    (
      Id INT PRIMARY KEY ,
      WIRE_ID INT NOT NULL ,
      _FROM FLOAT NOT NULL ,
      _TO FLOAT NOT NULL ,
      CLASS VARCHAR(15) NOT NULL
    );

-- create some testdata
-- : 1M records, divided over 10K WIRE_IDs (so 100 values per wire-id)
-- : FROM-TO values randomly divided into 4 value ranges [0-25][26-50][51-75][76-100]
-- : CLASS = bucket number + a random classification from 1 to 5 (e.g. cls_1_2 => cls_[bucketnum]_[rand between 1 and 5])
DECLARE @nrows INT = 1000000;
DECLARE @nwires INT = 100;
DECLARE @numbuckets INT = 4;
DECLARE @bucketsize INT= 25000;
DECLARE @bucketstartMax INT= 5000;
DECLARE @clsMax INT = 5;

WITH    TestRecs
          AS ( SELECT TOP ( @nrows )
                        ROW_NUMBER() OVER ( ORDER BY ( SELECT 1) ) AS Rnum ,
                        bucketNumber = ABS(CHECKSUM(NEWID()) % @numbuckets) ,
                        fromStart = ABS(CHECKSUM(NEWID()) % @bucketstartMax) ,
                        fromIncrement = ABS(CHECKSUM(NEWID()) % ( @bucketsize- ABS(CHECKSUM(NEWID())% @bucketstartMax) )) ,
                        clsNum = ABS(CHECKSUM(NEWID()) % @clsMax) + 1
               FROM     sys.all_columns c1
                        CROSS JOIN sys.all_columns c2
                        CROSS JOIN sys.all_columns c3
             )
    INSERT  INTO #testdata
            ( Id ,
              WIRE_ID ,
              [_FROM] ,
              [_TO] ,
              CLASS 
            )
            SELECT  RNum AS _id ,
                    CAST(RNum / @nwires + 1 AS INT) AS Wire_ID ,
                    _from = ( ( bucketNumber * @bucketsize ) + fromStart ) / CAST(1000 AS FLOAT) ,
                    _to = ( ( bucketNumber * @bucketsize ) + fromStart + fromIncrement ) / CAST(1000 AS FLOAT) ,
                    _cls = 'cls' + CAST(bucketNumber AS VARCHAR(3)) + '_' + CAST(clsNum AS VARCHAR(5))
            FROM    TestRecs;

-- selecting the results can be achieved using OVER() clauses on MIN() and MAX()
-- note this forces you to include _from and _to in the GROUP BY,
-- so a 2nd select is required to get the distinct list
WITH    Bucketized
          AS ( SELECT   WIRE_ID ,
                        CLASS ,
                        MIN(_FROM) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _FROM ,
                        MAX(_TO) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _TO
               FROM     #testdata
               GROUP BY WIRE_ID ,
                        CLASS ,
                        _FROM ,
                        _TO
             )
    SELECT  WIRE_ID ,
            CLASS ,
            _FROM ,
            _TO
    FROM    Bucketized
    GROUP BY WIRE_ID ,
            CLASS ,
            _FROM ,
            _TO;

-- a sample with a million rows takes 6 seconds on my laptop

Upvotes: 0

Drew

Reputation: 2663

You would want to be sure that all keys are indexed. Additionally, if you're matching up columns of different types, that's going to yield a performance hit.

Also, the ROW_NUMBER function is going to push the whole query to at least O(2n), which is partly why you're seeing such long run times with your algorithm.

And as Tanner had written, a move to temp tables would yield a performance improvement.

Upvotes: 0

Neerav Kumar

Reputation: 84

Note: the query is written for Oracle; you will have to convert it.

The bottleneck in your case is the multiple self joins of such large temporary tables. A different approach can do this without joining. First, let's try to group overlapping rows. For this I used a sequence and two functions whose behaviour is the same as NEXTVAL and CURRVAL. The query is:

  SELECT WIRE_ID,
     FROM_,
     TO_,
     CLASS,
     CASE
        WHEN LAG (TO_, 1, -999999)
                OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >= FROM_
        THEN
           CURRENTVALUE ()
        ELSE
           NEXTVALUE ()
     END
        GRP
FROM WIRE_CLASS
ORDER BY WIRE_ID, CLASS, FROM_

The output is

"unable to upload image"

Now just use this output as inline view and aggregate. Query is

  SELECT WIRE_ID,
     MIN (FROM_) AS FROM_,
     MAX (TO_) AS TO_,
     CLASS
FROM (SELECT WIRE_ID,
             FROM_,
             TO_,
             CLASS,
             CASE
                WHEN LAG (TO_, 1, -999999)
                        OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >=
                        FROM_
                THEN
                   CURRENTVALUE ()
                ELSE
                   NEXTVALUE ()
             END
                AS GRP
        FROM WIRE_CLASS) BASE_Q
GROUP BY WIRE_ID, CLASS, GRP
ORDER BY WIRE_ID, FROM_;

The final output is

"unable to upload image"

The functions are needed because Oracle does not allow sequences to be used in conditional statements. Here are the definitions of the functions:

CREATE OR REPLACE FUNCTION NEXTVALUE
RETURN NUMBER
IS
VAL   NUMBER;
BEGIN
SELECT WIRE.NEXTVAL INTO VAL FROM DUAL;
RETURN VAL;
END;

CREATE OR REPLACE FUNCTION CURRENTVALUE
RETURN NUMBER
IS
VAL   NUMBER;
BEGIN
SELECT WIRE.CURRVAL INTO VAL FROM DUAL;
RETURN VAL;
END;
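For readers without Oracle, the LAG-plus-counter grouping above can be sketched in Python (an illustration only; the names are mine). Opening a new group plays the role of NEXTVALUE and extending the current one plays the role of CURRENTVALUE:

```python
def lag_islands(rows):
    """Order by (wire, class, from); when the previous row in the same
    wire/class partition has TO >= this row's FROM, extend the current group
    (CURRENTVALUE), otherwise open a new one (NEXTVALUE). Each group's
    MIN(FROM)/MAX(TO) is maintained as we go, collapsing the two-query
    Oracle version into one pass."""
    out = []
    prev_part, prev_to = None, None
    for w, f, t, c in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        if (w, c) == prev_part and prev_to >= f:
            pw, pf, pt, pc = out[-1]
            out[-1] = (pw, pf, max(pt, t), pc)  # same group: extend
        else:
            out.append((w, f, t, c))            # new group
        prev_part, prev_to = (w, c), t
    return out

data = [
    ("05485", 0.000, 1.520, "PL"), ("05485", 1.520, 3.050, "PL"),
    ("05485", 3.050, 22.250, "SL"), ("05485", 3.050, 22.250, "SP"),
    ("05485", 22.250, 33.530, "SL"), ("05485", 22.250, 33.530, "QT"),
    ("05485", 33.530, 43.580, "QT"), ("05485", 43.580, 52.580, "PL"),
    ("05485", 52.580, 57.910, "QT"),
]
for row in lag_islands(data):
    print(row)
```

Like the Oracle query, this merges every touching same-class run, so SL for wire 05485 comes out as one island (3.050 to 33.530).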

Hope it helps.

Upvotes: 0
