Reputation: 184
I'm using SQL Server 2008R2 in this problem. Here's an example dataset:
WIRE_ID FROM TO CLASS
05485 0.000 1.520 PL
05485 1.520 3.050 PL
05485 3.050 22.250 SL
05485 3.050 22.250 SP
05485 22.250 33.530 SL
05485 22.250 33.530 QT
05485 33.530 43.580 QT
05485 43.580 52.580 PL
05485 52.580 57.910 QT
114161 0.000 3.000 SW
114161 3.000 5.000 SL
114161 5.000 6.000 SL
114161 6.000 9.412 YN
114161 9.412 10.549 YN
114161 10.549 12.375 CM
114161 12.375 14.438 SL
114161 14.438 15.126 SL
So, a non-sequential ID associated ranged values and a group/classification. As you can see you can sometimes have duplicate intervals as different classes may be applied. Ultimately the result I'd like to achieve would look like the following:
WIRE_ID FROM TO CLASS
05485 0.000 3.050 PL
05485 3.050 22.250 SL
05485 3.050 22.250 SP
05485 22.250 33.530 SL
05485 22.250 43.580 QT
05485 43.580 52.580 PL
05485 52.580 57.910 QT
114161 0.000 3.000 SW
114161 3.000 6.000 SL
114161 6.000 10.549 YN
114161 10.549 12.375 CM
114161 12.375 15.126 SL
Seems easy at first and I've constructed a solution that works, but once I apply it to the entire data-set it grinds to a halt. Ideally I need a solution that can handle a million rows of this style of data in a more or less efficient manner... Here's my solution:
Declare @WIRE_CLASS Table(WIRE_ID varchar(25), [FROM] float, [TO] float, CLASS varchar(15));
Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS) Values
('05485',0.000,1.520,'PL'),
('05485',1.520,3.050,'PL'),
('05485',3.050,22.250,'SL'),
('05485',3.050,22.250,'SP'),
('05485',22.250,33.530,'SL'),
('05485',22.250,33.530,'QT'),
('05485',33.530,43.580,'QT'),
('05485',43.580,52.580,'PL'),
('05485',52.580,57.910,'QT'),
('114161',0.000,3.000,'SW'),
('114161',3.000,5.000,'SL'),
('114161',5.000,6.000,'SL'),
('114161',6.000,9.412,'YN'),
('114161',9.412,10.549,'YN'),
('114161',10.549,12.375,'CM'),
('114161',12.375,14.438,'SL'),
('114161',14.438,15.126,'SL');
;with WIRE AS (
SELECT
WIRE_ID,
FROM,
TO,
CLASS
FROM
WIRE_CLASS
), ISLANDS AS (
SELECT
ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.FROM) ID,
WI.WIRE_ID,
WI.FROM,
WI.TO,
WI.CLASS,
CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END BREAKER
FROM
WIRE WI
LEFT JOIN WIRE WI2 ON
WI2.WIRE_ID = WI.WIRE_ID
AND (WI2.TO = WI.FROM)
AND WI2.CLASS = WI.CLASS
), DATA AS(
SELECT
IS1.WIRE_ID, IS1.FROM, IS1.TO, IS1.CLASS,
(SELECT sum(BREAKER) FROM ISLANDS IS2 WHERE IS1.ID >= IS2.ID) BREAKER
FROM ISLANDS IS1
)
SELECT
DA.WIRE_ID,
MIN(DA.FROM),
MAX(DA.TO),
MIN(DA.CLASS)
FROM DATA DA
GROUP BY
DA.WIRE_ID,
BREAKER,
DA.CLASS
ORDER BY
DA.WIRE_ID,
MIN(DA.[FROM]),
MAX(DA.[TO])
Can you suggest a better way to do this??? Thanks a bunch SQL gurus!
Upvotes: 4
Views: 596
Reputation: 1660
As others have said, using a temp table instead of a table variable will make a world of difference to the performance. I assume however, you've only used the table variable to pose your example and in reality you have a proper user table that contains all this data.
My query will find the islands, but will provide slightly different output to your results from your sample data. My query will not split islands as it appears yours has in some instances... I see a discrepancy in your results between the way the SL/SP/QT results are handled for wire 05485. I think
WIRE_ID FROM TO CLASS
05485 3.050 22.250 SL
05485 3.050 22.250 SP
05485 22.250 33.530 SL
05485 22.250 33.530 QT
05485 33.530 43.580 QT
should result in
WIRE_ID FROM TO CLASS
05485 3.050 33.530 SL
05485 3.050 22.250 SP
05485 22.250 43.580 QT
not
WIRE_ID FROM TO CLASS
05485 3.050 22.250 SL
05485 3.050 22.250 SP
05485 22.250 33.530 SL
05485 22.250 43.580 QT
Your results have split the SL island, but not the QT. Even though there is from/to pairs matching between SL & SP (3.05/22.5) and then SL & QT (22.5/33.53).
I've run the query below on a VM on my laptop (read: a server should be quicker) and for 1M records, with ~2000 Wire/Class combinations and ~20% rows needing to have from/to combined. I've randomly generated the data a few times and it typically takes only between 80 and 100 seconds.
I've created the table as:
IF OBJECT_ID('tempdb.dbo.#WIRE_CLASS', 'U') IS NOT NULL
DROP TABLE #WIRE_CLASS
GO
CREATE TABLE #WIRE_CLASS (
WIRE_ID varchar(25),
[FROM] float,
[TO] float,
CLASS varchar(15),
PRIMARY KEY (WIRE_ID, [FROM], CLASS)
)
Here's the query, with explanations in comments
-- The Cross Join ensures we always have a pair of first and last from/to pairs
-- The left join matches all from=to combinations,
-- allowing the where clause to restrict to just the first and last
-- These first/last pairs are then grouped in the CTE
-- The final select is then quite simple
; With GroupedData AS (
SELECT
(Row_Number() OVER (ORDER BY W1.WIRE_ID, W1.CLASS, W1.[FROM]) - 1) / 2 Grp,
W1.WIRE_ID, W1.[FROM], W1.[TO], W1.CLASS
FROM #WIRE_CLASS W1
CROSS JOIN (SELECT 0 AS [First] UNION SELECT 1) SetOrder
LEFT OUTER JOIN #WIRE_CLASS W2
ON W1.WIRE_ID = W2.WIRE_ID
AND W1.CLASS = W2.CLASS
AND ((W1.[TO] = W2.[FROM] AND [First] = 0)
OR (W2.[TO] = W1.[FROM] AND [First] = 1))
WHERE W2.WIRE_ID IS NULL
)
SELECT WIRE_ID, MIN([FROM]) AS [FROM], MAX([TO]) AS [TO], CLASS
FROM GroupedData
GROUP BY Grp, WIRE_ID, CLASS
ORDER BY WIRE_ID, [FROM], CLASS
Upvotes: 4
Reputation: 22753
Using temporary tables instead of CTE's with large datasets will improve the performance, but without your dataset I'm unable to test the performance of this.
Here's your queries using temp tables:
Declare @WIRE_CLASS Table(WIRE_ID varchar(25),
[FROM] float,
[TO] float,
CLASS varchar(15));
Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS)
Values ('05485',0.000,1.520,'PL'),
('05485',1.520,3.050,'PL'),
('05485',3.050,22.250,'SL'),
('05485',3.050,22.250,'SP'),
('05485',22.250,33.530,'SL'),
('05485',22.250,33.530,'QT'),
('05485',33.530,43.580,'QT'),
('05485',43.580,52.580,'PL'),
('05485',52.580,57.910,'QT'),
('114161',0.000,3.000,'SW'),
('114161',3.000,5.000,'SL'),
('114161',5.000,6.000,'SL'),
('114161',6.000,9.412,'YN'),
('114161',9.412,10.549,'YN'),
('114161',10.549,12.375,'CM'),
('114161',12.375,14.438,'SL'),
('114161',14.438,15.126,'SL');
SELECT
WIRE_ID,
[FROM],
[TO],
CLASS
INTO #tmp_WIRE
FROM
@WIRE_CLASS
SELECT
ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.[FROM]) ID,
WI.WIRE_ID,
WI.[FROM],
WI.[TO],
WI.CLASS,
CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END as BREAKER
INTO #tmp_ISLANDS
FROM
#tmp_WIRE WI
LEFT JOIN #tmp_WIRE WI2 ON
WI2.WIRE_ID = WI.WIRE_ID
AND (WI2.[TO] = WI.[FROM])
AND WI2.CLASS = WI.CLASS
SELECT
IS1.WIRE_ID, IS1.[FROM], IS1.[TO], IS1.CLASS,
(SELECT sum(BREAKER) FROM #tmp_ISLANDS IS2 WHERE IS1.ID >= IS2.ID) BREAKER
INTO #tmp_DATA
FROM #tmp_ISLANDS IS1
SELECT
DA.WIRE_ID,
MIN(DA.[FROM]),
MAX(DA.[TO]),
MIN(DA.CLASS)
FROM #tmp_DATA DA
GROUP BY
DA.WIRE_ID,
BREAKER,
DA.CLASS
ORDER BY
DA.WIRE_ID,
MIN(DA.[FROM]),
MAX(DA.[TO])
Related Reading:
Which are more performant, CTE or temporary tables?
What's the difference between a CTE and a Temp Table?
Upvotes: 1
Reputation: 5253
Try this:
select wire_id, class, [from], [to],
ROW_NUMBER() over (order by wire_id, class,[from], [to]) gg,
1 as is_continous
into #temp1
from #WIRE_CLASS
group by wire_id, class, [from], [to]
--Including index to improve performance of selects
create index IX_temp1 on #temp1(gg) include (is_continous)
create index IX_temp2 on #temp1(wire_id, class) include (is_continous)
update this
set is_continous = 0
from #temp1 prev join #temp1 this on prev.gg = this.gg -1
where prev.[to] <> this.[from]
update this
set this.[from] = prev.[from]
--select prev.*, this.*
from #temp1 prev join #temp1 this
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID
and prev.CLASS = this.CLASS and this.is_continous = 1
update prev
set prev.[to] = this.[to]
--select prev.*, this.*
from #temp1 prev join #temp1 this
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID
and prev.CLASS = this.CLASS and this.is_continous = 1 and this.gg <> 1
select distinct wire_id, [from], [to], class
from #temp1
Upvotes: 0
Reputation: 981
You can use the OVER() clause of the MIN and MAX statements.
Below you will find a working sample (including the generation of a decent test data set).
-- create a temp table to hold a decent amount of testdata
IF OBJECT_ID('tempdb..#testdata') IS NOT NULL
DROP TABLE #testdata;
CREATE TABLE #testdata
(
Id INT PRIMARY KEY ,
WIRE_ID INT NOT NULL ,
_FROM FLOAT NOT NULL ,
_TO FLOAT NOT NULL ,
CLASS VARCHAR(15) NOT NULL
);
-- create some testdata
-- : 1KK records, divided over 10K WIRE_ID's (so 100 values per wire-id)
-- : FROM-TO values randomly divided into 4 value ranges [0-25][26-50][51-75][76-100]
-- : CLASS bucketnumber + randomly classification from 1 to 5 (cls_1_2 => cls_[bucketnum]_[rand between 1 and 5]
DECLARE @nrows INT = 1000000;
DECLARE @nwires INT = 100;
DECLARE @numbuckets INT = 4;
DECLARE @bucketsize INT= 25000;
DECLARE @bucketstartMax INT= 5000;
DECLARE @clsMax INT = 5;
WITH TestRecs
AS ( SELECT TOP ( @nrows )
ROW_NUMBER() OVER ( ORDER BY ( SELECT 1) ) AS Rnum ,
bucketNumber = ABS(CHECKSUM(NEWID()) % @numbuckets) ,
fromStart = ABS(CHECKSUM(NEWID()) % @bucketstartMax) ,
fromIncrement = ABS(CHECKSUM(NEWID()) % ( @bucketsize- ABS(CHECKSUM(NEWID())% @bucketstartMax) )) ,
clsNum = ABS(CHECKSUM(NEWID()) % @clsMax) + 1
FROM sys.all_columns c1
CROSS JOIN sys.all_columns c2
CROSS JOIN sys.all_columns c3
)
INSERT INTO #testdata
( Id ,
WIRE_ID ,
[_FROM] ,
[_TO] ,
CLASS
)
SELECT RNum AS _id ,
CAST(RNum / @nwires + 1 AS INT) AS Wire_ID ,
_from = ( ( bucketNumber * @bucketsize ) + fromStart ) / CAST(1000 AS FLOAT) ,
_to = ( ( bucketNumber * @bucketsize ) + fromStart + fromIncrement ) / CAST(1000 AS FLOAT) ,
_cls = 'cls' + CAST(bucketNumber AS VARCHAR(3)) + '_' + CAST(clsNum AS VARCHAR(5))
FROM TestRecs;
-- selecting the results can be achieved using OVER() clauses on MIN() and MAX()
-- note this forces you to include _from and _to in GROUP BY
-- , so you need a 2nd select is required to get the distict list
WITH Bucketized
AS ( SELECT WIRE_ID ,
CLASS ,
MIN(_FROM) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _FROM ,
MAX(_TO) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _TO
FROM #testdata
GROUP BY WIRE_ID ,
CLASS ,
_FROM ,
_TO
)
SELECT WIRE_ID ,
CLASS ,
_FROM ,
_TO
FROM Bucketized
GROUP BY WIRE_ID ,
CLASS ,
_FROM ,
_TO;
-- a sample with a million rows takes 6 seconds on my laptop
Upvotes: 0
Reputation: 2663
You would want the be sure that all keys are indexed. Additionally if you're matching up columns of different types that's going to yield a performance hit.
Additionally, where you have the row_number function that's going to change the whole to at least O(2n) efficiency, that would be in part, why you're seeing such long run times with your alg
And as Tanner had written, a move to temp tables would yield a performance improvement
Upvotes: 0
Reputation: 84
Note: The query is in Oracle. You will have to convert it
The bottleneck in your case is multiple self join of such large temporary tables. A different approach could do this without joining. First lets try to group overlapping rows. For this I used a sequence and 2 functions. The functionality of the 2 functions is the same as nextval and currval. The query is
SELECT WIRE_ID,
FROM_,
TO_,
CLASS,
CASE
WHEN LAG (TO_, 1, -999999)
OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >= FROM_
THEN
CURRENTVALUE ()
ELSE
NEXTVALUE ()
END
GRP
FROM WIRE_CLASS
ORDER BY WIRE_ID, CLASS, FROM_
The output is
"unable to upload image"
Now just use this output as inline view and aggregate. Query is
SELECT WIRE_ID,
MIN (FROM_) AS FROM_,
MAX (TO_) AS TO_,
CLASS
FROM (SELECT WIRE_ID,
FROM_,
TO_,
CLASS,
CASE
WHEN LAG (TO_, 1, -999999)
OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >=
FROM_
THEN
CURRENTVALUE ()
ELSE
NEXTVALUE ()
END
AS GRP
FROM WIRE_CLASS) BASE_Q
GROUP BY WIRE_ID, CLASS, GRP
ORDER BY WIRE_ID, FROM_;
The final output is
"unable to upload image"
The functions are used because Oracle does not allow sequences to be used in conditional statements. Here are the definition of functions
CREATE OR REPLACE FUNCTION NEXTVALUE
RETURN NUMBER
IS
VAL NUMBER;
BEGIN
SELECT WIRE.NEXTVAL INTO VAL FROM DUAL;
RETURN VAL;
END;
CREATE OR REPLACE FUNCTION CURRENTVALUE
RETURN NUMBER
IS
VAL NUMBER;
BEGIN
SELECT WIRE.CURRVAL INTO VAL FROM DUAL;
RETURN VAL;
END;
Hope it helps.
Upvotes: 0