Reputation: 256741

How to select rows, and nearby rows

Background

I have a table of values that some need attention:

| ID      | AddedDate   |
|---------|-------------|
|       1 | 2010-04-01  |
|       2 | 2010-04-01  |
|       3 | 2010-04-02  |
|       4 | 2010-04-02  |
|       5 | NULL        | <----------- needs attention
|       6 | 2010-04-02  |
|       7 | 2010-04-03  |
|       8 | 2010-04-04  |
|       9 | 2010-04-04  |
| 2432659 | 2016-06-15  |
| 2432650 | 2016-06-16  |
| 2432651 | 2016-06-17  |
| 2432672 | 2016-06-18  |
| 2432673 | NULL        | <----------- needs attention
| 2432674 | 2016-06-20  |
| 2432685 | 2016-06-21  |

I want to select the rows where AddedDate is null, and i want to select rows around it. In this example question it would be sufficient to say rows where the ID is ±3. This means i want:

| ID      | AddedDate   |
|---------|-------------|
|       2 | 2010-04-01  | ─╮
|       3 | 2010-04-02  |  │
|       4 | 2010-04-02  |  │
|       5 | NULL        |  ├──ID values ±3
|       6 | 2010-04-02  |  │
|       7 | 2010-04-03  |  │
|       8 | 2010-04-04  | ─╯

| 2432672 | 2016-06-18  | ─╮
| 2432673 | NULL        |  ├──ID values ±3
| 2432674 | 2016-06-20  | ─╯

Note: In reality it's a table of 9M rows, and 15k need attention.

Attempts

First i create a query that builds the ranges i'm interested in returning:

SELECT
  ID-3 AS [Low ID],
  ID+3 AS [High ID]
FROM Items
WHERE AddedDate IS NULL

Low ID   High ID
-------  -------
2        8 
2432670  2432676

So my initial attempt to use this does work:

WITH dt AS (
   SELECT ID-3 AS Low, ID+3 AS High
   FROM Items
   WHERE AddedDate IS NULL
)
SELECT * FROM Items
WHERE EXISTS(
    SELECT 1 FROM dt
    WHERE Items.ID BETWEEN dt.Low AND dt.High)

But when i try it on real data:

9 million total rows
15,000 interesting rows
subtree cost of 63,318,400
it takes hours (before i give up and cancel it)

There's probably a more efficient way.

Bonus Reading

Upvotes: 18

Answers (4)

Jorge Y.

Reputation: 1133

Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:

DECLARE @userData TABLE(
    idInRange int NOT NULL
)

INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r 
FROM Items i 
CROSS JOIN (
  SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
  SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r 
) yourRange
WHERE AddedDate IS NULL;

SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange

Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap

Upvotes: 1

dnoeth

Reputation: 60472

This is your existing logic rewritten using an moving max:

WITH dt AS (
   SELECT
      ID, AddedDate,
      -- check if there's a NULL within a range of +/- 3 rows
      -- and remember it's ID 
      max(case when AddedDate is null then id end)
      over (order by id 
            rows between 3 preceding and 3 following) as NullID
   FROM Items 
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3

Upvotes: 4

i486

Reputation: 6563

Another way:

SELECT i1.*
    FROM Items i1, Items i2
        WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3

I hope there is index on AddedDate column.

Upvotes: 3

Gordon Linoff

Reputation: 1269913

Here is one method that uses the windowing clause:

select i.*
from (select i.*,
             count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
             count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
             count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
             count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
      from items
     ) i
where cnt_ad_prec <> cnt_prec or
      cnt_ad_foll <> cnt_foll or
      addeddate is null;
order by id;

This returns all rows that have NULL in the column or are within three rows of a NULL.

The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.