Haus
Haus

Reputation: 1492

SQL Server : efficient way to find missing Ids

I am using SQL Server to store tens of millions of records. I need to be able to query its tables to find missing rows where there are gaps in the Id column, as there should be none.

I am currently using a solution that I have found here on StackOverflow:

CREATE PROCEDURE [dbo].[find_missing_ids]
    @Table NVARCHAR(128)
AS
BEGIN
    DECLARE @query NVARCHAR(MAX)
    SET @query = 'WITH Missing (missnum, maxid) '
+ N'AS '
+ N'('
+ N' SELECT 1 AS missnum, (select max(Id) from ' + @Table + ') '
+ N'    UNION ALL '
+ N'    SELECT missnum + 1, maxid FROM Missing '
+ N'    WHERE missnum < maxid '
+ N') '
+ N'SELECT missnum '
+ N'FROM Missing '
+ N'LEFT OUTER JOIN ' + @Table + ' tt on tt.Id = Missing.missnum '
+ N'WHERE tt.Id is NULL '
+ N'OPTION (MAXRECURSION 0);';

    EXEC sp_executesql @query
END;

This solution has been working very well, but it has been getting slower and more resource intensive as the tables have grown. Now, running the procedure on a table of 38 million rows is taking about 3.5 minutes and lots of CPU.

Is there a more efficient way to perform this? After a certain range has been found to not contain any missing Ids, I no longer need to check that range again.

Upvotes: 1

Views: 1946

Answers (3)

David Dubois
David Dubois

Reputation: 3932

JBJ's answer is almost complete. The query needs to return the From and Through for each range of missing values.

select B+1 as [From],A-1 as[Through]from
(select StuffID as A, 
lag(StuffID)over(order by StuffID)as B from Stuff)z
where A<>B+1
order by A

I created a test table with 50 million records, then deleted a few. The first row of the result is:

From   Through
33     35

This indicates that all IDs in the range from 33 through 35 are missing, i.e. 33, 34 and 35.

On my machine the query took 37 seconds.

Upvotes: 6

PSK
PSK

Reputation: 17943

Try this solution, it will be faster than CTE.

;WITH CTE AS
(
SELECT ROW_NUMBER() 
         OVER ( 
           ORDER BY (SELECT NULL)) RN 
FROM   ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v(id) --10 ROWS 
       CROSS JOIN ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v1(id)--100 ROWS 
       CROSS JOIN ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v2(id) --1000 ROWS 
       CROSS JOIN ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v3(id) --10000 ROWS 
       CROSS JOIN ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v4(id)--100000 ROWS 
       CROSS JOIN ( values  (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) v5(id)--1000000 ROWS 
)

SELECT RN AS Missing 
FROM CTE C
LEFT JOIN YOURABLE T ON T.ID=R.ID
WHERE T.ID IS NULL

If you want you can use master..[spt_values] also to generate the number like following.

 SELECT (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) RN 
        FROM   master..[spt_values] T1
        CROSS JOIN (select top 500 * from master..[spt_values]) T2

Above query will generate 1268500 numbers

Note: You need to add the CROSS JOIN as per your requirement.

Upvotes: 1

JBJ
JBJ

Reputation: 403

try

select pId 
from (select Id, lag(Id) over (order by Id) pId from yourschema.yourtable) e
where pId <> (Id-1)
order by Id

replacing yourschema.yourtable with the appropriate table information

Upvotes: 1

Related Questions