VulgarBinary
VulgarBinary

Reputation: 3589

SELECT TOP 1 without ORDER BY Issue ignores WHERE statement

We have a table that contains unique codes. To generate a new unique code we are using an approach found in the below SQL statement and uncovered cases where the NOT EXISTS statement seemingly allows rows through that exist.

There is no concurrency issues as this was proven out in a sandbox using a single query being ran against SQL Server 2016. If we place the ORDER BY statement it suddenly works as expected. It appears as if without the ORDER BY that this query is conditionally ignoring the WHERE clause. In the event all codes collide I would expect @code to either be NULL or remain it's initial state of 0.

DECLARE @code int = 0;

    select  @code = Code from (
        SELECT top 1 randoms.Code
        FROM (
            VALUES 
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))
        ) randoms (Code)
        WHERE NOT EXISTS (SELECT 1 FROM TEST_Codes uc WHERE uc.Code = randoms.Code)
    ) c;


    SELECT 
        c.code,
        ud.*
    FROM (VALUES (@code)) as c(Code)
    LEFT OUTER JOIN TEST_Codes ud
        ON ud.Code = c.Code

This statement will allow duplicates to be returned, which is baffling due to the WHERE NOT EXISTS statement.

If we change the definition of the view c to be ) c ORDER BY c.Code it suddenly works. Why is this?

Upvotes: 3

Views: 1102

Answers (2)

Martin Smith
Martin Smith

Reputation: 453067

Sql Server does not guarantee how many times it will execute compute scalars and similar expressions. It is possible the reference in the where is using a different value than the one selected but when you add an order by it materialises it and only calculates it once per row.

If you are on 2014 or above you can use an extended events session on query_trace_column_values to see this happening.

DECLARE @TestCodes TABLE(Code int)
dbcc traceon(2486);
set statistics xml on;

    select  Code from (
        SELECT randoms.Code
        FROM (
            VALUES 
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
            (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))
        ) randoms (Code)
        WHERE NOT EXISTS (SELECT 1 FROM @TestCodes uc WHERE uc.Code = randoms.Code)
    ) c
     option(recompile);


set statistics xml off;
dbcc traceoff(2486);

enter image description here

enter image description here

The column Union1005 is output from the constant scan at the top right. It is also referenced again in the join predicate. At this point it is re-evaluated and returns a different number.

You may be able to hack around with the query and get it to only be evaluated once but nothing is guaranteed. The only 100% safe way is to materialise the random numbers up front (e.g. into a temp table) before doing the check so you are guaranteed that they aren't going to be recalculated and change under you.


An example of hacking about with the SQL to get a non guaranteed result is below. I would not use this as it has the disadvantages that it it still guarantees nothing and also even if it works if you pick the top 1 off it your "random" numbers will no longer be as well distributed. It introduces a bias for lower numbers.

select  Code from (
    SELECT TOP 5 randoms.Code
    FROM (
        VALUES 
        (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
        (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
        (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
        (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
        (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))
    ) randoms (Code)
    order by Code
    ) T
    WHERE NOT EXISTS (SELECT 1 FROM @TestCodes uc WHERE uc.Code = T.Code)

enter image description here

This materialises it and the value output from the sort is the same as that used in the nested loops predicate.

enter image description here

Upvotes: 4

clifton_h
clifton_h

Reputation: 1298

The Hypothesis

You need a function that creates 5 random values that you want to ensure are not from a table that holds your code values.

We assume your TEST_CODE does not contain duplicates either, and together with the TOP 1 operator the developer can fairly assume his numbers will always be random.

THE PROBLEM

@Martin.Smith makes some good observations about the way this hack...and yes, it is a hack method. Functions do not allow RAND() function, and what you seek is not so much a solution as a workaround. Note their reason for denying NEWID()

Invalid use of side-effecting or time-dependent operator in 'newid()' within a function.

What will Microsoft say? They clearly did not intend to allow Non-deterministic functions so you may not be supported here anyways.

Instead, treat this like the Insert operation that it is. You only care about one random value where the only requirement is that it not match any of the existing codes, whether active or passive.

ADDED - Furthermore, you want to make your code portable, perhaps even in a job. Use a stored procedure instead.

Testing Methods:

CREATE TABLE Test_Codes (Code INT, ud NVARCHAR(50) )
GO

DECLARE @COUNT INT = 1;
WHILE (@Count < 100000)
    BEGIN
    INSERT INTO Test_Codes (Code, ud)
        VALUES (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)
                 , 'PoliceNumber[' + CAST(@Count AS NVARCHAR(7) ) + ']')
    SET @COUNT = @COUNT + 1 
    END
GO

An Alternative Solution

CREATE PROCEDURE dbo.USP_Code_INS (@Name NVARCHAR(100) )
AS
BEGIN
        INSERT INTO Test_Codes (Code, ud)
        SELECT TOP 1 B.Code, @Name
        FROM Test_Codes A
        RIGHT OUTER JOIN (  SELECT randoms.Code
                            FROM (
                                VALUES 
                                (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
                                (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
                                (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
                                (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT)),
                                (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))
                            ) randoms (Code) ) B ON B.Code = A.Code
        WHERE A.Code IS NULL
END
GO

Note the PREDICATE here is a simple NULL comparison. You are guaranteed to get a unique value with a solution that is tabular. If you are concerned by the occasion that no rows will be inserted, then add a fail-safe check (insert operations are expensive).

SELECT TOP 1 B.Code, @Name
        FROM Test_Codes A
        RIGHT OUTER JOIN (table of NEWID() rows) B
WHERE A.Code IS NULL

You join the tables by any matching rows based on the smallest table (B) and run a predicate that removes any result where Test_Codes has a match. The predicate uses a boolean comparisons of NULL or not NULL, which the optimizer is highly efficient at guessing and sorting through.

This is a set-based solution and fundamentally is how relational languages work.

/* Dynamic Method. Allows to create a job that is simple as a flag to optimize */
CREATE PROCEDURE dbo.USP_Code_INS2 (@BatchNumber INT, @Name NVARCHAR(100) )
AS
BEGIN
     DECLARE @BatchNewID NVARCHAR(MAX) = N'(CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))'
           , @Count INT = 1
           , @SQLStatement NVARCHAR(MAX)
           , @ParmDefinitions NVARCHAR(100) = N'@Name NVARCHAR(100)'
     /*  Dynamically create newid() table to flag-size. */
     WHILE @Count < @BatchNumber AND @BatchNumber > 1
        BEGIN
            SET @BatchNewID = @BatchNewID + N'
                                , (CAST((abs(CHECKSUM(newid())) % 1000000) AS INT))'
            SET @Count = @Count + 1
        END
    /* Now insert into SQL statement */
    SET @SQLStatement = N'INSERT INTO Test_Codes (Code, ud)
        SELECT TOP 1 B.Code, @Name
        FROM Test_Codes A
        RIGHT OUTER JOIN (  SELECT randoms.Code
                            FROM (
                                VALUES ' + @BatchNewID
                        +   N') randoms (Code) ) B ON B.Code = A.Code
        WHERE A.Code IS NULL'
    --PRINT @SQLStatement
    EXEC sp_executesql @SQLStatement, @ParmDefinitions
            , @Name = @Name
END

Conclusion

  • Avoid functions as it over-complicates your solution.
  • Stored Procedures offer a clear, simple and thus manageable solution. Microsoft provides a lot of solutions, so utilize what they are made for.

Benefits of Stored Procedures

Precompiled execution: SQL Server compiles each Stored Procedure once and then re utilizes the execution plan. This results in tremendous performance boosts when Stored Procedures are called repeatedly.

Reduced client/server traffic: If network bandwidth is a concern in your environment then you’ll be happy to learn that Stored Procedures can reduce long SQL queries to a single line that is transmitted over the wire.

Efficient reuse of code and programming abstraction: Stored Procedures can be used by multiple users and client programs. If you utilize them in a planned manner then you’ll find the development cycle requires less time.

Enhanced security controls: You can grant users permission to execute a Stored Procedure independently of underlying table

MSDN - Difference between a stored procedure and function

Upvotes: 1

Related Questions