Reputation: 11

The same SQL query is yielding different results each time it is run. How do I resolve?

I'm having an issue where I am running a query and each time it produces different results. I'm working with relatively large sets of data (700,000+) and the differences are about 50-100.

I'm taking raw data from a staging table on a specific date (so the data is not changing) and then I'm loading that into a temp table.

Each time I check the count of the Temp table, it is the same so the issue is not there. However on the next part, I'm adding some grouping around dates and other specific details. And when I run this, it will have a different population each time it is run.

As an example, this is a very basic version of my script:

IF OBJECT_ID(N'tempdb..#Table1') IS NOT NULL
DROP TABLE [#Table1]
SELECT ORDER ID
,CITY
,DATE
,AMOUNT
,CURRENCY
INTO #TABLE1
FROM #TABLE2 --This is the data from the staging table

--GROUPING
IF OBJECT_ID(N'tempdb..#Table3') IS NOT NULL
DROP TABLE [#Table3]
SELECT CITY
,DATE
,ROUND(SUM(CONVERT(FLOAT, AMOUNT)),-1)
,CURRENCY
INTO #TABLE3
FROM #TABLE1
GROUP BY CITY,DATE,CURRENCY
HAVING ROUND(SUM(CONVERT(FLOAT, AMOUNT)),-1) = 0

Each time I run this, the record count for #Table3 changes. Any ideas what could be causing this? as mentioned, #Table1 shows the full data set consistently and the underlying source data has not changed.

Upvotes: 0

Answers (1)

Dai

Reputation: 155390

#TemporaryTables - just say no.
- To be clear: #TemporaryTables are a powerful and useful tool, however their lifetime is based on client-connections or procedure scope, which makes me instantly want to avoid them because the lifetime of a client-connection or session is non-deterministic.
- Especially when working in SSMS: each editor window/tab has its own connection to SQL Server, which can be broken at any time - and reconnected - and the Object Explorer has its own connection too.
- And if you're connecting directly to SQL Server over the Internet (please don't) through a high-latency (i.e. anything over 30ms: TDS is very lag-sensitive) then your connection will sporadically drop which may-or-may-not cause your #tempoaryTables to be dropped - it's that uncertainty that puts me off.
Anyway, it's because #temporaryTables' lifetime is bound to your connection rather than any meaningful scope-of-work means that they also make query-batches that use them non-deterministic, which I think is what you're experiencing.
The other main possible reason for the non-determinism is because you're using IEEE-754 floating point-types (i.e. float) to represent monetary values.
- WHICH YOU MUST NEVER DO
  - IEEE-754 types are approximate values that have the lovely habit of losing accuracy/precision every time you do anything to them.
  - Fun-fact: you cannot represent the exact decimal value 0.1 as a IEEE-754 value.
  - Because operations on IEEE-754 types have weird associativity and are not strictly commutative it means that set-based operations (i.e. SUM in SQL) becomes non-deterministic. This is why SQL Server restricts how float values are used in deterministic contexts like indexed-views.
    - IEEE-754 types are still deterministic, but it's a complicated topic to discuss.
  - Always use money or decimal types when working with money/currency values.
    - Some people think money should be avoided and decimal(n,2) used instead because of how money handles money × money multiplication or money / money division, which is fair, but there's no reason to multiply money by money... so this will remain a controversy amongst SQL Server DBAs for a while, methinks.

I recommend you change your script to use only table-variables. Like so:

I assume that you absolutely have to keep on using #table2 for source-data.
- But please consider changing it to a normal (non-temporary, non-table-variable) table if it contains persisted or expensive-to-produce data.
Your query has HAVING ROUND( x, -1 ) = 0 - which I don't think is useful - nor particularly clear.
- ROUND( x, -1 ) means "round to the nearest tenth-place (including all digits to the right, including digits to the right of the radix place).
  - e.g. ROUND( 12345.67, -1 ) is 120.00.
- ROUND( x, -1 ) will only evaluate to 0 when -5 < x < 5 (i.e. -4.999999… <= x <= 4.999999…)
- The same logical condition can be much-simply expressed using just comparison operators as HAVING x > -5 AND x < 5.
  - I assume you were trying to avoid having SUM() mentioned twice, as though you're concerned that HAVING SUM( Amount ) > -5 AND SUM( Amount ) < 5 would make SQL Server compute SUM twice.
    - SQL Server might be kinda dumb, but it isn't that dumb, I can assure you.
    - Another alternative is to use a CTE with the SUM as a named column in the initial SELECT and move the predicate to an outer WHERE clause instead of HAVING. This is the approach used in my answer below (look for WITH g AS).
My code uses double-quotes " to escape object-identifiers instead of [] square-brackets, this is because ISO/ANSI SQL uses " so it's the right thing to do, and makes your already-very-proprietary SQL code ever-so-slightly more portable.
I wrapped everything in a BEGIN TRY and BEGIN TRANSACTION - even though no normal tables are being mutated, because I assume that eventually you are going to make changes you'll want to persist, so I thought I'd scaffold that for you.


SET XACT_ABORT ON; <-- Everyone should be doing this: https://stackoverflow.com/questions/1150032/what-is-the-benefit-of-using-set-xact-abort-on-in-a-stored-procedure

BEGIN TRY;

    BEGIN TRANSACTION;

    -- As this script depends on the state of #table2, you should explicitly validate its contents before continuing:

    DECLARE @expectedTable2RowCount        int           = 1234;
    DECLARE @expectedTable2CurrencySum     decimal(19,2) = 1234567.89;

    DECLARE @actualTable2RowCount          int           = ( SELECT COUNT(*) FROM #table1 );
    DECLARE @actualTable1Table1CurrencySum decimal(19,2) = ( SELECT SUM( "Amount" ) FROM #table2 );

    IF @expectedTable2RowCount <> @actualTable2RowCount OR @expectedTable2CurrencySum <> @actualTable2Table1CurrencySum
    BEGIN
        THROW 50000, '#table1 contains unexpected data. Aborting script.', 0;
    END;

    ------------------

    DECLARE @myTable2Copy TABLE (
        "Order Id" int           NOT NULL,
        "City"     nvarchar(50)  NOT NULL,
        "Date"     date          NOT NULL,
        "Amount"   decimal(19,2) NOT NULL,
        "Currency" char(3)       NOT NULL,

        PRIMARY KEY ( "Order Id" )
    );

    INSERT INTO @myTable1Copy (
        "Order Id", "City", "Date", "Amount", "Currency"
    )
    SELECT
        "Order Id", "City", "Date", "Amount", "Currency"
    FROM
        #table2;

    ------------------

    DECLARE @myGroupedData TABLE (
        "City"      nvarchar(50)  NOT NULL,
        "Date"      date          NOT NULL,
        "Currency"  char(3)       NOT NULL,

        "SumAmount" decimal(19,2) NOT NULL,

        PRIMARY KEY ( "City", "Date", "Currency" )
    );

    WITH g AS (
        SELECT
            "City",
            "Date",
            "Currency",
            SUM( Amount ) AS SumAmount
        FROM
            @myTable2Copy
        GROUP BY
            "City",
            "Date",
            "Currency"
    )

    INSERT INTO @myGroupedData (
        "City", "Date", "Currency", SumAmount
    )
    SELECT
        "City", "Date", "Currency", SumAmount
    FROM
        g
    WHERE
        g.SumAmount > -5.00
        AND
        g.SumAmount <  5.00;

    ------------------

    COMMIT TRANSACTION;

END TRY;
BEGIN CATCH;

    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;

END CATCH;

Upvotes: 2

The same SQL query is yielding different results each time it is run. How do I resolve?

Answers (1)

Related Questions