asmgx

Reputation: 7984

Why does changing a value in the WHERE clause to a variable cause the query to be 4 times slower?

I am inserting data from the table "Tags" in the "Recovery" database into another table "Tags" in the "R3" database.

They both live on my laptop, on the same SQL Server instance.

I have built the insert query, and because the Recovery..Tags table is around 180M records, I decided to break it into smaller subsets (1 million records at a time).

Here is my query (let's call it Query A):

insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags  T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between 13000001 and 14000000

It takes around 2 minutes. That is OK.

To make things a bit easier for myself, I put the iID range in the WHERE clause in terms of a variable, so my query looks like this (let's call it Query B):

declare @i int = 12

insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags  T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000

but that caused the insert to become very slow (around 10 minutes).

So I tried Query A again, and it took around 2 minutes.

Then I tried Query B again, and it took around 8 minutes!

I am attaching the execution plan for each one (at a site that shows an analysis of the query plan): Query A Plan and Query B Plan.

Any idea why this is happening, and how to fix it?

Upvotes: 1

Views: 72

Answers (2)

seanb

Reputation: 6685

The big difference in time is due to the very different plans being created to join Tags and Reps.

Fundamentally, in version A the optimizer knows how much data is being extracted (a million rows), and it can design an efficient plan for that. However, because you are using a variable in B to define how much data is being read, it has to build a more generic plan: one that would work for 10 rows, a million rows, or a hundred million rows.
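(As an illustration, not something shown in the plans: OPTION (RECOMPILE) is the standard T-SQL hint that makes the optimizer compile a fresh plan using the variable's actual runtime value, which should give Query B estimates based on the real range, much like the literal in Query A:)

declare @i int = 12

insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000
option (recompile)  -- compiled with the current value of @i, as if it were a literal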

In the plans, here are the relevant sections of the query joining Tags and Reps...

... in A: [screenshot of the join section of Query A's plan]

... and in B: [screenshot of the join section of Query B's plan]

Note that in A it takes just over a minute to do the join; in B it takes 6 and a half minutes.

The key thing that appears to take the time is the table scan of the Tags table, which takes 5:44 to complete. The plan uses a table scan because, the next time you run the query, the variable may select many more than 1 million rows.

A secondary issue is that the amount of data it reads (or expects to read) from Reps is also way off. In A it expected to read 2 million rows and actually read 1,421; in B it basically read them all (even though it probably only needed the same 1,421).

I think you have two main approaches to fix this:

  • Look at indexing, to remove the table scan on Tags. Ensure the indexes match what is needed and allow the query to scan just that index (the index at the top of @MikePetri's answer, or something similar, appears to be what you need). This way, instead of doing a table scan, it can do an index scan that can start 'in the middle' of the data set (a table scan must start at either the start or the end of the data set).
  • Separate this into two processes. The first process gets the relevant million rows from Tags and saves them in a temporary table. The second process uses the data in the temporary table to join to Reps (also try OPTION (RECOMPILE) in the second query, so that it checks the temporary table's size before creating the plan). See the sketch below.

You can even put an index or two (and/or a primary key) on that temporary table to make it better for the next step.
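A minimal sketch of that two-step approach, using the same tables and columns as the question (#TagsBatch is just an illustrative name):

declare @i int = 12

-- Step 1: extract the relevant million rows into a temporary table
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey
INTO #TagsBatch
FROM Recovery..tags T
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000;

-- Optional: index the temp table on the join key for step 2
CREATE CLUSTERED INDEX CX_TagsBatch ON #TagsBatch (RepID);

-- Step 2: join to Reps; OPTION (RECOMPILE) lets the optimizer see the temp table's actual size
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT TB.iID,TB.DT,TB.RepID,TB.Tag,TB.xmiID,TB.iBegin,TB.iEnd,TB.Confidence,TB.Polarity,TB.Uncertainty,TB.Conditional,TB.Generic,TB.HistoryOf,TB.CodingScheme,TB.Code,TB.CUI,TB.TUI,TB.PreferredText,TB.ValueBegin,TB.ValueEnd,TB.Value,TB.Deleted,TB.sKey,R.RepType
FROM #TagsBatch TB inner join Recovery..Reps R on TB.RepID = R.RepID
option (recompile);

DROP TABLE #TagsBatch;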

Upvotes: 2

Mike Petri

Reputation: 605

The reason the first query is so much faster is that it went parallel. This means the cardinality estimator knew enough about the data it had to handle, and the query was large enough to tip the threshold for parallel execution. The engine then passed chunks of data to different processors to handle individually, which then report back and repartition the streams.

With the value as a variable, it effectively becomes a scalar function evaluation, and a query cannot go parallel with a scalar function, because the value has to be determined before the cardinality estimator can figure out what to do with it. Therefore, it runs on a single thread and is slower.
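(A side note, not from the original answer: another way to let the cardinality estimator see an actual value, rather than an opaque variable, is to run the statement as a parameterized query through sp_executesql, which compiles with the parameter value visible:)

DECLARE @sql NVARCHAR(MAX) = N'
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i+1)*1000000;';

-- The parameter value is sniffed at compile time, so the plan can be sized (and parallelized) for it
EXEC sys.sp_executesql @sql, N'@i int', @i = 12;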

Some sort of looping mechanism might help. Create the indexes below to assist the engine in handling this request. You can probably find a better looping mechanism, since you are familiar with the identity ranges you care about, but this should point you in the right direction; adjust for your needs.

With a loop like this, the changes are committed with each iteration, so you aren't locking the table indefinitely.

USE Recovery;
GO

-- Covering index so batches of Tags rows can be located by iID without a table scan
CREATE INDEX NCI_iID
ON Tags (iID)
INCLUDE (
            DT
            ,RepID
            ,Tag
            ,xmiID
            ,iBegin
            ,iEnd
            ,Confidence
            ,Polarity
            ,Uncertainty
            ,Conditional
            ,Generic
            ,HistoryOf
            ,CodingScheme
            ,Code
            ,CUI
            ,TUI
            ,PreferredText
            ,ValueBegin
            ,ValueEnd
            ,Value
            ,Deleted
            ,sKey
        );
GO

-- Narrow index on Reps covering the join key and the one extra column the query needs
CREATE INDEX NCI_RepID ON Reps (RepID) INCLUDE (RepType);

USE R3;
GO

-- Index on the destination table so the NOT EXISTS check in the loop can seek on iID
CREATE INDEX NCI_iID ON Tags (iID);
GO

-- Batch size, and a countdown of how many rows are left to copy
DECLARE @RowsToProcess  BIGINT
        ,@StepIncrement INT = 1000000;

-- Count the source rows that have not yet been copied to the destination
SELECT  @RowsToProcess = (
                             SELECT COUNT(1)
                             FROM   Recovery..tags AS T
                             WHERE  NOT EXISTS (
                                                   SELECT   1
                                                   FROM     R3..Tags AS rt
                                                   WHERE    T.iID = rt.iID
                                               )
                         );

-- Copy one batch per iteration; each INSERT commits independently, releasing locks between batches
WHILE @RowsToProcess > 0
BEGIN
    INSERT INTO R3..Tags
    (
        iID
        ,DT
        ,RepID
        ,Tag
        ,xmiID
        ,iBegin
        ,iEnd
        ,Confidence
        ,Polarity
        ,Uncertainty
        ,Conditional
        ,Generic
        ,HistoryOf
        ,CodingScheme
        ,Code
        ,CUI
        ,TUI
        ,PreferredText
        ,ValueBegin
        ,ValueEnd
        ,Value
        ,Deleted
        ,sKey
        ,RepType
    )
    SELECT  TOP (@StepIncrement) -- next batch of source rows, in iID order (see ORDER BY below)
            T.iID
            ,T.DT
            ,T.RepID
            ,T.Tag
            ,T.xmiID
            ,T.iBegin
            ,T.iEnd
            ,T.Confidence
            ,T.Polarity
            ,T.Uncertainty
            ,T.Conditional
            ,T.Generic
            ,T.HistoryOf
            ,T.CodingScheme
            ,T.Code
            ,T.CUI
            ,T.TUI
            ,T.PreferredText
            ,T.ValueBegin
            ,T.ValueEnd
            ,T.Value
            ,T.Deleted
            ,T.sKey
            ,R.RepType
    FROM    Recovery..tags AS T
            INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
    -- Skip rows that are already present in the destination
    WHERE   NOT EXISTS (
                           SELECT   1
                           FROM     R3..Tags AS rt
                           WHERE    T.iID = rt.iID
                       )
    ORDER BY
            T.iID;

    SET @RowsToProcess = @RowsToProcess - @StepIncrement;
END;

Upvotes: 2
