Raheel Hasan
Raheel Hasan

Reputation: 6033

TSQL - subquery inside Begin End

Consider the following query:

begin 
;with 

t1 as (
    select top(10) x from tableX
),

t2 as (
    select * from t1 
),

t3 as (
    select * from t1 
)
-- --------------------------

select *     
from t2
join t3 on t3.x=t2.x

end
go

I was wondering if t1 is called twice hence tableX being called twice (which means t1 acts like a table)?

or just once with its rows saved in t1 for the whole query (like a variable in a programming lang)?

Just trying to figure out how tsql engine optimises this. This is important to know because if t1 has millions of rows and is being called many times in the whole query generating the same result then there should be a better way to do it..

Upvotes: 1

Views: 259

Answers (3)

Raheel Hasan
Raheel Hasan

Reputation: 6033

Here is a simple example to test the theories: First, via temporary table which calls the matter only once.

declare @r1 table (id int, v uniqueidentifier);

insert into @r1 
SELECT * FROM 
(
    select id=1, NewId() as 'v' union   
    select id=2, NewId()
) t
-- -----------

begin 
;with 

    t1 as (
        select * from @r1
    ),

    t2 as (
        select * from t1    
    ),

    t3 as (
        select * from t1    
    )

    -- ----------------

    select * from t2
    union all select * from t3

end
go

enter image description here


On the other hand, if we put the matter inside t1 instead of the temporary table, it gets called twice.

t1 as (
        select id=1, NewId() as 'v' union   
        select id=2, NewId()
    )

enter image description here

Hence, my conclusion is to use temporary table and not reply on cached results. Also, ive implemented this on a large scale query that called the "matter" twice only and after moving it to temporary table the execution time went straight half!!

Upvotes: 0

Gottfried Lesigang
Gottfried Lesigang

Reputation: 67341

To be honest, there is no answer... The only answer is Race your horses (Eric Lippert).

The way you write your query does not tell you, how the engine will put it in execution. This depends on many, many influences...

You tell the engine, what you want to get and the engine decides how to get this.

This may even differ between identical calls depending on statistics, currently running queries, existing cached results etc.

Just as a hint, try this:

USE master;
GO
CREATE DATABASE testDB;
GO
USE testDB;
GO

--I create a physical test table with 1.000.000 rows

CREATE TABLE testTbl(ID INT IDENTITY PRIMARY KEY, SomeValue VARCHAR(100));

WITH MioRows(Nr) AS (SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values v1 CROSS JOIN master..spt_values v2 CROSS JOIN master..spt_values v3)
INSERT INTO testTbl(SomeValue)
SELECT CONCAT('Test',Nr)
FROM MioRows;

--Now we can start to test this

GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();

--Your approach with CTEs

;with t1 as (select * from testTbl)
     ,t2 as (select * from t1)
     ,t3 as (select * from t1)
select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue  INTO target1   
from t2
join t3 on t3.ID=t2.ID;

SELECT 'Final CTE',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
GO

CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();

--Writing the intermediate result into a physical table

SELECT * INTO test1 FROM testTbl;
SELECT 'Write into test1',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());

select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue  INTO target2   
from test1 t2
join test1 t3 on t3.ID=t2.ID

SELECT 'Final physical table',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());

GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();

--Same as before, but with an primary key on the intermediate table

SELECT * INTO test2 FROM testTbl;
SELECT 'Write into test2',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
ALTER TABLE test2 ADD PRIMARY KEY (ID);
SELECT 'Add PK',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());

select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue  INTO target3  
from test2 t2
join test2 t3 on t3.ID=t2.ID

SELECT 'Final physical tabel with PK',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());

--Clean up (Careful with real data!!!)

GO
USE master;
GO
--DROP DATABASE testDB;
GO

On my system the

  • first takes 674ms, the
  • second 1.205ms (297 for writing into test1) and the
  • third 1.727ms (285 for writing into test2 and ~650ms for creating the index.

Although the query is performed twice, the engine can take advantage of cached results.

Conclusio

The engine is really smart... Don't try to be smarter...

If the table would cover a lot of columns and much more data per row the whole test might return something else...

If your CTEs (sub-queries) involve much more complex data with joins, views, functions and so on, the engine might get into troubles finding the best approach.

If performance matters, you can race your horses to test it out. One hint: I sometimes used a TABLE HINT quite successfully: FORCE ORDER. This will perform joins in the order specified in the query.

Upvotes: 2

gotqn
gotqn

Reputation: 43666

Just create the table:

CREATE TABLE tableX
(
    x int PRIMARY KEY
);

INSERT INTO tableX
VALUES (1)
      ,(2)

Turn on the execution plan generation and execute the query. You will get something like this:

enter image description here

So, yes, the table is queried two times. If you are using complex common table expression and you are working with huge amount of data, I will advice to store the result in temporary table.

Sometimes, I am getting very bad execution plans for complex CTEs which were working nicely in the past. Also, you are allowed to define indexes on temporary tables and improve performance further.

Upvotes: 3

Related Questions