Reputation: 10703
Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries out I found that at large data sets the first query using IN
is much faster then the second query using an INNER JOIN
. I do not understand why can someone help explain please.
Upvotes: 3
Views: 711
Reputation: 10600
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id)
. That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.
Upvotes: 2
Reputation: 453142
Generally speaking IN
is different from JOIN
in that a JOIN
can return additional rows where a row has more than one match in the JOIN
-ed table.
From your estimated execution plan though it can be seen that in this case the 2 queries are semantically the same
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN
then they will be removed by the GROUP BY
as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2)
will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2)
(or AVG
or COUNT
) then the presence of the duplicates would change the result.
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX
and those such as SUM
and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54
for IN
vs 28271800
for JOIN
but these estimates won't necessarily be very reliable as the join predicate is unsargable.
Upvotes: 5
Reputation: 754438
Your second query is a bit funny - can you try this one instead??
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data etc.) and your system (RAM, disk etc.), it's really really hard to give a "globally" valid statement
Upvotes: 2