Techie
Techie

Reputation: 85

Which is best to use between the IN and JOIN operators in SQL server for the list of values as table two?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?

Example case for IN operator:

SELECT * 
FROM table_one 
WHERE column_one IN (SELECT column_one FROM table_two)

Example case for JOIN operator:

SELECT * 
FROM table_one TOne 
JOIN (select column_one from table_two) AS TTwo
    ON TOne.column_one = TTwo.column_one

In the above query, which is recommended to use and why?

Upvotes: 0

Views: 593

Answers (3)

Zohar Peled
Zohar Peled

Reputation: 82514

tl;dr; - once the queries are fixed so that they will yield the same results, the performance is the same.

Both queries are not the same, and will yield different results.

The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.

That can be solved easily by replacing the * in the second query to table_one.*, or better yet, specify only the columns you want to get back from the query (which is best practice).

However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it fits multiple records in table_two, while the JOIN query will simply duplicate the records as many times as the criteria in the ON clause is met.

Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.

So, in the performance front:

The IN operator has a history of poor performance with a large values list - in earlier versions of SQL Server, if you would have used the IN operator with, say, 10,000 or more values, it would have suffer from a performance issue.

With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.

However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.

UPDATE: I've done some performance research, on the only available version I have for SQL Server which is 2016 . First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.

CREATE TABLE Table_One
(
    id int,
    CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);

CREATE TABLE Table_Two
(
    column_one int,
    CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);

Then, I've populated both tables with 1,000,000 (one million) rows.

SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;

INSERT INTO Table_One (id) 
SELECT N
FROM Tally;

INSERT INTO Table_Two (column_one) 
SELECT N 
FROM Tally;

Next, I've ran four different ways of getting all the values of table_one that matches values of table_two. - The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth is a query that uses the exists operator with a correlated subquery instead of the in operaor`,

SELECT * 
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);

SELECT TOne.* 
FROM table_one TOne 
JOIN (select column_one from table_two) AS TTwo
    ON TOne.id = TTwo.column_one;

SELECT TOne.* 
FROM table_one TOne 
JOIN table_two AS TTwo
    ON TOne.id = TTwo.column_one;  

SELECT * 
FROM table_one
WHERE EXISTS
(
    SELECT 1 
    FROM table_two
    WHERE column_one = id
);

All four queries yielded the exact same result with the exact same execution plan - so from it's safe to say performance, under these circumstances, are exactly the same.

You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).

Upvotes: 2

Shekar Kola
Shekar Kola

Reputation: 1297

In the above query, which is recommended to use and why?

The second (JOIN) query cannot be optimal compare to first query unless you put where clause within sub-query as follows:

Select * from table_one TOne 
JOIN (select column_one from table_two where column_tow = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one

However, the better decision can be based on execution plan with following points into consideration:

  1. How many tasks the query has to perform to get the result
  2. What is task type and execution time of each task
  3. Variance between Estimated number of row and Actual number of rows in each task - this can be fixed by UPDATED STATISTICS on TABLE if the variance too high.

In general, the Logical Processing Order of the SELECT statement goes as follows, considering that if you manage your query to read the less amount of rows/pages at higher level (as per following order) would make that query less logical I/O cost and eventually query is more optimized. i.e. It's optimal to get rows filtered within From or Where clause rather than filtering it in GROUP BY or HAVING clause.

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. WITH CUBE or WITH ROLLUP
  7. HAVING
  8. SELECT
  9. DISTINCT
  10. ORDER BY
  11. TOP

Upvotes: 0

Barbaros Özhan
Barbaros Özhan

Reputation: 65363

From the point of performance view, mostly, using EXISTS might be a better option rather than using IN operator and JOIN among the tables :

SELECT TOne.*
  FROM table_one TOne
 WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )

If you need the columns from both tables, and provided those have indexes on the column column_one used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes :

SELECT TOne.*, TTwo.*
  FROM table_one TOne
  JOIN table_two TTwo
    ON TOne.column_one = TTwo.column_one

Upvotes: 0

Related Questions