EvilDr
EvilDr

Reputation: 9632

How is a TSQL Join implemented internally when there is a many-to-single relationship?

Apologies if this has been posted elsewhere; there are so many questions relating to joining on duplicates, I cannot find an answer. Note that this question is not asking how to remove duplicate rows in results.

Take the following scenario, whereby two tables are joined using a text-match, but one of the tables contains a large number of duplicates:

CREATE TABLE #test (ID int PRIMARY KEY IDENTITY(1,1), textval nvarchar(250));
INSERT INTO #test (textval) VALUES (N'Luke'),(N'Han'),(N'Vader');

DECLARE @tmp TABLE (textval nvarchar(250));
INSERT INTO @tmp VALUES (N'Luke'),(N'Luke'),(N'Luke'),(N'Luke'),(N'Luke'),(N'Jabba');

-- Query 1
SELECT
    tmp.textval,
    t.ID
FROM
    @tmp tmp LEFT JOIN
    #test t ON tmp.textval = t.textval; 

DROP TABLE #test;

I get the desired output here...

Correct result

...but, is that an efficient way to perform this query? Specifically, I am wondering whether TSQL will create a join on every instance of 'Luke' from the table variable, or does it internally remove the duplicates, and therefore only look for 'Luke' once, instead of five times for the five instances?

I tried to look at the statistics and execution plan, but don't know what figures would signify what is happening.

UPDATE

As per Remus's answer, here is the execution plan from the above query, showing zero rewinds/rebinds.

Here is a screen grab of the rewinds/rebinds figure: Can I get a rewiiiiiind?

Upvotes: 0

Views: 59

Answers (2)

Remus Rusanu
Remus Rusanu

Reputation: 294247

You are asking about how is the Join implemented. There are three main strategies:

None of them will 'eliminate' the duplicates, since doing so would be semantically incorrect. However, both Hash and Merge join would 'visit' the value in the #test table only once, while Nested Loops may visit the value multiple times (once for each matching row in the @tmp table). I say may because it depends on which side (left #test or right @tmp) is chosen by SQL optimizer as the driver of the nested loop.

Upvotes: 3

Jayasurya Satheesh
Jayasurya Satheesh

Reputation: 8033

There are no Automatic Ways in which SQL Removes the Duplicates

If you want to Remove the Duplicates across all Columns, then You can simply use the DISTINCT Keyword.

SELECT DISTINCT
    tmp.textval,
    t.ID
FROM
    @tmp tmp LEFT JOIN
    #test t ON tmp.textval = t.textval;

or if what you need to remove Duplicates based on some specific columns, then you can try either of ROW_NUMBER or DENSE_RANK functions.

Example

;WITH CTE
AS
(
SELECT
    RN = ROW_NUMBER() OVER(PARTITION BY tmp.textval ORDER BY t.ID),
    tmp.textval,
    t.ID
FROM
    @tmp tmp LEFT JOIN
    #test t ON tmp.textval = t.textval;
)
SELECT
*
FROM CTE 
WHERE RN = 1

But SQL Server will automatically remove the duplicates if you use a UNION operator and will Display the duplicates for UNION ALL

Upvotes: 2

Related Questions