Reputation: 9632
Apologies if this has been posted elsewhere; there are so many questions relating to joining on duplicates, I cannot find an answer. Note that this question is not asking how to remove duplicate rows in results.
Take the following scenario, whereby two tables are joined using a text-match, but one of the tables contains a large number of duplicates:
CREATE TABLE #test (ID int PRIMARY KEY IDENTITY(1,1), textval nvarchar(250));
INSERT INTO #test (textval) VALUES (N'Luke'),(N'Han'),(N'Vader');
DECLARE @tmp TABLE (textval nvarchar(250));
INSERT INTO @tmp VALUES (N'Luke'),(N'Luke'),(N'Luke'),(N'Luke'),(N'Luke'),(N'Jabba');
-- Query 1
SELECT
tmp.textval,
t.ID
FROM
@tmp tmp LEFT JOIN
#test t ON tmp.textval = t.textval;
DROP TABLE #test;
I get the desired output here...
...but, is that an efficient way to perform this query? Specifically, I am wondering whether TSQL will create a join on every instance of 'Luke' from the table variable, or does it internally remove the duplicates, and therefore only look for 'Luke' once, instead of five times for the five instances?
I tried to look at the statistics and execution plan, but don't know what figures would signify what is happening.
As per Remus's answer, here is the execution plan from the above query, showing zero rewinds/rebinds.
Here is a screen grab of the rewinds/rebinds figure:
Upvotes: 0
Views: 59
Reputation: 294247
You are asking about how is the Join implemented. There are three main strategies:
None of them will 'eliminate' the duplicates, since doing so would be semantically incorrect. However, both Hash and Merge join would 'visit' the value in the #test
table only once, while Nested Loops may visit the value multiple times (once for each matching row in the @tmp
table). I say may because it depends on which side (left #test
or right @tmp
) is chosen by SQL optimizer as the driver of the nested loop.
Upvotes: 3
Reputation: 8033
There are no Automatic Ways in which SQL Removes the Duplicates
If you want to Remove the Duplicates across all Columns, then You can simply use the DISTINCT
Keyword.
SELECT DISTINCT
tmp.textval,
t.ID
FROM
@tmp tmp LEFT JOIN
#test t ON tmp.textval = t.textval;
or if what you need to remove Duplicates based on some specific columns, then you can try either of ROW_NUMBER
or DENSE_RANK
functions.
Example
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(PARTITION BY tmp.textval ORDER BY t.ID),
tmp.textval,
t.ID
FROM
@tmp tmp LEFT JOIN
#test t ON tmp.textval = t.textval;
)
SELECT
*
FROM CTE
WHERE RN = 1
But SQL Server will automatically remove the duplicates if you use a UNION
operator and will Display the duplicates for UNION ALL
Upvotes: 2