Reputation: 3
I am experienced with Access and about 12 months into SQL Server SSMS.
I am not getting results I expect with a left outer join, and I don't know why. Maybe I don't understand something.
I have Table 1 (the left side) with 600k products I have table 2 with 150,000 products (sub set of table 1).
When I do this
SELECT [Product_Code], [Product_Desc], Store
FROM [Product Range]
I get 600,000 records
When I do a left join like this
SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
FROM [Product Range] as r
LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
WHERE soh.CalYearWeek=1512
I get 500k records. But I am confused. I thought a left join was supposed to return me all records from my left table regardless of anything else.
I then tried this (and I don't know why I would need to add the Null condition anyway)
SELECT [Product_Code],[Product_Desc],r.store,soh.SOH
FROM [Product Range] as r
LEFT OUTER JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
WHERE soh.CalYearWeek=1512 or soh.CalYearWeek is null
and I got 550,000 records - still not the full 600k.
I am completely confused and don't know what is wrong. Can anyone help me please :-)
Matt
Upvotes: 0
Views: 2397
Reputation: 35790
In fact the problem is not in WHERE
clause. The problem, if you can call this a problem, is in JOIN
itself and how it behaves. In fact you can get exactly 600K rows, no rows at all, less then 600K rows or even more then 600K rows. It depends on data in those tables.
You should understand difference between putting predicates in JOIN
condition and WHERE
clause. There is a big difference. Also you should understand how predicates work with NULL
s.
If you have a row with code 'A' in left table, and no row with code 'A' in right table you will get one row from left table and NULLs from right table. If in right table you have one row with code 'A' you will get 1 row from left and one row from right. If you have N
rows with code 'A' in left table and M
rows with code 'A' in right one, you will get M*N
rows in result.
To summarize here is formula for calculating number of rows in result set when using LEFT JOIN
:
COUNT = Count of rows from left table where there are no corresponding rows from right table + SUM(COUNT(code[i])*COUNT(code[i]))
, i.e. sum of cartesian product of counts of distinct matching codes from both tables.
You get at least 600K rows after left join. In year
column you can get NULLs in two ways: 1. there was no corresponding row for code in right table, 2. there was corresponding row from right table but column year is NULL itself.
When you are further filtering resultset with soh.CalYearWeek=1512
, rows with NULLs and different values are eliminated from result.
Consider example:
DECLARE @t1 TABLE(Code INT)
DECLARE @t2 TABLE(Code INT, Year INT)
INSERT INTO @t1 VALUES
(1), (2), (3)
SELECT * FROM @t1 t1
JOIN @t2 t2 ON t2.Code = t1.Code
WHERE t2.Year = 1512
And now different results depending on data in second table:
--count 1
INSERT INTO @t2 VALUES
(1, 1512)
--count 0
INSERT INTO @t2 VALUES
(1, NULL)
--count 3
INSERT INTO @t2 VALUES
(1, 1512), (1, 1512), (1, 1512)
--count 6
INSERT INTO @t2 VALUES
(1, 1512), (2, 1512), (2, 1512), (3, 1512), (3, 1512), (3, 1512)
Upvotes: 0
Reputation: 425318
The problem us the WHERE
conditions are executed after the join is made, so soh.CalYearWeek=1512
will only be true for successful joins - missed joins have all nulls, and the where clause filters them out.
The solution is simple: Move the condition into the join:
SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
FROM [Product Range] as r
LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code
AND r.store = soh.store
AND soh.CalYearWeek=1512
Conditions on the join are executed as the join is being made, so you'll still get your left join, but only to rows in the right table that have that special condition.
Putting non-null conditions on the right table in the WHERE
clause effectively turns a LEFT join into an INNER join, since the right table can only have a non-null value if the join was successful.
Upvotes: 4
Reputation: 782
You're correct in that a basic left join with no WHERE clauses will return a row for all records in the LEFT table with either data for the RIGHT table when it exists, or NULL where it doesn't.
And that is what you're getting, but then you're adding a WHERE clause which will filter out certain rows. So if you just had :
SELECT [Product_Code] ,[Product_Desc] ,r.store ,soh.SOH
FROM [Product Range] as r left join [dbo].SOH as soh
on r.[Product_Code] = soh.PRODUCT_Code
and r.store = soh.store
Then you would be seeing 600k records returned. But then you're removing the 100k records where soh.CalYearWeek is not 1512 with the line :
WHERE soh.CalYearWeek=1512
By adding the :
or soh.CalYearWeek is null
You are adding back 50k more records where that is true. So basically, the WHERE clause acts upon the whole set of records at that time (after the join has taken place) and filters out rows which don't match. The mention of RIGHTTABLE.COLUMN in a where clause is really just because by then, the column in the full row is decribed by that full identifier rather than just its column name alone.
Upvotes: 0