Matt
Matt

Reputation: 3

Left outer join with Where Clause

I am experienced with Access and about 12 months into SQL Server SSMS.

I am not getting results I expect with a left outer join, and I don't know why. Maybe I don't understand something.

I have Table 1 (the left side) with 600k products I have table 2 with 150,000 products (sub set of table 1).

When I do this

SELECT [Product_Code], [Product_Desc], Store
  FROM [Product Range] 

I get 600,000 records

When I do a left join like this

    SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
      FROM [Product Range] as r
 LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code 
       AND r.store = soh.store
      WHERE soh.CalYearWeek=1512 

I get 500k records. But I am confused. I thought a left join was supposed to return me all records from my left table regardless of anything else.

I then tried this (and I don't know why I would need to add the Null condition anyway)

         SELECT [Product_Code],[Product_Desc],r.store,soh.SOH
           FROM [Product Range] as r
LEFT OUTER JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code 
            AND r.store = soh.store
          WHERE soh.CalYearWeek=1512  or soh.CalYearWeek is null

and I got 550,000 records - still not the full 600k.

I am completely confused and don't know what is wrong. Can anyone help me please :-)

Matt

Upvotes: 0

Views: 2397

Answers (3)

Giorgi Nakeuri
Giorgi Nakeuri

Reputation: 35790

In fact the problem is not in WHERE clause. The problem, if you can call this a problem, is in JOIN itself and how it behaves. In fact you can get exactly 600K rows, no rows at all, less then 600K rows or even more then 600K rows. It depends on data in those tables.

You should understand difference between putting predicates in JOIN condition and WHERE clause. There is a big difference. Also you should understand how predicates work with NULLs.

If you have a row with code 'A' in left table, and no row with code 'A' in right table you will get one row from left table and NULLs from right table. If in right table you have one row with code 'A' you will get 1 row from left and one row from right. If you have N rows with code 'A' in left table and M rows with code 'A' in right one, you will get M*N rows in result.

To summarize here is formula for calculating number of rows in result set when using LEFT JOIN:

COUNT = Count of rows from left table where there are no corresponding rows from right table + SUM(COUNT(code[i])*COUNT(code[i])), i.e. sum of cartesian product of counts of distinct matching codes from both tables.

You get at least 600K rows after left join. In year column you can get NULLs in two ways: 1. there was no corresponding row for code in right table, 2. there was corresponding row from right table but column year is NULL itself.

When you are further filtering resultset with soh.CalYearWeek=1512, rows with NULLs and different values are eliminated from result.

Consider example:

DECLARE @t1 TABLE(Code INT)
DECLARE @t2 TABLE(Code INT, Year INT)

INSERT INTO @t1 VALUES
(1), (2), (3)

SELECT * FROM @t1 t1
JOIN @t2 t2 ON t2.Code = t1.Code
WHERE t2.Year = 1512

And now different results depending on data in second table:

--count 1
INSERT INTO @t2 VALUES
(1, 1512)

--count 0
INSERT INTO @t2 VALUES
(1, NULL)

--count 3
INSERT INTO @t2 VALUES
(1, 1512), (1, 1512), (1, 1512)

--count 6
INSERT INTO @t2 VALUES
(1, 1512), (2, 1512), (2, 1512), (3, 1512), (3, 1512), (3, 1512)

Upvotes: 0

Bohemian
Bohemian

Reputation: 425318

The problem us the WHERE conditions are executed after the join is made, so soh.CalYearWeek=1512 will only be true for successful joins - missed joins have all nulls, and the where clause filters them out.

The solution is simple: Move the condition into the join:

SELECT [Product_Code], [Product_Desc], r.store, soh.SOH
FROM [Product Range] as r
LEFT JOIN [dbo].SOH as soh on r.[Product_Code] = soh.PRODUCT_Code 
   AND r.store = soh.store
   AND soh.CalYearWeek=1512

Conditions on the join are executed as the join is being made, so you'll still get your left join, but only to rows in the right table that have that special condition.

Putting non-null conditions on the right table in the WHERE clause effectively turns a LEFT join into an INNER join, since the right table can only have a non-null value if the join was successful.

Upvotes: 4

mcr
mcr

Reputation: 782

You're correct in that a basic left join with no WHERE clauses will return a row for all records in the LEFT table with either data for the RIGHT table when it exists, or NULL where it doesn't.

And that is what you're getting, but then you're adding a WHERE clause which will filter out certain rows. So if you just had :

SELECT [Product_Code] ,[Product_Desc] ,r.store ,soh.SOH
FROM [Product Range] as r left join [dbo].SOH as soh 
                          on r.[Product_Code] = soh.PRODUCT_Code 
                             and r.store = soh.store

Then you would be seeing 600k records returned. But then you're removing the 100k records where soh.CalYearWeek is not 1512 with the line :

WHERE soh.CalYearWeek=1512

By adding the :

or soh.CalYearWeek is null

You are adding back 50k more records where that is true. So basically, the WHERE clause acts upon the whole set of records at that time (after the join has taken place) and filters out rows which don't match. The mention of RIGHTTABLE.COLUMN in a where clause is really just because by then, the column in the full row is decribed by that full identifier rather than just its column name alone.

Upvotes: 0

Related Questions