PhilChuang
PhilChuang

Reputation: 2566

For a nullable joined column, is it better performance to do a outer join, or inner join to a "null" row?

For SQL query performance, what is the best way to handle a null column?

Let's say I have some data objects Person and Employer. A Person can have a reference to an Employer, but they can also be unemployed and that is indicated by having a null reference.

Is it better for DB performance do to a left outer join on that nullable EmployerId field, or do an inner join to a "null" Employer and then in the data retrieval code turn that "null" Employer into an actual null?

(I know this is a trivial example that follows but assume I have many tens of thousands of rows where outer joins start to significantly affect performance)

Outer join on nullable column

TABLE Person (PersonId, Name, EmployerId) (EmployerId is nullable)

(1, 'John Doe', 1)  
(2, 'Joe Smith', NULL)  
(3, 'Jane Doe', 2)  

TABLE Employers (EmployerId, Name)

(1, 'Microsoft')  
(2, 'google')  

QUERY

SELECT * FROM Person LEFT OUTER JOIN Employers ON Person.EmployerId = Employers.EmployerId

(1, 'John Doe', 1, 'Microsoft')  
(2, 'Joe Smith', NULL, NULL)  
(3, 'Jane Doe', 2, 'google')  

Code doesn't have to do anything beyond the DB read.

Inner join on "null" row

TABLE Person (PersonId, Name, EmployerId) (EmployerId is NOT nullable)

(1, 'John Doe', 1)  
(2, 'Joe Smith', 0)  
(3, 'Jane Doe', 2)  

TABLE Employers (EmployerId, Name)

(0, '{NULL}')  
(1, 'Microsoft')  
(2, 'google')  

QUERY

SELECT * FROM Person INNER JOIN Employers ON Person.EmployerId = Employers.EmployerId

(1, 'John Doe', 1, 'Microsoft')  
(2, 'Joe Smith', 0, '{NULL}')  
(3, 'Jane Doe', 2, 'google') 

Code must do DB read, then detect "{NULL}" and convert to null reference.

Upvotes: 0

Views: 4341

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270773

The second version looks weird to me. NULL means more like "value is missing" than it means "the corresponding record with NULL as an index". If you were going to use the second version, then use some other value than NULL.

Typically, when a database engine supports outer joins, the algorithm is amazingly similar to the algorithm for inner joins. If the database supports a nested loop inner join, or an index lookup inner join, or a hash-based inner join, or a marge-sort inner join, then it support the same algorithms for outer join. The difference in performance would be more due to the larger result set for the outer join.

By the way, in your now modified second example, the effect is the same as an inner join. The query is:

SELECT *
FROM Person INNER JOIN
     Employers
     ON Person.EmployerId = Employers.EmployerId

The join condition fails when either or both EmployerIds are NULL. NULL fails all comparisons, except is null. So, the second example will not fetch the employer record with NULL on it.

You can change the on condition to fix this. Here are two ways:

     on (Person.EmployerId = Employers.EmployerId) or
        (Person.EmployerId is NULL and  Employers.EmployerId is null)

     on coalesce(Person.EmployerId, -1) = coalesce(Emlpoyers.EmployerId, -1)

Then the joins are no longer the same in the two cases. And, these joins might perform quite differently from the first case. In particular, or conditions and functions on columns impede the use of indexes for joins.

Upvotes: 3

Related Questions