Reputation: 2566
For SQL query performance, what is the best way to handle a null column?
Let's say I have some data objects Person and Employer. A Person can have a reference to an Employer, but they can also be unemployed and that is indicated by having a null reference.
Is it better for DB performance do to a left outer join on that nullable EmployerId field, or do an inner join to a "null" Employer and then in the data retrieval code turn that "null" Employer into an actual null?
(I know this is a trivial example that follows but assume I have many tens of thousands of rows where outer joins start to significantly affect performance)
Outer join on nullable column
TABLE Person (PersonId, Name, EmployerId) (EmployerId is nullable)
(1, 'John Doe', 1)
(2, 'Joe Smith', NULL)
(3, 'Jane Doe', 2)
TABLE Employers (EmployerId, Name)
(1, 'Microsoft')
(2, 'google')
QUERY
SELECT * FROM Person LEFT OUTER JOIN Employers ON Person.EmployerId = Employers.EmployerId
(1, 'John Doe', 1, 'Microsoft')
(2, 'Joe Smith', NULL, NULL)
(3, 'Jane Doe', 2, 'google')
Code doesn't have to do anything beyond the DB read.
Inner join on "null" row
TABLE Person (PersonId, Name, EmployerId) (EmployerId is NOT nullable)
(1, 'John Doe', 1)
(2, 'Joe Smith', 0)
(3, 'Jane Doe', 2)
TABLE Employers (EmployerId, Name)
(0, '{NULL}')
(1, 'Microsoft')
(2, 'google')
QUERY
SELECT * FROM Person INNER JOIN Employers ON Person.EmployerId = Employers.EmployerId
(1, 'John Doe', 1, 'Microsoft')
(2, 'Joe Smith', 0, '{NULL}')
(3, 'Jane Doe', 2, 'google')
Code must do DB read, then detect "{NULL}" and convert to null reference.
Upvotes: 0
Views: 4341
Reputation: 1270773
The second version looks weird to me. NULL
means more like "value is missing" than it means "the corresponding record with NULL as an index". If you were going to use the second version, then use some other value than NULL
.
Typically, when a database engine supports outer joins, the algorithm is amazingly similar to the algorithm for inner joins. If the database supports a nested loop inner join, or an index lookup inner join, or a hash-based inner join, or a marge-sort inner join, then it support the same algorithms for outer join. The difference in performance would be more due to the larger result set for the outer join.
By the way, in your now modified second example, the effect is the same as an inner join. The query is:
SELECT *
FROM Person INNER JOIN
Employers
ON Person.EmployerId = Employers.EmployerId
The join
condition fails when either or both EmployerId
s are NULL
. NULL
fails all comparisons, except is null
. So, the second example will not fetch the employer record with NULL
on it.
You can change the on
condition to fix this. Here are two ways:
on (Person.EmployerId = Employers.EmployerId) or
(Person.EmployerId is NULL and Employers.EmployerId is null)
on coalesce(Person.EmployerId, -1) = coalesce(Emlpoyers.EmployerId, -1)
Then the joins are no longer the same in the two cases. And, these joins might perform quite differently from the first case. In particular, or
conditions and functions on columns impede the use of indexes for joins.
Upvotes: 3