David
David

Reputation: 1065

SELECT ... WHERE IN vs JOIN

Assume we have 4 tables A, B, C, D in some unspecified relational SQL database. A references B, also C and D as well. References means that A has column A.refX_id = X.id, X is A, B and C (common foreign key 1:N).

What I want, is to query table A with condition based on columns from all child tables B, C and D. My question is: What of the following variants is generally better? (In terms of usability, efficiency, speed.)

Variant 1:

SELECT DISTINCT A.* FROM A
   JOIN B ON A.refB_id = B.id
   JOIN C ON A.refC_id = C.id
   JOIN D ON A.refD_id = D.id
   WHERE <condition on B> AND <condition on C> AND <condition on D>;

Which I like more from perspective of a database, but looks a little bit harder to program.

Variant 2:

SELECT id FROM B WHERE <condition on B>; # result store to array "BIds" on program side
SELECT id FROM C WHERE <condition on C>; # result store to array "CIds" on program side
SELECT id FROM D WHERE <condition on D>; # result store to array "DIds" on program side

SELECT A.* FROM A
   WHERE refB_id IN (<B_ids>) AND refC_id IN (<C_ids>) AND refD_id IN (<D_ids>);

# <B_ids> menas expand whole array of ids, which can result in a very long query string

I thought that Variant 2 is completelly smut and unusable with potentially large data. But I've heard, that a lot of frameworks normally use it, because it is relativelly straightforward. Is it sort of legal way to query data like this in general case, if I know that the content of the "IN" clause is taken from a result of another query(ies)?

Upvotes: 3

Views: 3478

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269803

I would encourage you to use IN or EXISTS:

SELECT A.*
FROM A
WHERE EXISTS (SELECT 1 FROM B WHERE A.refB_id = B.id AND <condition on B>) AND
      EXISTS (SELECT 1 FROM C WHERE A.refC_id = C.id AND <condition on C>) AND
      EXISTS (SELECT 1 FROM D WHERE A.refD_id = D.id AND <condition on D>);

Advantages of this approach:

  • No danger of getting large intermediate Cartesian products.
  • No expense of eliminating duplicates for SELECT DISTINCT.
  • Most databases handle EXISTS pretty well.
  • You can optimize each subquery using indexes.

EDIT:

You can write this with IN and subqueries:

SELECT A.*
FROM A
WHERE A.refB_id IN (SELECT B.id FROM B WHERE <condition on B>) AND
      A.refC_id IN (SELECT C.id FROM C WHERE <condition on C>) AND
      A.refD_id IN (SELECT D.id FROM D WHERE <condition on D>);

Upvotes: 4

Rahul
Rahul

Reputation: 77876

Not sure which framework uses the 2nd approach but first approach is what I will go for and moreover everyone else will. If you have proper index created on the join column for all tables then 1st approach will produce a much better plan than the 2nd one cause you have multiple IN clause and what if every IN has to work on millions of elements???.

Also, I would change the INNER JOIN to LEFT JOIN assuming that not all ID's may match and move the WHERE conditions to JOIN ON conditions like

SELECT DISTINCT A.* FROM A
   LEFT JOIN B ON A.refB_id = B.id AND <condition on B>
   LEFT JOIN C ON A.refC_id = C.id AND <condition on C>
   LEFT JOIN D ON A.refD_id = D.id AND <condition on D>;

Upvotes: 3

Related Questions