newToScala
newToScala

Reputation: 407

SQL JOIN with WHERE clause on multiple tables slowdown

The following query is fast:

SELECT a.id, b.id, c.id
FROM a
FULL OUTER JOIN b ON a.id = b.id
FULL OUTER JOIN c ON a.id = c.id
WHERE a.some > 5 AND a.thing < 10

however doing the where clause on multiple tables is causing about a 10 million X performance penalty in my dataset:

SELECT a.id, b.id, c.id
FROM a
FULL OUTER JOIN b ON a.id = b.id
FULL OUTER JOIN c ON a.id = c.id
WHERE (a.some > 5 AND a.thing < 10)
    OR (b.some > 5 AND b.thing < 10)
    OR (c.some > 5 AND c.thing < 10)

How can I improve the query to be performant? Thank you!

EDIT:

here's the sql explain on the actual query (table names are a bit different):

SELECT
    ohh.hour
FROM org_hour_host ohh
FULL OUTER JOIN org_hour_timeseries ohs ON ohh.org_id = ohs.org_id
FULL OUTER JOIN org_hour_vs_host ohah ON ohh.org_id = ohah.org_id
WHERE (ohh.org_id IN (10) OR ohs.org_id IN (10) OR ohah.org_id IN (10))

XN Hash Full Join DS_DIST_OUTER  (cost=6682944.40..234919986923528960.00 rows=1934276754413 width=8)
  Outer Dist Key: "outer".org_id
  Hash Cond: ("outer".org_id = "inner".org_id)
  Filter: (("inner".org_id = 10) OR ("outer".org_id = 10) OR ("outer".org_id = 10))
  ->  XN Hash Full Join DS_DIST_NONE  (cost=3050316.80..38694799792.93 rows=1934276754413 width=16)
        Hash Cond: ("outer".org_id = "inner".org_id)
        ->  XN Seq Scan on org_hour_host ohh  (cost=0.00..3130270.08 rows=313027008 width=12)
        ->  XN Hash  (cost=2440253.44..2440253.44 rows=244025344 width=4)
              ->  XN Seq Scan on org_hour_timeseries ohs  (cost=0.00..2440253.44 rows=244025344 width=4)
  ->  XN Hash  (cost=2906102.08..2906102.08 rows=290610208 width=4)
        ->  XN Seq Scan on org_hour_vs_host ohah  (cost=0.00..2906102.08 rows=290610208 width=4)
(11 rows)





SELECT
    ohh.hour
FROM org_hour_host ohh
FULL OUTER JOIN org_hour_timeseries ohs ON ohh.org_id = ohs.org_id
FULL OUTER JOIN org_hour_vs_host ohah ON ohh.org_id = ohah.org_id
WHERE ohh.org_id IN (10)

XN Merge Left Join DS_DIST_NONE  (cost=0.00..6350089909.81 rows=634262751009 width=8)
  Merge Cond: ("outer".org_id = "inner".org_id)
  ->  XN Merge Left Join DS_DIST_NONE  (cost=0.00..3667829.03 rows=64777233 width=12)
        Merge Cond: ("outer".org_id = "inner".org_id)
        ->  XN Seq Scan on org_hour_host ohh  (cost=0.00..131.03 rows=10483 width=12)
              Filter: (org_id = 10)
        ->  XN Seq Scan on org_hour_timeseries ohs  (cost=0.00..2440253.44 rows=244025344 width=4)
  ->  XN Seq Scan on org_hour_vs_host ohah  (cost=0.00..2906102.08 rows=290610208 width=4)
(8 rows)

Upvotes: 0

Views: 3004

Answers (1)

itsLex
itsLex

Reputation: 786

In your first query the clause a.some > 5 AND a.thing < 10 excludes rows where a.some or a.thing are NULL. This makes the joins LEFT joins. In the second query a.some and a.thing CAN be NULL if for example b.some > 5 AND b.thing < 10 is true. So now the FULL JOIN are actual FULL JOINs delivering many more rows. Finally ORs in WHERE clauses are relatively slow.

Theoretically you could apply the conditions first before joining so there are less rows to filter and less rows to join. Untested, but this would look like:

SELECT a.id, b.id, c.id
FROM (SELECT * FROM tbl_a 
      WHERE some > 5 AND thing < 10) a
FULL OUTER JOIN (SELECT * FROM tbl_b 
      WHERE some > 5 AND thing < 10) b ON a.id = b.id
FULL OUTER JOIN (SELECT * FROM tbl_c 
      WHERE some > 5 AND thing < 10) c ON a.id = c.id

Upvotes: 1

Related Questions