Reputation: 407
The following query is fast:
SELECT a.id, b.id, c.id
FROM a
FULL OUTER JOIN b ON a.id = b.id
FULL OUTER JOIN c ON a.id = c.id
WHERE a.some > 5 AND a.thing < 10
however doing the where clause on multiple tables is causing about a 10 million X performance penalty in my dataset:
SELECT a.id, b.id, c.id
FROM a
FULL OUTER JOIN b ON a.id = b.id
FULL OUTER JOIN c ON a.id = c.id
WHERE (a.some > 5 AND a.thing < 10)
OR (b.some > 5 AND b.thing < 10)
OR (c.some > 5 AND c.thing < 10)
How can I improve the query to be performant? Thank you!
EDIT:
here's the sql explain on the actual query (table names are a bit different):
SELECT
ohh.hour
FROM org_hour_host ohh
FULL OUTER JOIN org_hour_timeseries ohs ON ohh.org_id = ohs.org_id
FULL OUTER JOIN org_hour_vs_host ohah ON ohh.org_id = ohah.org_id
WHERE (ohh.org_id IN (10) OR ohs.org_id IN (10) OR ohah.org_id IN (10))
XN Hash Full Join DS_DIST_OUTER (cost=6682944.40..234919986923528960.00 rows=1934276754413 width=8)
Outer Dist Key: "outer".org_id
Hash Cond: ("outer".org_id = "inner".org_id)
Filter: (("inner".org_id = 10) OR ("outer".org_id = 10) OR ("outer".org_id = 10))
-> XN Hash Full Join DS_DIST_NONE (cost=3050316.80..38694799792.93 rows=1934276754413 width=16)
Hash Cond: ("outer".org_id = "inner".org_id)
-> XN Seq Scan on org_hour_host ohh (cost=0.00..3130270.08 rows=313027008 width=12)
-> XN Hash (cost=2440253.44..2440253.44 rows=244025344 width=4)
-> XN Seq Scan on org_hour_timeseries ohs (cost=0.00..2440253.44 rows=244025344 width=4)
-> XN Hash (cost=2906102.08..2906102.08 rows=290610208 width=4)
-> XN Seq Scan on org_hour_vs_host ohah (cost=0.00..2906102.08 rows=290610208 width=4)
(11 rows)
SELECT
ohh.hour
FROM org_hour_host ohh
FULL OUTER JOIN org_hour_timeseries ohs ON ohh.org_id = ohs.org_id
FULL OUTER JOIN org_hour_vs_host ohah ON ohh.org_id = ohah.org_id
WHERE ohh.org_id IN (10)
XN Merge Left Join DS_DIST_NONE (cost=0.00..6350089909.81 rows=634262751009 width=8)
Merge Cond: ("outer".org_id = "inner".org_id)
-> XN Merge Left Join DS_DIST_NONE (cost=0.00..3667829.03 rows=64777233 width=12)
Merge Cond: ("outer".org_id = "inner".org_id)
-> XN Seq Scan on org_hour_host ohh (cost=0.00..131.03 rows=10483 width=12)
Filter: (org_id = 10)
-> XN Seq Scan on org_hour_timeseries ohs (cost=0.00..2440253.44 rows=244025344 width=4)
-> XN Seq Scan on org_hour_vs_host ohah (cost=0.00..2906102.08 rows=290610208 width=4)
(8 rows)
Upvotes: 0
Views: 3004
Reputation: 786
In your first query the clause a.some > 5 AND a.thing < 10
excludes rows where a.some
or a.thing
are NULL. This makes the joins LEFT joins.
In the second query a.some
and a.thing
CAN be NULL if for example b.some > 5 AND b.thing < 10
is true. So now the FULL JOIN are actual FULL JOINs delivering many more rows. Finally ORs in WHERE clauses are relatively slow.
Theoretically you could apply the conditions first before joining so there are less rows to filter and less rows to join. Untested, but this would look like:
SELECT a.id, b.id, c.id
FROM (SELECT * FROM tbl_a
WHERE some > 5 AND thing < 10) a
FULL OUTER JOIN (SELECT * FROM tbl_b
WHERE some > 5 AND thing < 10) b ON a.id = b.id
FULL OUTER JOIN (SELECT * FROM tbl_c
WHERE some > 5 AND thing < 10) c ON a.id = c.id
Upvotes: 1