Reputation: 521
To simplify the case let's assume there are the following 3 tables
A(a_id), B(b_id,val_b), C(a_id,b_id,val_c)
I need to find all a_id that have certain value pairs from B and C. Example find all a_id which have records (val_b='1' and val_c='2' and B.b_id=C.b_id) AND (val_b='3' and val_c='4' and B.b_id=C.b_id) AND ...
select A.a_id
from A
where (A.a_id in
(select C.a_id
from B, C
where B.b_id=C.b_id and B.val_b='1' and C.val_c='2') and
A.a_id in
(select C.a_id
from B, C
where B.b_id=C.b_id and B.val_b='3' and C.val_c='4') and
A.a_id in
(select C.a_id
from B, C
where B.b_id=C.b_id and B.val_b='5' and C.val_c='6'));
What I noticed is that by adding a few more (val_b,val_c) additional pairs postgres takes substantial time to perform the query. To note thatindexes are present for ids, val_b and val_c.
Is there a way to optimize the query? Tried explicit inner joins but didn't help to improve the performance.
Thanks in advance
More info:
With 3 pairs of criteria - 49483.979 ms <-- performance sparks crazy
Note that separate subquery itself runs under ~62ms.
Update:
Both separate INTERSECT query version suggested below by Vladimir Baranov and the version with having clause that uses bool_or aggregation function by Clodoaldo Neto performed much better. Thank you !
However, the question remains why postgres 8.2 has such a performance spark with original query starting with 3 pair criteria?
BTW, I noticed the same spark also with Vladimir Baranov`s first suggestion to rewrite the query with clean joins. See below:
SELECT A.a_id
FROM
A
INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='1' and C.val_c='2') Set1 ON Set1.a_id = A.a_id
INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='3' and C.val_c='4') Set2 ON Set2.a_id = A.a_id
INNER JOIN (SELECT C.a_id FROM B INNER JOIN C ON B.b_id=C.b_id WHERE B.val_b='5' and C.val_c='6') Set3 ON Set3.a_id = A.a_id
;
With 3 sets th query runs quite fast, but as soon as one adds another 3-4 sets the query performance degrades to ~30-40 secs.
Upvotes: 1
Views: 2148
Reputation: 32685
It would be interesting to see if the following runs faster:
SELECT A.a_id
FROM A
WHERE
A.a_id IN
(
SELECT C.a_id
FROM B INNER JOIN C ON B.b_id=C.b_id
WHERE B.val_b='1' and C.val_c='2'
INTERSECT
SELECT C.a_id
FROM B INNER JOIN C ON B.b_id=C.b_id
WHERE B.val_b='3' and C.val_c='4'
INTERSECT
SELECT C.a_id
FROM B INNER JOIN C ON B.b_id=C.b_id
WHERE B.val_b='5' and C.val_c='6'
)
;
Effectively, instead of multiple IN
here are explicit intersection of multiple subsets.
My original answer had a query that was not equivalent to the original query of the question.
Here is SQL Fiddle with some sample data and original query to check that my variant produces same results as original query.
edit
One more path to investigate. If each of the sub-queries run quickly, but INTERSECT
repeated many times in one long query becomes very slow, then you can try to populate a temporary table with the results of sub-queries and then use this temporary table with the main table A
. Effectively, implement INTERSECT
manually one-set-at-a-time using explicit temporary table. Depending on the number of rows returned by sub-queries it may be beneficial to add an index to the temporary table.
update
As for your question why Postgres performance degrades when query becomes complex... Your version of Postgres is rather old and it is unlikely that somebody would be interested enough to investigate in detail. I can offer only some generic thoughts. The latest version most likely would perform differently, there were a lot of changes since 8.2.
In every RDBMS the query optimizer has limited resources and time to analyze the query, so they use a lot of heuristics. As number of joins in the query increases the complexity of the problem to find the optimal execution plan increases exponentially, so there must be a threshold after which the optimizer gives up and picks whatever plan he's got.
You should be able to observe it. Examine the execution plan of the fast query, add another join to make the query slow and compare the plans. Most likely the plans would be very different. You should be able to determine what paths optimizer chooses in each case.
It could be that when given a query with few joins
optimizer is able to transform it to a variant equivalent to using intersect
, but with large number of joins it can't do it any more and just follows the query flow doing join after join. It may even do it so inefficiently, that it ends up doing loop inside the loop inside the loop..., in other words the complexity jumps from linear to quadratic or worse.
So, really, the only answer to such performance questions is: examine the execution plan.
BTW, The latest versions of Postgres have WITH
, which effectively creates a temporary table with intermediate results. It should help in your case greatly, because each of your subqueries is simple and if the system runs all of them separately at first, then it would be easy to combine results together.
Upvotes: 1
Reputation: 125204
select a_id
from
a
inner join
c using (a_id)
inner join
b using (b_id)
group by a_id
having
bool_or((val_b, val_c) = (1,2)) and
bool_or((val_b, val_c) = (3,4)) and
bool_or((val_b, val_c) = (5,6))
http://www.postgresql.org/docs/8.2/static/functions-aggregate.html
Upvotes: 1
Reputation: 4503
JOIN
syntax for clarityEXISTS(...)
instead of IN(...)
for speed and comfortSELECT A.a_id
FROM A
WHERE EXISTS (
SELECT *
FROM B
JOIN C ON B.b_id = C.b_id AND B.val_b = '1'
WHERE C.a_id = A.a_id AND C.val_c = '2'
)
AND EXISTS (
SELECT *
FROM B
JOIN C ON B.b_id = C.b_id AND B.val_b = '3'
WHERE C.a_id = A.a_id AND C.val_c = '4'
)
AND EXISTS (
SELECT *
FROM B
JOIN C ON B.b_id = C.b_id AND B.val_b = '5'
WHERE C.a_id = A.a_id AND C.val_c = '6'
)
;
Upvotes: 0
Reputation: 1521
Each subquery has to hit the indexes again, which increases the overhead of the query by several times. If I understand what your asking, This is a case for the Or operator:
select a.a_id
from A
join c on a.a_id = c.a_id
join b on b.b_id = c.b_id
where
(
(b.val_b = '1' and c.val_c = '2')
or (b.val_b = '3' and c.val_c = '4')
or (b.val_b = '5' and c.val_c = '6')
)
This will give you all A records that are linked to a C record where the c and b values are one of the sets you mentioned. Hope this helps :)
edit Seems it's a many to one:
Select a.a_id
, sum(case when b.val_b = '1' and c.val_c = '2' then 1 else 0 end) as Condition1
, Sum(case when b.val_b = '3' and c.val_c = '4' then 1 else 0 end) as Condition2
, Sum(case when b.val_b = '5' and c.val_c = '6' then 1 else 0 end) as Condition3
from A
join c on a.a_id = c.a_id
join b on b.b_id = c.b_id
group by a.a_id
having sum(case when b.val_b = '1' and c.val_c = '2' then 1 else 0 end) > 0
and Sum(case when b.val_b = '3' and c.val_c = '4' then 1 else 0 end) > 0
and Sum(case when b.val_b = '5' and c.val_c = '6' then 1 else 0 end) > 0
Hopefully that gets you there,
Upvotes: 0