Reputation: 8275
I am writing a query which uses aggregate functions to reduce data duplication, as the query joins twelve tables.
Consider the query (simplified to show the bottleneck):
SELECT r.source_uri AS su_on_r,
tag.voted_tag AS vt_on_tag,
tag.votes AS v_on_tag,
FROM release r
INNER JOIN release_barcode barcode
ON r.source_uri = barcode.source_uri AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
LEFT JOIN (
SELECT source_uri, string_agg(voted_tag, '|') as voted_tag, string_agg(votes::text, '|') as votes
FROM release_voted_tag
GROUP BY source_uri
) tag
ON r.source_uri = tag.source_uri
The filter on the release_barcode restricts the number of r
s to 21 (from around 8m).
The plan for the LEFT JOIN
looks like:
-> Merge Left Join (cost=1461.05..157205.05 rows=125 width=242) (actual time=23.322..1994.827 rows=21 loops=1)
Merge Cond: ((r.source_uri)::text = (release_voted_tag.source_uri)::text)
-> Sort (cost=1460.50..1460.81 rows=125 width=178) (actual time=0.974..0.991 rows=21 loops=1)
Sort Key: r.source_uri
Sort Method: quicksort Memory: 30kB
-> Nested Loop (cost=0.99..1456.15 rows=125 width=178) (actual time=0.071..0.870 rows=21 loops=1)
-> Index Scan using release_barcode_barcode_idx on release_barcode barcode (cost=0.43..382.71 rows=125 width=62) (actual time=0.029..0.061 rows=21 loops=1)
Index Cond: ((barcode)::text = ANY ('{75992731324,075992731324,0075992731324}'::text[]))
-> Index Scan using release_source_uri_idx on release r (cost=0.56..8.58 rows=1 width=169) (actual time=0.037..0.037 rows=1 loops=21)
Index Cond: ((source_uri)::text = (barcode.source_uri)::text)
-> Materialize (cost=0.55..155340.82 rows=161233 width=132) (actual time=0.026..1625.598 rows=321318 loops=1)
-> GroupAggregate (cost=0.55..153325.41 rows=161233 width=132) (actual time=0.024..1446.457 rows=321318 loops=1)
Group Key: release_voted_tag.source_uri
-> Index Scan using release_voted_tag_source_uri_idx on release_voted_tag (cost=0.55..136510.34 rows=1151726 width=82) (actual time=0.007..647.964 rows=1151726 loops=1)
Here's the plan in full (which shows the full query including filter clause) :
https://explain.depesz.com/s/dkNC
It looks to me like the issue is the number of rows returned by the LEFT JOIN
.
The number, over 1m, is nowhere near the amount I would expect to see if the filter for the r
s is applied. I would expect 84 rows returned, which would be the equivalent of:
select release_barcode.source_uri,voted_tag
from release_barcode,release_voted_tag
where release_voted_tag.source_uri=release_barcode.source_uri and barcode IN ( '75992731324', '075992731324', '0075992731324');
I assumed this would work to restrict the number of records selected from release_voted_tag
because of the ON
filter applied outside of the query.
As mentioned there are other 1:M joins involved. I originally wrote this as:
SELECT r.source_uri AS su_on_r,
string_agg(tag.voted_tag, '|') AS vt_on_tag,
string_agg(tag.votes::text, '|') AS v_on_tag,
t.title,
string_agg(distinct tComposer.composer, '|') AS c_on_tComposer
FROM release r
JOIN release_barcode barcode
ON r.source_uri = barcode.source_uri
AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
ON r.source_uri = tag.source_uri
LEFT JOIN medium m
ON r.source_uri = m.source_uri
LEFT JOIN track t
ON m.id = t.medium
LEFT JOIN track_composer tComposer
ON t.id = tComposer.track
GROUP BY r.source_uri, t.title;
However because of the join to the track_composer
via medium
and track
we end up with multiple release_voted_tag
rows that are aggregated when there are multiple track_composer
s. For example, if there are two track_composer
s the string_agg(tag.voted_tag, '|')
is repeated.
Note we'd have to be careful about using distinct
because the tag.voted_tag
and tag.votes
must be correlated later.
I found I could fix that with a correlated subquery for track_composer
which performs the aggregation, but this is not very performant is it? It runs for each row?
And this is why I moved to a subquery in a join, because that way I could place the aggregation in the join and be assured that only one row would be returned, making the join to other 1:M tables... sane.
Why is an expensive Merge Left Join
enacted and how can I make this more performant?
Upvotes: 0
Views: 277
Reputation: 92
The reason you are getting a million rows is that the inner query is being run first and then the filter is being applied. If you want the filter to be applied first then you will need to either add it to the inner query (which is probably not a very good idea since it will create a correlated query) rewrite the join as noted in one of the other replies (lad2025's)
Upvotes: 0
Reputation: 176144
I would rewrite it as:
SELECT r.source_uri AS su_on_r,
string_agg(tag.voted_tag, '|') AS vt_on_tag,
string_agg(tag.votes::text, '|') AS v_on_tag,
FROM release r
JOIN release_barcode barcode
ON r.source_uri = barcode.source_uri
AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
ON r.source_uri = tag.source_uri
GROUP BY r.source_uri;
or even:
SELECT r.source_uri AS su_on_r,
string_agg(tag.voted_tag, '|') AS vt_on_tag,
string_agg(tag.votes::text, '|') AS v_on_tag,
FROM release r
LEFT JOIN release_voted_tag tag
ON r.source_uri = tag.source_uri
WHERE r.source_uri IN (SELECT source_uri FROM release_barcode WHERE
barcode IN ('75992731324','075992731324', '0075992731324')
GROUP BY r.source_uri;
Upvotes: 1