Dan Gravell
Dan Gravell

Reputation: 8275

Join subquery leads to index scan over large number of rows

I am writing a query which uses aggregate functions to reduce data duplication, as the query joins twelve tables.

Consider the query (simplified to show the bottleneck):

SELECT r.source_uri       AS su_on_r, 
       tag.voted_tag      AS vt_on_tag,
       tag.votes      AS v_on_tag,  
FROM   release r 
       INNER JOIN release_barcode barcode 
          ON r.source_uri = barcode.source_uri AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
       LEFT JOIN (
              SELECT source_uri, string_agg(voted_tag, '|') as voted_tag, string_agg(votes::text, '|') as votes
              FROM release_voted_tag
              GROUP BY source_uri
              ) tag
              ON r.source_uri = tag.source_uri 

The filter on the release_barcode restricts the number of rs to 21 (from around 8m).

The plan for the LEFT JOIN looks like:

->  Merge Left Join  (cost=1461.05..157205.05 rows=125 width=242) (actual time=23.322..1994.827 rows=21 loops=1)    
      Merge Cond: ((r.source_uri)::text = (release_voted_tag.source_uri)::text)                                     
      ->  Sort  (cost=1460.50..1460.81 rows=125 width=178) (actual time=0.974..0.991 rows=21 loops=1)               
            Sort Key: r.source_uri             
            Sort Method: quicksort  Memory: 30kB                                                                    
            ->  Nested Loop  (cost=0.99..1456.15 rows=125 width=178) (actual time=0.071..0.870 rows=21 loops=1)     
                  ->  Index Scan using release_barcode_barcode_idx on release_barcode barcode  (cost=0.43..382.71 rows=125 width=62) (actual time=0.029..0.061 rows=21 loops=1)          
                        Index Cond: ((barcode)::text = ANY ('{75992731324,075992731324,0075992731324}'::text[]))    
                  ->  Index Scan using release_source_uri_idx on release r  (cost=0.56..8.58 rows=1 width=169) (actual time=0.037..0.037 rows=1 loops=21)                                
                        Index Cond: ((source_uri)::text = (barcode.source_uri)::text)                               
      ->  Materialize  (cost=0.55..155340.82 rows=161233 width=132) (actual time=0.026..1625.598 rows=321318 loops=1)                                                                    
            ->  GroupAggregate  (cost=0.55..153325.41 rows=161233 width=132) (actual time=0.024..1446.457 rows=321318 loops=1)                                                           
                  Group Key: release_voted_tag.source_uri                                                           
                  ->  Index Scan using release_voted_tag_source_uri_idx on release_voted_tag  (cost=0.55..136510.34 rows=1151726 width=82) (actual time=0.007..647.964 rows=1151726 loops=1)  

Here's the plan in full (which shows the full query including filter clause) :

https://explain.depesz.com/s/dkNC

It looks to me like the issue is the number of rows returned by the LEFT JOIN.

The number, over 1m, is nowhere near the amount I would expect to see if the filter for the rs is applied. I would expect 84 rows returned, which would be the equivalent of:

select release_barcode.source_uri,voted_tag 
from release_barcode,release_voted_tag 
where release_voted_tag.source_uri=release_barcode.source_uri and barcode IN ( '75992731324', '075992731324', '0075992731324');

I assumed this would work to restrict the number of records selected from release_voted_tag because of the ON filter applied outside of the query.

Original try

As mentioned there are other 1:M joins involved. I originally wrote this as:

SELECT r.source_uri                        AS su_on_r, 
       string_agg(tag.voted_tag, '|')      AS vt_on_tag,
       string_agg(tag.votes::text, '|')    AS v_on_tag,  
       t.title,
       string_agg(distinct tComposer.composer, '|') AS c_on_tComposer 
FROM release r 
JOIN release_barcode barcode 
  ON r.source_uri = barcode.source_uri 
 AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri 
LEFT JOIN medium m 
  ON r.source_uri = m.source_uri 
LEFT JOIN track t 
  ON m.id = t.medium 
LEFT JOIN track_composer tComposer 
  ON t.id = tComposer.track
GROUP BY r.source_uri, t.title;

However because of the join to the track_composer via medium and track we end up with multiple release_voted_tag rows that are aggregated when there are multiple track_composers. For example, if there are two track_composers the string_agg(tag.voted_tag, '|') is repeated.

Note we'd have to be careful about using distinct because the tag.voted_tag and tag.votes must be correlated later.

I found I could fix that with a correlated subquery for track_composer which performs the aggregation, but this is not very performant is it? It runs for each row?

And this is why I moved to a subquery in a join, because that way I could place the aggregation in the join and be assured that only one row would be returned, making the join to other 1:M tables... sane.

So the question then...

Why is an expensive Merge Left Join enacted and how can I make this more performant?

Upvotes: 0

Views: 277

Answers (2)

alaniane
alaniane

Reputation: 92

The reason you are getting a million rows is that the inner query is being run first and then the filter is being applied. If you want the filter to be applied first then you will need to either add it to the inner query (which is probably not a very good idea since it will create a correlated query) rewrite the join as noted in one of the other replies (lad2025's)

Upvotes: 0

Lukasz Szozda
Lukasz Szozda

Reputation: 176144

I would rewrite it as:

SELECT r.source_uri                        AS su_on_r, 
       string_agg(tag.voted_tag, '|')      AS vt_on_tag,
       string_agg(tag.votes::text, '|')    AS v_on_tag,  
FROM release r 
JOIN release_barcode barcode 
  ON r.source_uri = barcode.source_uri 
 AND barcode.barcode IN ( '75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri 
GROUP BY r.source_uri;

or even:

SELECT r.source_uri                        AS su_on_r, 
       string_agg(tag.voted_tag, '|')      AS vt_on_tag,
       string_agg(tag.votes::text, '|')    AS v_on_tag,  
FROM release r 
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri 
WHERE r.source_uri IN (SELECT source_uri FROM release_barcode WHERE
                       barcode IN ('75992731324','075992731324', '0075992731324')
GROUP BY r.source_uri;

Upvotes: 1

Related Questions