Steve Estes

Reputation: 123

Multi-column join + aggregate subquery resulting in infinite run time

I thought I knew what I was doing with queries, but this has more-or-less ruined my 4th of July holiday so I think it's high time to ask for help. I imagine it's something trivial and obvious but it's escaping me for now.

I've got 3 tables in question here:

prod_account_details => 538k records

line_number PK,
acct_id integer, -- nearly unique, but not quite; acct_id PLUS address is unique
address varchar(100),
entity_name varchar(100),
(... a few other fields ...)

Indexes: btree on acct_id, hash on address, also btree on (acct_id, address).

prod_customer_profiles => 532k records, nearly 1:1 with account details

acct_id integer, -- corresponds to account_details.acct_id
match_input_address varchar(100), -- corresponds to account_details.address
(... lots of other fields ...)

Indexes: btree on acct_id, hash on match_input_address, also btree on (acct_id, match_input_address).

prod_billing_letters => 518k records, 1:M with account_details (covering ~181k acct_ids)

line_number PK,
acct_id integer, -- FK to account_details, not formally via constraint but semantically
(... a few other fields ...)

Indexes: btree on acct_id.

--- Queries ---

So firstly, what works:

(1)

SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id 
    AND ad.address = acx.match_input_address;

That returns the 532k matched records in 3.0 seconds.

Now if we INNER join to the billing_letters table via a grouped subquery:

(2)

SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
    INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id 
        AND ad.address = acx.match_input_address
    INNER JOIN (SELECT count(*) AS num_records, acct_id FROM prod_billing_letters GROUP BY acct_id ) dl ON ad.acct_id = dl.acct_id;

That returns fine too: 161k records in 1.3 seconds.

But now if I do something as simple as changing that second join (to the subquery) to a LEFT JOIN, it blows up. That is, I want all valid accounts, appended with the count of records in the billing_letters table if any exist:

(3)

SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
    INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id 
        AND ad.address = acx.match_input_address
    LEFT JOIN (SELECT count(*) AS num_records, acct_id FROM prod_billing_letters GROUP BY acct_id ) dl ON ad.acct_id = dl.acct_id;

What happens is, this runs for 20+ seconds, and then starts returning ~100 records every 10 seconds, which at the volume I'm talking about (>500k) would take forever to return. And next up I need to bolt on the counts for some other, bigger tables too, so we're quickly approaching heat-death-of-the-universe amounts of runtime here.

Now what's interesting is, if I relax the join on the first two tables, i.e. drop the address part of the join and only do the acct_id one,

(4)

SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
    INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id 
    LEFT JOIN (SELECT count(*) AS num_records, acct_id FROM prod_billing_letters GROUP BY acct_id ) dl ON ad.acct_id = dl.acct_id;

That starts returning immediately, and spools 695k records in 4.7 seconds. The problem is that we need the address part of the join in there, because there are some duplicate acct_ids on both sides, so without it we get a bunch of cartesian-join junk (hence 695k rows instead of 532k).

--- EXPLAIN query plans ---

So here's the Explain for query (1), with just the two tables:

Gather  (cost=102995.20..123944.32 rows=1 width=41)
  Workers Planned: 2
  ->  Parallel Hash Join  (cost=101995.20..122944.22 rows=1 width=41)
        Hash Cond: ((ad.acct_id = acx.acct_id) AND (ad.address = (acx.match_input_address)::text))
        ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
        ->  Parallel Hash  (cost=97096.88..97096.88 rows=224288 width=26)
              ->  Parallel Seq Scan on prod_customer_profiles acx  (cost=0.00..97096.88 rows=224288 width=26)

Now, when we add on the third table, its strategy for that join changes. Here's the Explain for (2), with the inner join that works:

Gather  (cost=22500.40..83636.43 rows=1 width=49)
  Workers Planned: 2
  ->  Nested Loop  (cost=21500.40..82636.33 rows=1 width=49)
        ->  Hash Join  (cost=21500.40..40427.74 rows=56285 width=54)
              Hash Cond: (ad.acct_id = prod_billing_letters.acct_id)
              ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
              ->  Hash  (cost=19345.40..19345.40 rows=123920 width=15)
                    ->  GroupAggregate  (cost=0.42..18106.20 rows=123920 width=15)
                          Group Key: prod_billing_letters.acct_id
                          ->  Index Only Scan using prod_billing_letters_acct_id_idx on prod_billing_letters  (cost=0.42..14269.36 rows=519529 width=7)
        ->  Index Scan using prod_customer_profiles_match_input_address_idx on prod_customer_profiles acx  (cost=0.00..0.74 rows=1 width=26)
              Index Cond: ((match_input_address)::text = ad.address)
              Filter: (ad.acct_id = acct_id)

But change it to the LEFT JOIN of query (3), the problem child, and the plan becomes:

Nested Loop Left Join  (cost=102995.63..144838.72 rows=1 width=49)
  Join Filter: (ad.acct_id = prod_billing_letters.acct_id)
  ->  Gather  (cost=102995.20..123944.32 rows=1 width=41)
        Workers Planned: 2
        ->  Parallel Hash Join  (cost=101995.20..122944.22 rows=1 width=41)
              Hash Cond: ((ad.acct_id = acx.acct_id) AND (ad.address = (acx.match_input_address)::text))
              ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
              ->  Parallel Hash  (cost=97096.88..97096.88 rows=224288 width=26)
                    ->  Parallel Seq Scan on prod_customer_profiles acx  (cost=0.00..97096.88 rows=224288 width=26)
  ->  GroupAggregate  (cost=0.42..18106.20 rows=123920 width=15)
        Group Key: prod_billing_letters.acct_id
        ->  Index Only Scan using prod_billing_letters_acct_id_idx on prod_billing_letters  (cost=0.42..14269.36 rows=519529 width=7)

Just reading that, I can't readily tell why #2 finishes fast while #3 runs without end. And if the difference were just the use of a LEFT JOIN to the third table, why does #4 also finish easily, with only a single-column join between the first two tables?

This has stymied the rest of my experimentation with it, so I'd really appreciate some insight.
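One workaround I'm considering (a sketch, not yet benchmarked against this data) is forcing the per-account counts to be computed once up front with a materialized CTE, so the aggregate can't end up on the rescanned side of a nested loop. The MATERIALIZED keyword requires PostgreSQL 12+:

```sql
WITH dl AS MATERIALIZED (
    -- compute the counts exactly once, before the outer joins
    SELECT acct_id, count(*) AS num_records
    FROM prod_billing_letters
    GROUP BY acct_id
)
SELECT ad.acct_id, ad.entity_name, ad.address,
       acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx
        ON ad.acct_id = acx.acct_id
       AND ad.address = acx.match_input_address
LEFT JOIN dl ON ad.acct_id = dl.acct_id;
```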

Upvotes: 0

Views: 50

Answers (1)

MatBailie

Reputation: 86798

I can't see anything obviously wrong, but I can suggest an alternative approach to trial...

SELECT
  ad.acct_id,
  ad.entity_name,
  ad.address,
  acx.flag_infobase_match,
  (
    SELECT COUNT(*) AS num_records
      FROM prod_billing_letters
     WHERE acct_id = ad.acct_id
  )
    AS dl_count
FROM
  prod_account_details     ad
INNER JOIN
  prod_customer_profiles   acx
    ON  ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address

Upvotes: 1
