Cesar Bolanos

Reputation: 13

Postgres doing a sort on simple join

I have two tables in my database (address and person_address). address has a PK on address_id; person_address has a PK on (address_id, person_id, usage_code).

When joining these two tables on address_id, my expectation is that the PK index is used in both cases. However, Postgres is adding Sort and Materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id) and analyzing stats, without success.

I would appreciate any help isolating this issue, since these queries run slower than expected in our production environment.

This is the query:

select * 
from person_addresses pa
join address a
on pa.address_id = a.address_id

This is the plan:

Merge Join  (cost=1506935.96..2416648.39 rows=16033774 width=338)
  Merge Cond: (pa.address_id = ((a.address_id)::numeric))
  ->  Index Scan using person_addresses_pkey on person_addresses pa  (cost=0.43..592822.76 rows=5256374 width=104)
  ->  Materialize  (cost=1506935.53..1526969.90 rows=4006874 width=234)
        ->  Sort  (cost=1506935.53..1516952.71 rows=4006874 width=234)
              Sort Key: ((a.address_id)::numeric)
              ->  Seq Scan on address a  (cost=0.00..163604.74 rows=4006874 width=234)

Thanks.

Edit 1. After the comment, I checked the data types and found a discrepancy. Fixing the data type changed the plan to the following:

Hash Join  (cost=343467.18..881125.47 rows=5256374 width=348)
  Hash Cond: (pa.address_id = a.address_id)
  ->  Seq Scan on person_addresses pa  (cost=0.00..147477.74 rows=5256374 width=104)
  ->  Hash  (cost=159113.97..159113.97 rows=4033697 width=244)
        ->  Seq Scan on address_normalization a  (cost=0.00..159113.97 rows=4033697 width=244)

The performance improvement is evident in the plan, but I am wondering whether the sequential scans are expected given that there are no filters.
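For anyone hitting the same symptom: a sketch of how the type discrepancy can be spotted, by comparing the declared types of the join columns in the standard `information_schema` catalog (table and column names taken from the question):

```sql
-- List the declared type of address_id in every table that has it;
-- a mismatch (e.g. integer vs numeric) forces a cast in the join.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE column_name = 'address_id';
```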

Upvotes: 1

Views: 651

Answers (1)

user330315

So there are two questions here:

Why did Postgres choose the (expensive) "Merge Join" in the first query?

The reason is that it could not use the more efficient "Hash Join", because the hash values of integer and numeric representations of the same number are different. A merge join, in turn, requires its inputs to be sorted, and that's where the "Sort" step in the first execution plan comes from. Given the number of rows, a "Nested Loop" would have been even more expensive.
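The cast is visible in the first plan's merge condition, `(a.address_id)::numeric`. A minimal sketch of the fix, assuming person_addresses.address_id was declared numeric while address.address_id is integer (which side carried the wrong type is an assumption; the question only says a discrepancy was found and fixed):

```sql
-- Align the column types so the planner can hash the join keys directly,
-- instead of casting one side to numeric on every comparison.
ALTER TABLE person_addresses
    ALTER COLUMN address_id TYPE integer
    USING address_id::integer;
```

After such a change, re-running ANALYZE on the table is a good idea so the planner works with fresh statistics.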

The second question is:

I am wondering whether the sequential scans are expected without any filters.

Yes, they are expected. The query retrieves all matching rows from both tables, and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per retrieved row, while a sequential scan usually requires less than one I/O operation per row, because one block (the smallest unit the database reads from disk) contains multiple rows.

You can run explain (analyze, buffers) to see how many "logical reads" each step performs.
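Applied to the query from the question, that looks like:

```sql
-- ANALYZE actually executes the query; BUFFERS adds per-node buffer counts
-- ("shared hit" = blocks found in cache, "read" = blocks fetched from disk).
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM person_addresses pa
JOIN address a ON pa.address_id = a.address_id;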

Upvotes: 1
