Freddy

Reputation: 876

Efficient JOIN in Postgres

I have three large-ish tables: A (10 million rows), A_C (60 million rows) and C (6 rows). They are related via bigint fields with primary/foreign key relations. A and C have an n:m relation, which is why I added the junction table A_C. The schema is as follows:

A.id = A_C.a_id and C.id = A_C.c_id

A is the primary table here. I want to display the first 100 entries from A together with all corresponding rows from C via A_C. Currently the query takes very long. My first approach was two simple joins. Running EXPLAIN ANALYZE, I found that Postgres does a sequential scan on A. I think this is because it applies the LIMIT last and simply disregards it during planning. This performs very poorly. Is there any way to work around this, or another solution I am missing? When I perform a LIMIT on a simple SELECT query, it completes in ~2 ms.

Edit: A.id is the primary key and has a btree index. A_C has a unique constraint over the entire row, plus indexes on A_C.a_id and A_C.c_id. I have not yet joined C, as the query is already really slow without it.
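To make that concrete, the definitions amount to roughly the following (table and column names assumed from the description; the real DDL may differ):

CREATE TABLE a (
    id bigint PRIMARY KEY          -- btree index comes with the primary key
);

CREATE TABLE a_c (
    a_id bigint NOT NULL REFERENCES a (id),
    c_id bigint NOT NULL,          -- relates to c (id)
    UNIQUE (a_id, c_id)            -- unique constraint over the entire row
);

CREATE INDEX ON a_c (a_id);        -- index on A_C.a_id
CREATE INDEX ON a_c (c_id);        -- index on A_C.c_id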

Fast queries:
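For example, a plain limited select (select list assumed) returns in about 2 ms:

SELECT * FROM A LIMIT 100;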

Slow query:
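Something along these lines (select list assumed; C not even joined yet):

SELECT A.id, A_C.c_id
FROM A
JOIN A_C ON A.id = A_C.a_id
LIMIT 100;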

That join is fast in query planning but slow in execution, since the amount of data involved is very large.

I tried to avoid that by setting a limit on how many rows from A are retrieved:

SELECT a.id FROM (SELECT A.id AS id FROM A LIMIT 100) a JOIN A_C ON (a.id = A_C.a_id);

The query plan via EXPLAIN ANALYZE looks as follows:

"Hash Join  (cost=5.75..1305897.53 rows=44790854 width=8) (actual time=251969.648..251969.648 rows=0 loops=1)"
"  Hash Cond: (a_c.a_id = a.id)"
"  ->  Seq Scan on a_c  (cost=0.00..690017.54 rows=44790854 width=8) (actual time=0.472..126203.253 rows=44790000 loops=1)"
"  ->  Hash  (cost=4.50..4.50 rows=100 width=8) (actual time=1.889..1.889 rows=100 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 12kB"
"        ->  Limit  (cost=0.00..3.50 rows=100 width=8) (actual time=0.527..1.550 rows=100 loops=1)"
"              ->  Seq Scan on a  (cost=0.00..261275.88 rows=7465088 width=8) (actual time=0.520..0.912 rows=100 loops=1)"
"Planning time: 16.350 ms"
"Execution time: 251969.693 ms"

Upvotes: 1

Views: 1907

Answers (1)

Gordon Linoff

Reputation: 1269773

I would suggest a subquery for the rows from a, so the limit (with an explicit ordering) is applied before the joins:

select . . .
from (select a.*
      from a
      order by a.id
      fetch first 100 rows only    -- or limit 100
     ) a join
     a_c ac
     on ac.a_id = a.id join
     c
     on ac.c_id = c.id;

Even though c is really small, I would encourage a primary key on c(id) if you don't already have one.
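In Postgres that is a one-liner (assuming the column really is c.id):

ALTER TABLE c ADD PRIMARY KEY (id);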

This assumes that "the first hundred rows" means by the ordering of the primary key.

Upvotes: 2
