Reputation: 5780
Using PG 9.5, I have a query that joins the rubber
table's FK column to the fuzzy
table's primary id
column. Both columns are indexed with a standard btree index. The rubber
table has over 230MM rows, fuzzy
has over 25MM. When I do a join on these tables and apply a constraint a column in fuzzy
, PG keeps using a sequential scan in the join, and the query takes about 2 minutes.
SELECT * FROM rubber r
JOIN fuzzy fp ON fp.id = r.fuzzy_id
WHERE fp.bean_num IN (73470871);
I've narrowed it down to the join being the sequential, slow part of the query. Namely, the following is very fast, and uses the index:
SELECT * FROM rubber WHERE fuzzy_id = 12345
But when I try something like this, it's just as slow as the JOIN query above:
SELECT * FROM rubber WHERE fuzzy_id IN (
SELECT id FROM fuzzy WHERE bean_num IN (73470871)
);
I'm suspecting that it has to do with the query planner not being able (deciding not?) to use the index when trying to match on some set of foreign keys. The foreign keys are not unique, but not highly duplicated, and none are set to null, so I couldn't take advantage of something like a partial index.
-- 231MM rows
CREATE TABLE rubber (
id bigint DEFAULT nextval('rubber_id_seq1'::regclass) PRIMARY KEY,
context_id integer NOT NULL REFERENCES context(id) ON DELETE CASCADE,
fuzzy_id integer REFERENCES fuzzy(id),
);
CREATE UNIQUE INDEX rubber_pkey1 ON rubber(id int8_ops);
CREATE INDEX rubber_context_id_idx1 ON rubber(context_id int4_ops);
CREATE INDEX rubber_fingerprint_id_idx1 ON rubber(fingerprint_id int4_ops);
CREATE INDEX rubber_conclusion_id_idx1 ON rubber(conclusion_id int4_ops);
CREATE UNIQUE INDEX rubber_id_idx ON rubber(id int8_ops);
CREATE INDEX rubber_fuzzy_id_idx1 ON rubber(fuzzy_id int4_ops);
-- 26.5MM rows
CREATE TABLE fuzzy (
id SERIAL PRIMARY KEY,
trip_id integer NOT NULL REFERENCES trip(id),
device_id integer NOT NULL REFERENCES device(id),
chirp_vision_id integer NOT NULL REFERENCES chirp_vision(id),
mode_id integer NOT NULL REFERENCES mode(id),
fig_id integer NOT NULL REFERENCES fig(id),
gist_id integer NOT NULL REFERENCES gist(id),
bean_num integer REFERENCES bean_num(id),
key_path jsonb NOT NULL,
CONSTRAINT fingerprint_tuple UNIQUE (chirp_vision_id, gist_id, key_path, trip_id, fig_id, device_id, mode_id)
);
CREATE UNIQUE INDEX fuzzy_pkey ON fuzzy(id int4_ops);
CREATE INDEX fuzzy_fig_id_idx ON fuzzy(fig_id int4_ops);
CREATE INDEX fuzzy_gist_id_idx ON fuzzy(gist_id int4_ops);
CREATE INDEX fuzzy_bean_num_idx ON fuzzy(bean_num int4_ops);
CREATE UNIQUE INDEX fingerprint_tuple ON fuzzy(chirp_vision_id int4_ops,gist_id int4_ops,key_path jsonb_ops,trip_id int4_ops,fig_id int4_ops,device_id int4_ops,mode_id int4_ops);
EXPLAIN (BUFFERS,ANALYZE)
:"QUERY PLAN"
"Hash Join (cost=5288.99..6339911.22 rows=15277 width=189) (actual time=82319.995..136625.784 rows=483 loops=1)"
" Hash Cond: (r.fuzzy_id = fp.id)"
" Buffers: shared hit=599 read=3151247"
" -> Seq Scan on rubber r (cost=0.00..5466479.88 rows=231463888 width=80) (actual time=0.078..117561.885 rows=231463887 loops=1)"
" Buffers: shared hit=597 read=3151244"
" -> Hash (cost=5267.11..5267.11 rows=1750 width=109) (actual time=2.251..2.251 rows=23 loops=1)"
" Buckets: 2048 Batches: 1 Memory Usage: 20kB"
" Buffers: shared hit=2 read=3"
" -> Index Scan using fuzzy_bean_num_idx on fuzzy fp (cost=0.44..5267.11 rows=1750 width=109) (actual time=2.220..2.244 rows=23 loops=1)"
" Index Cond: (bean_num = 73470871)"
" Buffers: shared hit=2 read=3"
"Planning time: 0.382 ms"
"Execution time: 136625.875 ms"
Is there a way to get better performance out of a query like this?
There is also an interesting comment in a dba stack exchange comment, suggesting that an index on (fuzzy_id, bean_num) would help, but I don't understand how that would help.
UPDATE: I've migrated to PG 12.3 and this query runs in a couple hundred milliseconds now.
Upvotes: 3
Views: 3627
Reputation: 127521
Question: Why did you create 2 (almost) identical indexes on rubber.id:
CREATE UNIQUE INDEX rubber_pkey1 ON rubber(id int8_ops);
CREATE UNIQUE INDEX rubber_id_idx ON rubber(id int8_ops);
Advice: DROP INDEX rubber_id_idx;
An index that might be very useful for the JOIN, to give the planner better information about the relation between these tables, is this one:
CREATE INDEX fuzzy_bean_num_idx_2 ON fuzzy(bean_num, id);
You might need a different (better) setting for the number of statistics as well. Maybe for just one table, maybe both, maybe the entire system.
Edit: After changing the settings for the statistics, you have to run ANALYZE
for these tables to update the statistics.
Offtopic: Version 9.5 is old and will be EOL within the next few months. Newer versions do behave different and might also solve this performance problem.
Upvotes: 2