Stussa

Reputation: 3415

Postgres Proper Index for Sorting and Join

I have a simple schema and query, but am experiencing consistent awful performance with certain parameters.

Schema:

CREATE TABLE locations (
  id integer NOT NULL,
  barcode_id integer NOT NULL
);

CREATE TABLE barcodes (
  id integer NOT NULL,
  value citext NOT NULL
);

ALTER TABLE ONLY locations ADD CONSTRAINT locations_pkey PRIMARY KEY (id);
ALTER TABLE ONLY barcodes ADD CONSTRAINT barcodes_pkey PRIMARY KEY (id);
ALTER TABLE ONLY locations ADD CONSTRAINT fk_locations_barcodes FOREIGN KEY (barcode_id) REFERENCES barcodes(id);

CREATE INDEX index_barcodes_on_value ON barcodes (value);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);

Query:

EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;

Analysis:

Limit  (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms

500+ ms for tables of this size (500,000 barcodes and 60,000 locations) doesn't make sense to me. Can I do anything to improve the performance?

Note:

Even stranger, the execution time depends on the data. While drafting this question I tried to reproduce the problem with seeded random data, but the seeded data turns out to perform well:

Seed:

INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
INSERT INTO locations (id, barcode_id) SELECT seed.id, (RANDOM() * 500000)  FROM generate_series(1,60000) AS seed(id);

Analysis:

Limit  (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms

Edit:

Analysis of the tables:

ANALYZE VERBOSE barcodes;
INFO:  analyzing "public.barcodes"
INFO:  "barcodes": scanned 2760 of 2760 pages, containing 496157 live 
rows and 0 dead rows; 30000 rows in sample, 496157 estimated total rows
ANALYZE
Time: 62.937 ms

ANALYZE VERBOSE locations;
INFO:  analyzing "public.locations"
INFO:  "locations": scanned 254 of 254 pages, containing 57394 live rows 
and 0 dead rows; 30000 rows in sample, 57394 estimated total rows
ANALYZE
Time: 21.447 ms

Upvotes: 3

Views: 190

Answers (1)

Laurenz Albe

Reputation: 246268

The problem is that the barcodes with low values have no matches in locations, which PostgreSQL cannot know. So its plan, to fetch the barcodes in output order via the index and join to locations until it has found 50 result rows, turns out to be much worse than it expected.
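One way to confirm this on your data is to check how many of the lowest barcode values have a location at all. Something along these lines would do (just a sketch; the LIMIT of 1000 is an arbitrary sample size):

-- Of the 1,000 barcodes that sort first by value,
-- how many have at least one matching location?
SELECT count(*)
FROM (SELECT id FROM barcodes ORDER BY value LIMIT 1000) AS b
WHERE EXISTS (SELECT 1 FROM locations WHERE locations.barcode_id = b.id);

If that count is very low, the nested loop has to walk far into index_barcodes_on_value before it can emit 50 joined rows, which matches the rows=372138 on the barcodes index scan in your plan.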

I would ANALYZE the tables and

DROP INDEX index_barcodes_on_value;

That should keep PostgreSQL from choosing that plan.
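Spelled out, the whole experiment might look like this (a sketch: refresh the statistics, drop the index, then re-run your EXPLAIN ANALYZE to see which plan the optimizer picks instead):

ANALYZE barcodes;
ANALYZE locations;

DROP INDEX index_barcodes_on_value;

EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;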

I don't know what plan PostgreSQL will choose then. For a nested loop the following index might help:

CREATE INDEX ON locations(id);

Upvotes: 2
