user3186332

Reputation: 370

Postgres Differing Query Plans Prod/QA

I have two database servers, both currently in "test" mode, with one expected to be promoted to production. Because of that, the specs differ, as do some of the configs, but we are finding that the underpowered server produces better query plans and thus faster queries.

Stats:

Both systems hold roughly the same data; the sizes break down as follows:

Size     |  Part
--------------------
1.47 TB  |  Entire DB
871 GB   |  Tables
635 GB   |  Indexes

The bigger db server has the following specs:

RAM: 500 GB

CPU: 16 Cores 2.0 GHz Intel

Using SSDs

Postgres 10.0

Memlock set to reserve 485 GB specifically for Postgres

Postgres Settings:

shared_buffers: 125 GB

work_mem: 36 MB

effective_cache_size: 300 GB

random_page_cost: 1

default_statistics_target: 1000

Query Plan: https://explain.depesz.com/s/9Ww6

The smaller server has the following specs:

RAM: 281 GB

CPU: 4 Cores 2.0 GHz Intel

Using SSDs

Postgres 10.0

Memlock set to reserve 240 GB specifically for Postgres

Postgres Settings:

shared_buffers: 50 GB

work_mem: 25.6 MB

effective_cache_size: 150 GB

random_page_cost: 4

default_statistics_target: 100

Query Plan: https://explain.depesz.com/s/4WUH
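
Both settings lists can be sanity-checked against the live values with a standard catalog query (pg_settings is a built-in view; the name list simply mirrors the settings above):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'default_statistics_target');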

We've tried matching random_page_cost, default_statistics_target (followed by ANALYZE), and work_mem between the two servers. The biggest gain came after running VACUUM FULL on all tables involved in the query.
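
For reference, the session-level form of those experiments looked roughly like this (a sketch; showtimes.mappable_program stands in for each table in the query):

-- Try the other server's planner settings for this session only
SET random_page_cost = 4;         -- the smaller server's value
SET work_mem = '26214kB';         -- ~25.6 MB, matching the smaller server

-- The biggest win: rewrite the table (and its indexes) to drop bloat;
-- note that VACUUM FULL takes an exclusive lock while it runs
VACUUM FULL showtimes.mappable_program;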

Workload: This machine is a read replica used to extract the data into XML files and similar exports, so it receives replicated data and carries a fairly heavy read load.

Question: What should I be looking for to make this query more performant on the larger server, where it currently runs slower? Ideally this query should run much faster there than on the smaller server. It appears that as we have scaled, we have failed to adjust the settings to take advantage of our hardware. There must be something I'm overlooking.

Edit: I have put up the non-obfuscated plans. I have also tried bumping the statistics target from 1000 to 3000, and it didn't help the plans. The same goes for changing random_page_cost to match between the servers.
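
The statistics bump was applied server-wide, roughly like this (an ALTER SYSTEM sketch; a plain SET before ANALYZE would work for a one-off test):

ALTER SYSTEM SET default_statistics_target = 3000;
SELECT pg_reload_conf();  -- apply without a restart
ANALYZE;                  -- re-gather statistics with the new target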

Upvotes: 1

Views: 250

Answers (1)

Laurenz Albe

Reputation: 246093

The PostgreSQL configuration on the two machines is quite different, so it is not surprising that the query plans differ. random_page_cost in particular has a big impact.

You definitely should benchmark your workload with different shared_buffers settings: yours is probably way too high (normally, 8GB is appropriate).
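
Since shared_buffers cannot be changed at runtime, each benchmark run needs a restart; a minimal sketch, using the 8GB figure above as a starting point:

ALTER SYSTEM SET shared_buffers = '8GB';
-- a restart is required for shared_buffers to take effect, e.g.:
--   pg_ctl restart -D /path/to/data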

But I think that both your query plans are terrible, and your problem is in a different spot.

The optimizer misestimates the number of rows returned from an index scan on showtimes.mappable_program, and this misestimate causes even worse misestimates and bad plan choices later on.
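
You can see what the planner currently knows about that column through the standard pg_stats view:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'showtimes'
  AND tablename  = 'mappable_program'
  AND attname    = 'mapping_scheme_id';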

Try increasing the density of the statistics on the column:

ALTER TABLE showtimes.mappable_program ALTER mapping_scheme_id
   SET STATISTICS 1000;

Then ANALYZE the table.
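
That is:

ANALYZE showtimes.mappable_program;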

If that doesn't do the trick, modify the query by replacing

WHERE COALESCE(mp2.ignore::integer, 0) = 0

with

WHERE mp2.ignore = '0' OR mp2.ignore IS NULL

That might help the optimizer estimate the condition better.
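
To check whether the estimate for that condition improves, you can EXPLAIN the predicate in isolation and compare estimated with actual row counts (assuming mp2 refers to showtimes.mappable_program; the standalone SELECT is only for testing):

EXPLAIN (ANALYZE)
SELECT count(*)
FROM showtimes.mappable_program mp2
WHERE mp2.ignore = '0' OR mp2.ignore IS NULL;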

Upvotes: 1
