Reputation: 27517
I am running Postgres 11.2 on Heroku in production and Postgres 11.2 locally.
I have one specific query that's timing out in production.
This is an example of the query:
SELECT u0.id, u0.app, u0.email, u0.gender, u0.fcm_token,
u0.flagged_user_ids, u0.flags, u0.gems_balance, u0.hearts_balance,
u0.hearts_received, u0.last_hearts_refreshed, u0.looking, u0.matches,
u0.meta_tags, u0.name, u0.online, u0.profile_image, u0.purchase_count,
u0.recents_history, u0.inserted_at, u0.updated_at
FROM users AS u0 WHERE ((u0.email != '[email protected]') AND (u0.looking = TRUE))
AND (NOT (u0.email IN ('[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]')))
AND (NOT ('apple_review' = ANY(u0.meta_tags)))
AND (NOT ('[email protected]' = ANY(u0.matches)))
AND (NOT (584065 = ANY(u0.flagged_user_ids)))
ORDER BY u0.updated_at LIMIT 1;
When I download the production DB locally with pg_dump, open a SQL console, and run the same query locally, it's super fast.
Production:
Time: 46307.569 ms (00:46.308)
46 seconds!
Local:
Time: 176.106 ms
<1 second!
I have verified that my local copy of that table is identical to the one in production.
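(One way to sanity-check that, assuming "identical" means the same rows rather than the same on-disk layout, is to run something like this on both and compare:)

SELECT count(*), max(updated_at) FROM users;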
Production:
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"users_app_index" btree (app)
"users_email_index" btree (email)
"users_matches_index" btree (matches)
Referenced by:
TABLE "chat_users" CONSTRAINT "chat_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
TABLE "messages" CONSTRAINT "messages_from_id_fkey" FOREIGN KEY (from_id) REFERENCES users(id)
Local:
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"users_app_index" btree (app)
"users_email_index" btree (email)
"users_matches_index" btree (matches)
Referenced by:
TABLE "chat_users" CONSTRAINT "chat_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
TABLE "messages" CONSTRAINT "messages_from_id_fkey" FOREIGN KEY (from_id) REFERENCES users(id)
I don't understand why the one in production is taking so much longer. Is there a way to debug this?
UPDATE:
As requested, here are the execution plans for both:
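(Both plans were captured with EXPLAIN (ANALYZE, BUFFERS). A sketch of the statement, using u0.* in place of the full column list purely for brevity; the predicates are the same as in the query above:)

EXPLAIN (ANALYZE, BUFFERS)
SELECT u0.*
FROM users AS u0
WHERE u0.email != '[email protected]'
  AND u0.looking = TRUE
  AND NOT (u0.email IN ('[email protected]', '[email protected]'))  -- plus the same nine addresses as above
  AND NOT ('apple_review' = ANY(u0.meta_tags))
  AND NOT ('[email protected]' = ANY(u0.matches))
  AND NOT (584065 = ANY(u0.flagged_user_ids))
ORDER BY u0.updated_at
LIMIT 1;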
Production:
QUERY PLAN
------------------------------------------------------------
Limit (cost=443802.88..443802.88 rows=1 width=834) (actual time=40956.176..40956.176 rows=0 loops=1)
Buffers: shared hit=5074 read=431010 dirtied=20
I/O Timings: read=39940.500
-> Sort (cost=443802.88..443802.88 rows=1 width=834) (actual time=40956.174..40956.175 rows=0 loops=1)
Sort Key: updated_at
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5074 read=431010 dirtied=20
I/O Timings: read=39940.500
-> Seq Scan on users u0 (cost=0.00..443802.88 rows=1 width=834) (actual time=40956.166..40956.166 rows=0 loops=1)
Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
Rows Removed by Filter: 583206
Buffers: shared hit=5074 read=431010 dirtied=20
I/O Timings: read=39940.500
Planning Time: 0.259 ms
Execution Time: 40956.221 ms
(15 rows)
Local:
QUERY PLAN
------------------------------------------------------------
Limit (cost=78520.41..78520.41 rows=1 width=827) (actual time=164.434..164.434 rows=0 loops=1)
Buffers: shared hit=11616 read=51044
-> Sort (cost=78520.41..78520.41 rows=1 width=827) (actual time=164.433..164.433 rows=0 loops=1)
Sort Key: updated_at
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=11616 read=51044
-> Gather (cost=1000.00..78520.40 rows=1 width=827) (actual time=164.414..167.228 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=11616 read=51044
-> Parallel Seq Scan on users u0 (cost=0.00..77520.30 rows=1 width=827) (actual time=159.986..159.986 rows=0 loops=3)
Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
Rows Removed by Filter: 194158
Buffers: shared hit=11616 read=51044
Planning Time: 0.376 ms
Execution Time: 167.300 ms
(16 rows)
UPDATE 2:
Production:
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'users';
reltuples | relpages
-----------+----------
582557 | 436084
(1 row)
Time: 27.967 ms
Local:
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'users';
reltuples | relpages
-----------+----------
582281 | 62660
(1 row)
Time: 1.816 ms
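For context, at the default 8 kB block size those page counts correspond to roughly 3.3 GB on disk in production versus about 490 MB locally, for roughly the same ~582k rows. The on-disk size can also be read directly with a standard catalog query (nothing environment-specific assumed):

SELECT pg_size_pretty(pg_relation_size('users'));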
UPDATE 3:
After looking at the execution plans and snooping around, I thought it might be because production isn't using a Parallel Seq Scan (the local plan is parallel). However, I tried SET max_parallel_workers_per_gather TO 4; and it is still very slow even with a Parallel Seq Scan. It looks like the I/O is what's taking so long:
Production:
Limit (cost=439013.82..439013.82 rows=1 width=834) (actual time=36739.071..36739.071 rows=0 loops=1)
Buffers: shared hit=5176 read=430908
I/O Timings: read=182157.859
-> Sort (cost=439013.82..439013.82 rows=1 width=834) (actual time=36739.069..36739.069 rows=0 loops=1)
Sort Key: updated_at
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5176 read=430908
I/O Timings: read=182157.859
-> Gather (cost=1000.00..439013.82 rows=1 width=834) (actual time=36739.061..36744.559 rows=0 loops=1)
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=5176 read=430908
I/O Timings: read=182157.859
-> Parallel Seq Scan on users u0 (cost=0.00..438013.72 rows=1 width=834) (actual time=36727.506..36727.506 rows=0 loops=5)
Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
Rows Removed by Filter: 116645
Buffers: shared hit=5176 read=430908
I/O Timings: read=182157.859
Planning Time: 0.265 ms
Execution Time: 36744.622 ms
(20 rows)
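(Note that the 182157.859 ms of read time is accumulated across the 5 parallel processes (4 workers plus the leader), i.e. roughly 36.4 s per process on average, so essentially the entire 36.7 s runtime is spent waiting on reads.)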
Upvotes: 2
Views: 545
Reputation: 246758
Some observations (not your major problem):
The plan for the slow server reports about three times as many rows removed by the filter (583206 vs. 194158), but the local figure is per worker (loops=3), so both tables actually hold roughly the same ~582k rows; either way, row count cannot account for a slowdown by a factor of 250.
On the slow machine, no parallel workers were planned, even though the table is bigger. Could it be that you are running a PostgreSQL version older than 9.6 on the slow server? If not, maybe you disabled parallelism (max_parallel_workers_per_gather or max_parallel_workers set to 0, or some other relevant configuration parameter changed). But that also cannot account for a slowdown by a factor of 250.
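If you want to rule that out, a quick check on the production database (plain SQL/psql, nothing Heroku-specific assumed):

SHOW server_version;
SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;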
Your real problem is the following:
EXPLAIN (ANALYZE, BUFFERS) shows that almost all blocks are read from disk, and since you have track_io_timing enabled, we know that 40 of the 41 seconds the query takes are spent doing I/O. This is exacerbated by the following two points:
The table on the slow system is badly bloated: it holds about the same number of tuples as the local copy but is roughly seven times as big (436084 vs. 62660 pages), so about 6/7 of it is empty space and the query has to read seven times as many blocks as necessary. You might consider running VACUUM (FULL) users the next time you can afford some downtime, and reduce the setting of autovacuum_vacuum_cost_delay so the bloat does not build up again (see the sketch after these two points).
Perhaps a slow I/O subsystem on the slow server contributes to the problem.
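A minimal sketch of the VACUUM (FULL) suggestion above (it takes an ACCESS EXCLUSIVE lock and rewrites the whole table, so schedule it for a maintenance window; the per-table storage parameter shown is just one way to lower the cost delay):

VACUUM (FULL, VERBOSE) users;
-- let autovacuum throttle itself less on this table so bloat builds up more slowly
ALTER TABLE users SET (autovacuum_vacuum_cost_delay = 2);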
Your WHERE conditions are all written in a way that cannot make use of indexes. This is really a query that requires a sequential scan.
You can throw memory at the problem (at least 4GB) and use pg_prewarm to load the table into cache, or you can get really fast SSD storage. I don't think there is much more you can do.
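A sketch of the pg_prewarm route (assuming the extension is available on your plan; by default it loads the relation into shared buffers, so shared_buffers needs to be large enough to hold the table for this to help):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- read the whole users table into the PostgreSQL buffer cache
SELECT pg_prewarm('users');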
Upvotes: 2