bigpotato

Reputation: 27517

Postgres DB: SQL statement in production is taking much longer than locally

I am running Postgres 11.2 on Heroku in production and Postgres 11.2 locally.

I have one specific query that's timing out in production.

This is an example of the query:

SELECT u0.id, u0.app, u0.email, u0.gender, u0.fcm_token,
u0.flagged_user_ids, u0.flags, u0.gems_balance, u0.hearts_balance,
u0.hearts_received, u0.last_hearts_refreshed, u0.looking, u0.matches,
u0.meta_tags, u0.name, u0.online, u0.profile_image, u0.purchase_count,
u0.recents_history, u0.inserted_at, u0.updated_at
FROM users AS u0 WHERE ((u0.email != '[email protected]') AND (u0.looking = TRUE))
AND (NOT (u0.email IN ('[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]')))
AND (NOT ('apple_review' = ANY(u0.meta_tags)))
AND (NOT ('[email protected]' = ANY(u0.matches)))
AND (NOT (584065 = ANY(u0.flagged_user_ids)))
ORDER BY u0.updated_at LIMIT 1;

When I download the production DB locally with pg_dump, restore it, open a SQL console, and run the same query locally, it's super fast.
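(Roughly like this; the connection URL and the local database name are placeholders, not my real ones:)

pg_dump --format=custom "$PRODUCTION_DATABASE_URL" > prod.dump
pg_restore --clean --no-owner --dbname=myapp_dev prod.dump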

Production: Time: 46307.569 ms (00:46.308) 46 seconds!

Local: Time: 176.106 ms <1 second!

I have verified that my local copy of that table is identical to the one in production.

Production:

Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_app_index" btree (app)
    "users_email_index" btree (email)
    "users_matches_index" btree (matches)
Referenced by:
    TABLE "chat_users" CONSTRAINT "chat_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
    TABLE "messages" CONSTRAINT "messages_from_id_fkey" FOREIGN KEY (from_id) REFERENCES users(id)

Local:

Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_app_index" btree (app)
    "users_email_index" btree (email)
    "users_matches_index" btree (matches)
Referenced by:
    TABLE "chat_users" CONSTRAINT "chat_users_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)
    TABLE "messages" CONSTRAINT "messages_from_id_fkey" FOREIGN KEY (from_id) REFERENCES users(id)

I don't understand why the one in production is taking so much longer. Is there a way to debug this?

UPDATE:

As requested, here are the execution plans of the two:
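(Both were captured by prefixing the query with EXPLAIN (ANALYZE, BUFFERS), roughly like this; u0.* stands in for the full column list shown above:)

EXPLAIN (ANALYZE, BUFFERS)
SELECT u0.*
FROM users AS u0 WHERE ((u0.email != '[email protected]') AND (u0.looking = TRUE))
AND (NOT (u0.email IN ('[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]')))
AND (NOT ('apple_review' = ANY(u0.meta_tags)))
AND (NOT ('[email protected]' = ANY(u0.matches)))
AND (NOT (584065 = ANY(u0.flagged_user_ids)))
ORDER BY u0.updated_at LIMIT 1;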

Production:

QUERY PLAN                                                                                                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=443802.88..443802.88 rows=1 width=834) (actual time=40956.176..40956.176 rows=0 loops=1)
   Buffers: shared hit=5074 read=431010 dirtied=20
   I/O Timings: read=39940.500
   ->  Sort  (cost=443802.88..443802.88 rows=1 width=834) (actual time=40956.174..40956.175 rows=0 loops=1)
         Sort Key: updated_at
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=5074 read=431010 dirtied=20
         I/O Timings: read=39940.500
         ->  Seq Scan on users u0  (cost=0.00..443802.88 rows=1 width=834) (actual time=40956.166..40956.166 rows=0 loops=1)
               Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
               Rows Removed by Filter: 583206
               Buffers: shared hit=5074 read=431010 dirtied=20
               I/O Timings: read=39940.500
 Planning Time: 0.259 ms
 Execution Time: 40956.221 ms
(15 rows)

Local:

QUERY PLAN                                                                                                                                                                                   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78520.41..78520.41 rows=1 width=827) (actual time=164.434..164.434 rows=0 loops=1)
   Buffers: shared hit=11616 read=51044
   ->  Sort  (cost=78520.41..78520.41 rows=1 width=827) (actual time=164.433..164.433 rows=0 loops=1)
         Sort Key: updated_at
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=11616 read=51044
         ->  Gather  (cost=1000.00..78520.40 rows=1 width=827) (actual time=164.414..167.228 rows=0 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=11616 read=51044
               ->  Parallel Seq Scan on users u0  (cost=0.00..77520.30 rows=1 width=827) (actual time=159.986..159.986 rows=0 loops=3)
                     Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
                     Rows Removed by Filter: 194158
                     Buffers: shared hit=11616 read=51044
 Planning Time: 0.376 ms
 Execution Time: 167.300 ms
(16 rows)


UPDATE 2

Production:

SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'users';

 reltuples | relpages
-----------+----------
    582557 |   436084
(1 row)

Time: 27.967 ms

Local:

SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'users';

 reltuples | relpages
-----------+----------
    582281 |    62660
(1 row)

Time: 1.816 ms
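(The same difference shows up if you check the on-disk size directly; query only, output not pasted here:)

SELECT pg_size_pretty(pg_relation_size('users'))       AS table_size,
       pg_size_pretty(pg_total_relation_size('users')) AS table_plus_indexes;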

UPDATE 3:

After looking at the execution plans and snooping around, I thought the difference might be that production wasn't using a Parallel Seq Scan. However, I tried SET max_parallel_workers_per_gather TO 4; and the query is still very slow even with a Parallel Seq Scan. It looks like the I/O is what's taking so long:

Production:

Limit  (cost=439013.82..439013.82 rows=1 width=834) (actual time=36739.071..36739.071 rows=0 loops=1)
   Buffers: shared hit=5176 read=430908
   I/O Timings: read=182157.859
   ->  Sort  (cost=439013.82..439013.82 rows=1 width=834) (actual time=36739.069..36739.069 rows=0 loops=1)
         Sort Key: updated_at
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=5176 read=430908
         I/O Timings: read=182157.859
         ->  Gather  (cost=1000.00..439013.82 rows=1 width=834) (actual time=36739.061..36744.559 rows=0 loops=1)
               Workers Planned: 4
               Workers Launched: 4
               Buffers: shared hit=5176 read=430908
               I/O Timings: read=182157.859
               ->  Parallel Seq Scan on users u0  (cost=0.00..438013.72 rows=1 width=834) (actual time=36727.506..36727.506 rows=0 loops=5)
                     Filter: (looking AND ((email)::text <> '[email protected]'::text) AND ((email)::text <> ALL ('{[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]}'::text[])) AND ('apple_review'::text <> ALL ((meta_tags)::text[])) AND ('[email protected]'::text <> ALL ((matches)::text[])) AND (584065 <> ALL (flagged_user_ids)))
                     Rows Removed by Filter: 116645
                     Buffers: shared hit=5176 read=430908
                     I/O Timings: read=182157.859
 Planning Time: 0.265 ms
 Execution Time: 36744.622 ms
(20 rows)

Upvotes: 2

Views: 545

Answers (1)

Laurenz Albe

Reputation: 246758

Some observations (not your major problem):

  • The two tables contain roughly the same number of rows (about 582000 each, according to reltuples); the local plan only looks like it scans fewer rows because parallel plans report per-worker averages (194158 removed rows per worker over 3 loops, about 582000 in total). Either way, the amount of data cannot account for a slowdown by a factor of 250.

  • On the slow machine, no parallel workers were planned, even though the table is bigger. Could it be that you are running a PostgreSQL version older than 9.6 on the slow server? If not, perhaps parallelism is disabled there (max_parallel_workers_per_gather or max_parallel_workers set to 0, or some other relevant configuration parameter changed); a quick way to check is shown right below. But that also cannot account for a slowdown by a factor of 250.
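A minimal way to check, assuming you can open a psql session against the production database (all read-only commands):

SELECT version();
SHOW max_parallel_workers_per_gather;   -- 0 disables parallel query
SHOW max_parallel_workers;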

Your real problem is the following:

  • The sequential scan has to read almost all of its blocks from secondary storage on the slow machine. EXPLAIN (ANALYZE, BUFFERS) shows that nearly every block is read rather than found in shared buffers, and since you have track_io_timing enabled, we know that 40 of the 41 seconds the query takes are spent doing I/O.
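For reference, the "I/O Timings" lines only appear when track_io_timing is on; checking and enabling it looks roughly like this (changing it needs superuser rights, which managed hosting may not give you):

SHOW track_io_timing;                    -- must be 'on' for EXPLAIN to print "I/O Timings"
ALTER SYSTEM SET track_io_timing = on;   -- superuser only; then reload the configuration
SELECT pg_reload_conf();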

This is exacerbated by the following two points:

  • The table on the slow system is badly bloated: it holds about the same number of tuples as your local copy, but in roughly seven times as many pages (436084 versus 62660 relpages), so about 6/7 of it is empty space and the query has to read about seven times as many blocks as necessary.

    You might consider running VACUUM (FULL) users the next time you can afford some downtime, and reducing autovacuum_vacuum_cost_delay so that autovacuum keeps up in the future (a rough sketch follows after this list).

  • Perhaps a slow I/O subsystem on the slow server contributes to the problem.
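A minimal sketch of those two maintenance steps, assuming you can take a maintenance window (VACUUM (FULL) takes an ACCESS EXCLUSIVE lock and needs enough free disk space for a complete copy of the table; the cost-delay value is only an example):

-- rewrite the table without the dead space (blocks all access to it while it runs)
VACUUM (FULL, ANALYZE) users;

-- make autovacuum more aggressive on this table so the bloat does not build up again
-- (a per-table setting, since managed platforms usually do not let you run ALTER SYSTEM)
ALTER TABLE users SET (autovacuum_vacuum_cost_delay = 2);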

Your WHERE conditions are all written in a way that cannot make use of indexes. This is really a query that requires a sequential scan.

You can throw memory at the problem (at least 4GB) and use pg_prewarm to load the table into cache. Or you can get really fast SSD storage. I don't think there is much more you can do.
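For example (pg_prewarm is a contrib extension that ships with PostgreSQL; whether you can install it, and whether shared_buffers is large enough to hold the table, depends on your plan):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('users');   -- reads the whole table into shared buffers, returns the number of blocks read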

Upvotes: 2
