tolek
tolek

Reputation: 95

Postgresql Alter Table freezes DB (CPU is high-loaded by some Select)

PostgreSQL 9.6

  1. I do

    ALTER TABLE "Users"
       ADD COLUMN "someField" BOOLEAN NULL DEFAULT NULL;
    
  2. DB freezes for 10+ minutes.

  3. The table "Users" has 5,000 rows, but the database is very big (150+ Gb).

  4. Immediately, more than 20+ SELECT queries on Users table show up, (checked with):

    SELECT query FROM pg_stat_activity
    WHERE state = 'active' and query LIKE 'SELECT%'
    

    (before, there were no queries)

  5. These SELECT queries take all the CPU.

  6. I tried to restart the database and tried to do VACUUM ANALYZE on the "Users" table:

INFO:  vacuuming "public.Users"
INFO:  index "Users_pkey" now contains 5556 row versions in 86 pages
DETAIL:  0 index row versions were removed.
1 index pages have been deleted, 1 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.12 sec.
INFO:  index "users_subscription_canceled" now contains 5028 row versions in 508 pages
DETAIL:  0 index row versions were removed.
8 index pages have been deleted, 8 are currently reusable.
CPU 0.00s/0.00u sec elapsed 1.54 sec.
INFO:  index "users_shopper_id" now contains 5556 row versions in 205 pages
DETAIL:  0 index row versions were removed.
1 index pages have been deleted, 1 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.64 sec.
INFO:  index "users_referrer" now contains 5556 row versions in 586 pages
DETAIL:  0 index row versions were removed.
4 index pages have been deleted, 4 are currently reusable.
CPU 0.00s/0.00u sec elapsed 1.80 sec.
INFO:  index "users_referral_code" now contains 5556 row versions in 84 pages
DETAIL:  0 index row versions were removed.
2 index pages have been deleted, 2 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.27 sec.
INFO:  "Users": found 0 removable, 2137 nonremovable row versions in 803 out of 2780 pages
DETAIL:  445 dead row versions cannot be removed yet.
There were 25647 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 6.91 sec.
INFO:  "Users": stopping truncate due to conflicting lock request
INFO:  vacuuming "pg_toast.pg_toast_26460"
INFO:  index "pg_toast_26460_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_26460": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.Users"
INFO:  "Users": scanned 2780 of 2780 pages, containing 5539 live rows and 459 dead rows; 5539 rows in sample, 5539 estimated total rows
VACUUM
  1. If I do DROP COLUMN or RENAME COLUMN, the same points 4 & 5 happen - and database freezes.

Questions:

  1. Is this some kind of error with the database? Adding Nullable field should be really fast.

  2. Any recommendations are really welcome, I am tired of Googling and debugging it :)

Upvotes: 5

Views: 1188

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246238

This is normal behavior; the problem is your workload.

You are right that ALTER TABLE ... ADD COLUMN is a very fast operation. That is not the problem. The problem is that such an ALTER TABLE needs a short ACCESS EXCLUSIVE lock on the table, since it modifies the table structure.

Such an ACCESS EXCLUSIVE lock is not compatible with the ACCESS SHARE lock that a SELECT statement puts on the table. That is the purpose: how should a SELECT statement behave if the table changes while it is running?

Now the problem is that either your queries take a long time, or somebody forgot to close a transaction that has a lock on the table.

You can check this with

SELECT pid, a.state, a.xact_start
FROM pg_locks AS l
   JOIN pg_stat_activity AS a USING (pid)
WHERE l.relation = 'Users'::regclass;

That will show all transactions that have a lock on the table and when they started.

Now your ALTER TABLE has to wait until all these transactions are done, and all the short SELECT statements that are issued later have to queue behind the ALTER TABLE.

As soon as the ALTER TABLE gets the lock it needs, it will be done very quickly and release the lock on the table. Now all the other statements that have been queuing will be turned loose at the same time and create high load on your machine.

The solution consists of two parts:

  1. Fix the application so that it closes transactions right away.

  2. Reduce max_connections as much as possible by using a connection pool. Then the number of statements that can be blocked is limited, and there is less danger of overloading the machine.

Upvotes: 5

Related Questions