rap-2-h
rap-2-h

Reputation: 31998

No errors, infinite loading: where should I search?

Here is the context, it just happened today:

Then, everything worked fine just after running sudo service postgresql restart.

Where should I search in such a case? Is it a "common" problem? With no logs, I don't know what to do. How can I be relatively "sure" it will not randomly happen anymore?

Side note: Here is the detail of the postgreSQL log after I restarted:

2017-03-16 12:20:52 CET [857-2] LOG:  received fast shutdown request
2017-03-16 12:20:52 CET [857-3] LOG:  aborting any active transactions
2017-03-16 12:20:52 CET [6829-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6829-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6644-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6644-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6649-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6649-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6580-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6580-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6768-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6768-2] XXXXX@YYYY STATEMENT:  select * from "users" where "email" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6253-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6253-2] XXXXX@YYYY STATEMENT:  alter table "users" add column "email_confirmed" boolean not null default '1'
2017-03-16 12:20:52 CET [8465-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [913-2] LOG:  autovacuum launcher shutting down
2017-03-16 12:20:52 CET [6586-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6586-2] XXXXX@YYYY STATEMENT:  select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [31969-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6300-1] XXXXX@YYYY FATAL:  terminating connection due to administrator command
2017-03-16 12:20:52 CET [6974-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [906-1] LOG:  shutting down
2017-03-16 12:20:52 CET [6980-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6978-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6975-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6977-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6976-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6983-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6979-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6982-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6981-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6985-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [6984-1] XXXXX@YYYY FATAL:  the database system is shutting down
2017-03-16 12:20:52 CET [906-2] LOG:  database system is shut down
2017-03-16 12:20:53 CET [7002-1] LOG:  database system was shut down at 2017-03-16 12:20:52 CET 

Upvotes: 0

Views: 887

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 246848

This is the statement that caused your problem:

alter table "users" add column "email_confirmed" boolean not null default '1'

Such an ALTER TABLE requires an ACCESS EXCLUSIVE lock on the table, which conflicts even with read access.

Now the table users seems to be a busy table, which normally is no problem (unless two transactions try to write the same row) since the table locks they require don't conflict with each other.

Now along comes ALTER TABLE and has to wait until all earlier transactions on the table (which hold at least an ACCESS SHARE lock on the table) are done. All queries to the table that start after the ALTER TABLE have to queue behind it, because the lock they need conflicts with the lock that the ALTER TABLE needs.

Once the ALTER TABLE gets its ACCESS EXCLUSIVE lock it should be done rather quickly, and things go back to normal.

Now there is one riddle left: all the other interrupted queries in your log look like they are really short queries, so it is not obvious why you would have to wait that long – they should be done in a couple of microseconds.

To explain that, remember that a lock is not released when the query is finished, but is held until the end of the transaction. So if you run a short query in a transaction and then leave the transaction open, this idle database session will also block the ALTER TABLE. You might not notice under normal circumstances, particularly if all transactions are read-only, but in this case it will prove detrimental.

As an aside, there are other negative side effects of transactions that are kept open, notably that VACUUM cannot do its work, and your tables and indexes can get bloated.

You can check if your application has that problem by checking state from the pg_stat_activity system view. If you can regularly see sessions that are idle in transaction and stay that way for a while, you are looking at the root cause of your problem. This should be fixed!

Instead of restarting the server, you could also have interrupted the ALTER TABLE statement.

If you ever run into such a problem again, query the pg_locks system view. You will see that the session running the ALTER TABLE has a lock that has granted = FALSE, because it is waiting for a table lock. You can also see which other sessions are holding a lock on the table.

Upvotes: 3

Related Questions