Why is the postgreSQL waiting while executing vacuum full table? 4T table data

Question

I have a bloated table, its name is "role_info". There are about 20K insert operations and a lot of update operations per day, there are no delete operations. The table is about 4063GB now. We have migrated the table to another database using dump, and the new table is about 62GB, so the table on the old database is bloated very seriously.

PostgreSQL version: 9.5.4

The table schema is below:

CREATE TABLE "role_info" (
  "roleId" bigint NOT NULL,
  "playerId" bigint NOT NULL,
  "serverId" int NOT NULL,
  "status" int NOT NULL,
  "baseData" bytea NOT NULL,
  "detailData" bytea NOT NULL,
  PRIMARY KEY ("roleId")
);
CREATE INDEX "idx_role_info_serverId_playerId_roleId" ON "role_info" ("serverId", "playerId", "roleId");

The average size of field 'detailData' is about 13KB each line.

There are some SQL execution results below:

1)

SELECT 
    relname AS name,
    pg_stat_get_live_tuples(c.oid) AS lives,
    pg_stat_get_dead_tuples(c.oid) AS deads
FROM pg_class c
ORDER BY deads DESC;

Execution Result:

2)

SELECT *, 
       Pg_size_pretty(total_bytes) AS total, 
       Pg_size_pretty(index_bytes) AS INDEX, 
       Pg_size_pretty(toast_bytes) AS toast, 
       Pg_size_pretty(table_bytes) AS TABLE 
FROM   (SELECT *, 
               total_bytes - index_bytes - Coalesce(toast_bytes, 0) AS 
               table_bytes 
        FROM   (SELECT c.oid, 
                       nspname                               AS table_schema, 
                       relname                               AS TABLE_NAME, 
                       c.reltuples                           AS row_estimate, 
                       Pg_total_relation_size(c.oid)         AS total_bytes, 
                       Pg_indexes_size(c.oid)                AS index_bytes, 
                       Pg_total_relation_size(reltoastrelid) AS toast_bytes 
                FROM   pg_class c 
                       LEFT JOIN pg_namespace n 
                              ON n.oid = c.relnamespace 
                WHERE  relkind = 'r') a 
        WHERE  table_schema = 'public' 
        ORDER  BY total_bytes DESC) a;

Execution Result:

3)

I have tried to vacuum full the table "role_info", but it seemed blocked by some other process, and didn't execute at all.

select * from pg_stat_activity where query like '%VACUUM%' and query not like '%pg_stat_activity%';

Execution Result:

select * from pg_locks;

Execution Result:

There are parameters of vacuum:

I have two questions:

How to deal with table bloating? autovacuum seems not working.
Why did the vacuum full blocked?

jjanes · Accepted Answer

With your autovacuum settings, it will sleep for 20ms once for every 10 pages (200 cost_limit / 20 cost_dirty) it dirties. Even more because there will also be cost_hit and cost_miss as well. At that rate is would take over 12 days to autovacuum a 4063GB table which is mostly in need of dirtying pages. That is just the throttling time, not counting the actual work-time, nor the repeated scanning of the indexes. So it the actual run time could be months. The chances of autovacuum getting to run to completion in one sitting without being interrupted by something could be pretty low. Does your database get restarted often? Do you build and drop indexes on this table a lot, or add and drop partitions, or run ALTER TABLE?

Note that in v12, the default setting of autovacuum_vacuum_cost_delay was lowered by a factor of 10. This is not just because of some change to the code in v12, it was because we realized the default setting was just not sensible on modern hardware. So it would probably make sense to backport this change your existing database, if not go even further. Before 12, you can't lower to less than 1 ms, but you could lower it to 1 ms and also either increase autovacuum_vacuum_cost_delay or lower vacuum_cost_page_* setting.

Now this analysis is based on the table already being extremely bloated. Why didn't autovacuum prevent it from getting this bloated in the first place, back when the table was small enough to be autovacuumed in a reasonable time? That is hard to say. We really have no evidence as to what happened back then. Maybe your settings were even more throttled than they are now (although unlikely as it looks like you just accepted the defaults), maybe it was constantly interrupted by something. What is the "autovacuum_count" from pg_stat_all_tables for the table and its toast table?

Why did the vacuum full blocked?

Because that is how it works, as documented. That is why it is important to avoid getting into this situation in the first place. VACUUM FULL needs to swap around filenodes at the end, and needs an AccessExclusive lock to do that. It could take a weaker lock at first and then try to upgrade to AccessExclusive later, but lock upgrades have a strong deadlock risk, so it takes the strongest lock it needs up front.

You need a maintenance window where no one else is using the table. If you think you are already in such window, then you should look at the query text for the process doing the blocking. Because the lock already held is ShareUpdateExclusive, the thing holding it is not a normal query/DML, but some kind of DDL or maintenance operation.

If you can't take a maintenance window now, then you can at least do a manual VACUUM without the FULL. This takes a much weaker lock. It probably won't shrink the table dramatically, but should at least free up space for internal reuse so it stops getting even bigger while you figure out when you can schedule a maintenance window or what your other next steps are.

Why is the postgreSQL waiting while executing vacuum full table? 4T table data

Answers (1)

Related Questions