Sathish

Reputation: 2180

How to archive a giant postgres table?

We have a Postgres table called history which is almost 900 GB and continuously growing by about 10 GB per day. This table is accessed by a microservice (carts). We have Postgres replication set up (one master and one slave). Two instances of the microservice run in production: one instance uses the master connection to read and write data for some endpoints, and the other instance uses the slave connection for reads only.

Table definition:

id - uuid
data - jsonb column
internal - jsonb column
context - jsonb column
created_date - date
modified_date - date

In this table, the data and internal columns hold a large JSON document in every row. We have concluded that nulling out the data and internal columns will reduce the total space consumed by this table.
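
For reference, the actual footprint can be measured before and after with the standard size functions (history is the table name above):

    SELECT pg_size_pretty(pg_total_relation_size('history'));  -- heap + TOAST + indexes
    SELECT pg_size_pretty(pg_relation_size('history'));        -- heap only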

Question:

  1. How do we archive this giant table (i.e., clean up only the data and internal columns)?
  2. How do we do this with zero downtime and no performance degradation?

Approaches tested so far:

  1. Using pg_repack (the best idea so far, but once pg_repack finishes, the entire rewritten table has to be streamed to the slave again, which causes heavy WAL overhead).
  2. Just nullify the data and internal columns - the problem with this approach is that it initially increases the table size, because Postgres MVCC keeps the old row versions around until they are vacuumed.
  3. Use a temporary table and clone the data (sketched just below this list):
    • Create an UNLOGGED table - historyv2
    • Copy the data from the original table into historyv2 without data and internal
    • Then switch the table to LOGGED (I guess this will also cause WAL overhead)
    • Then rename the tables.
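
For what it's worth, approach 3 boils down to something like the sketch below. The column names come from the table definition above; indexes would typically be created after the copy, and constraints, grants, and any writes that land between the copy and the rename would still have to be handled separately.

    -- Sketch only: create the clone without indexes, load it, then index and swap.
    CREATE UNLOGGED TABLE historyv2 (LIKE history INCLUDING DEFAULTS);

    INSERT INTO historyv2 (id, data, internal, context, created_date, modified_date)
    SELECT id, NULL, NULL, context, created_date, modified_date
    FROM history;

    -- Recreate indexes/constraints on historyv2 here, then make it durable.
    -- The SET LOGGED step rewrites the whole table into the WAL, which is the
    -- overhead mentioned in the list above.
    ALTER TABLE historyv2 SET LOGGED;

    BEGIN;
    ALTER TABLE history   RENAME TO history_old;
    ALTER TABLE historyv2 RENAME TO history;
    COMMIT;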

Can you guys give me some pointers on how to achieve this?

Postgres Version: 9.5

Upvotes: 2

Views: 1647

Answers (1)

xzilla

Reputation: 1161

I always feel that questions like these conflate a few different ideas, which makes them seem more complicated than they should be. What does minimal performance impact mean? Generating lots of WAL may increase file I/O, CPU, and network usage, but in most cases it does not affect the system enough to have a client-facing impact. If no downtime is the most important thing, you should focus on optimizing for that, and not worry about what the process is to get there (within reason).

That said, if I woke up in your shoes, I would first set up partitions for data going forward using table inheritance, so that the data can be more easily segmented and worked on in the future. (This isn't strictly necessary, but it will probably make your life easier down the road.) After that, I would write a script that slowly works through the old data, creating new partitions with the "nulled out" data and interleaving partition creation, deletion of data, and vacuums against the main table. Once that is automated, you can let it churn slowly or during off-hours until it is done. You might need a final repack or VACUUM FULL on the parent once all the data is moved, but it's probably fine even without it. Again, this isn't the simplest idea, and probably not the fastest way to do it (if you could have downtime), but in the end you'll have the schema you want without causing any service disruptions.
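
To make the inheritance idea concrete, one monthly batch might look roughly like the sketch below. The child table name, the date range, and the month-by-month batching are illustrative assumptions; on 9.5 you would also need a trigger (or an application change) to route new inserts into the current child table.

    -- Child partition attached via inheritance (9.5 has no declarative partitioning).
    CREATE TABLE history_2019_01 (
        CHECK (created_date >= DATE '2019-01-01' AND created_date < DATE '2019-02-01')
    ) INHERITS (history);

    -- Move one month of old rows across with data/internal nulled out, then delete
    -- them from the parent; ONLY keeps the statements from touching child tables.
    BEGIN;
    INSERT INTO history_2019_01 (id, data, internal, context, created_date, modified_date)
    SELECT id, NULL, NULL, context, created_date, modified_date
    FROM ONLY history
    WHERE created_date >= DATE '2019-01-01' AND created_date < DATE '2019-02-01';

    DELETE FROM ONLY history
    WHERE created_date >= DATE '2019-01-01' AND created_date < DATE '2019-02-01';
    COMMIT;

    -- Reclaim the dead rows left by the DELETE before starting the next month.
    VACUUM history;

Repeating that for each month (or smaller date ranges, during off-hours) keeps each transaction and the resulting WAL volume bounded, which matches the "let it churn slowly" idea above.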

Upvotes: 1
