Reputation: 1085
I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
What's the best way of detecting and performing updates? Some options are:
What do you suggest is the best option, or if there's a different option out there?
Upvotes: 6
Views: 21488
Reputation: 1796
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:
Increase max_wal_size
Increasing max_wal_size reduces how often checkpoints occur during the bulk load:
ALTER SYSTEM SET max_wal_size = <size>;
You can additionally cut WAL overhead by reducing WAL logging to the minimum and disabling archiving and WAL senders:
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
Note that these changes will require a database restart to take effect.
Run Everything in a Single Transaction
You want all work to be done in a single transaction so that everything rolls back if anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance, as disk is usually the limiting factor.
SET LOCAL maintenance_work_mem = <size>
...
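To make the shape of this concrete, here is a minimal sketch of the overall transaction; the '2GB' value is only an illustration, and the comment stands in for the steps described below:
BEGIN;
SET LOCAL maintenance_work_mem = '2GB';  -- illustrative value; size to your hardware
-- create the temporary table, COPY the data, drop indexes, upsert, delete, clean up
COMMIT;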
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important there, especially if using the PostGIS extension). See this guide for the most important configuration variables for performance.
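For example, a per-transaction setting might look like this (the value is purely illustrative):
SET LOCAL work_mem = '256MB';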
CREATE a TEMPORARY Table With No Constraints
CREATE TEMPORARY TABLE changes(
  id bigint,
  data text
) ON COMMIT DROP; -- ensures this table will be dropped at the end of the transaction
Bulk INSERT Into changes Using COPY FROM
Use the COPY FROM command to bulk insert the raw data into the temporary table:
COPY changes(id,data) FROM ..
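As an illustration, if the third-party data arrived as a CSV file on the database server (the path here is hypothetical), the command could look like:
COPY changes(id, data) FROM '/path/to/third_party.csv' WITH (FORMAT csv, HEADER);
If the file lives on the client instead, psql's \copy meta-command accepts the same options.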
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as you'll want that for the INSERT.
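As a rough sketch with hypothetical object names (your actual index, constraint and trigger names will differ):
DROP INDEX IF EXISTS target_data_idx;                         -- hypothetical secondary index
ALTER TABLE target DROP CONSTRAINT IF EXISTS target_some_fk;  -- hypothetical foreign key
DROP TRIGGER IF EXISTS target_audit_trg ON target;            -- hypothetical trigger
It can help to save the existing definitions first (e.g. from pg_indexes.indexdef and pg_get_constraintdef()) so they can be recreated verbatim later.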
Add a Tracking Column to the target Table
Add a column to the target table to determine whether a row was present in the changes table:
ALTER TABLE target ADD COLUMN seen boolean;
UPSERT From the changes Table Into the target Table
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This avoids the need to perform two separate operations.
INSERT INTO target(id, data, seen)
SELECT
  id,
  data,
  true
FROM
  changes
ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, seen = true;
DELETE Rows Not in the changes Table
DELETE FROM target WHERE seen IS NOT true;
DROP the Tracking Column and the Temporary changes Table
DROP TABLE changes;
ALTER TABLE target DROP COLUMN seen;
Add Back Indexes, Triggers and Constraints
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
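Continuing the hypothetical names from the earlier sketch, recreating them is roughly:
CREATE INDEX target_data_idx ON target (data);        -- hypothetical secondary index
ALTER TABLE target ADD CONSTRAINT target_some_fk
  FOREIGN KEY (id) REFERENCES upstream (id);          -- hypothetical foreign key
CREATE TRIGGER target_audit_trg
  AFTER INSERT OR UPDATE ON target
  FOR EACH ROW EXECUTE FUNCTION target_audit();       -- hypothetical trigger function
Note that CREATE TRIGGER ... EXECUTE FUNCTION requires PostgreSQL 11+; older versions use EXECUTE PROCEDURE.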
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table
This will allow the query planner to make appropriate inferences about the table and will reclaim space taken up by dead tuples.
SET maintenance_work_mem = <size>
VACUUM ANALYZE target;
SET maintenance_work_mem = <original size>
Restore the Original Settings
Restore the WAL settings that were changed at the start to their original values:
ALTER SYSTEM SET max_wal_size = <size>;
...
You may need to restart your database again for these settings to take effect.
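If the values being restored are simply the server defaults, ALTER SYSTEM RESET is an alternative that avoids re-specifying sizes (a sketch):
ALTER SYSTEM RESET max_wal_size;
ALTER SYSTEM RESET wal_level;
ALTER SYSTEM RESET archive_mode;
ALTER SYSTEM RESET max_wal_senders;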
Upvotes: 14