Reputation: 716
Currently I have a Postgres database with around 100 tables in the public schema. Upserting data (insert with conflict handling) into one of the large tables (around 4 million rows) was quite slow. When I created a like-for-like copy of the table (same table definition) in a different schema within the same database, the upsert query performance increased significantly (from 6 minutes to under 3).
To create the tables (key is a basic hash):
CREATE TABLE test_table(
    id SERIAL,
    ...,
    key CHARACTER VARYING,
    CONSTRAINT primary_key_constraint PRIMARY KEY (id),
    CONSTRAINT key_uniqueness_constraint UNIQUE (key)
);
To load the data into a dummy table (same column layout as test_table) that serves as the upsert source:
COPY dummy_table(
    ...,
    key
)
FROM '<path-to-csv>' DELIMITER ',' CSV HEADER;
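dummy_table itself is just a structural copy of test_table; a minimal sketch of one way to create it (a plain copy of the columns without the constraints is an assumption here):

-- Structural copy: column names and types only, no primary key or
-- unique constraint on the staging table.
CREATE TABLE dummy_table (LIKE test_table);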
To upsert data:
INSERT INTO test_table(
    ...,
    key
)
SELECT
    ...,
    key
FROM
    dummy_table
ON CONFLICT (key)
DO UPDATE SET
    ... = 'some-change';
In order to actually test an upsert with conflicts, I insert half the data from dummy_table into test_table, followed by the rest once the first query has finished.
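One reading of that two-step load as a sketch (the LIMIT of 2000000 stands in for roughly half of the ~4 million staged rows; the second step is simply the upsert from above):

-- Step 1: plain insert of roughly half the staged rows, so the next
-- pass has existing keys to conflict with.
INSERT INTO test_table(
    ...,
    key
)
SELECT
    ...,
    key
FROM
    dummy_table
ORDER BY key
LIMIT 2000000;

-- Step 2: run the full upsert shown above; rows already present hit
-- the unique key and take the DO UPDATE path.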
VACUUM FULL and VACUUM ANALYZE were run on the public schema (with its 100 tables) before the tests.

EXPLAIN ANALYSE gives two identical query plans, except one runs much slower.

NOTES:
- all tables in the database, bar the test ones in the new schema, are within the public schema
- [UPDATE] I realise now that the existing table was heavily upserted into before the tests were carried out. The data in both tables was still exactly the same.
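For reference, the maintenance and measurement statements in their per-table form (a sketch; the buffer counts in the plans below come from the BUFFERS option):

-- Rewrite and re-analyse the table before timing the upsert.
VACUUM FULL test_table;
VACUUM ANALYZE test_table;

-- Time the upsert and capture buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO test_table(
    ...,
    key
)
SELECT
    ...,
    key
FROM
    dummy_table
ON CONFLICT (key)
DO UPDATE SET
    ... = 'some-change';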
Slow run:

Insert on test_table (cost=0.00..189096.00 rows=4000000 width=221) (actual time=1228323.809..1228323.809 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: key_uniqueness_constraint
Tuples Inserted: 0
Conflicting Tuples: 4000000
Buffers: shared hit=69229375 read=3940270 dirtied=3756690 written=3290015
-> Seq Scan on all_data (cost=0.00..189096.00 rows=4000000 width=221) (actual time=76.951..15397.431 rows=4000000 loops=1)
Buffers: shared hit=4000008 read=129100 dirtied=19 written=3
Planning Time: 157.181 ms
Execution Time: 1228335.801 ms
Fast run:

Insert on test_table (cost=0.00..189033.00 rows=4000000 width=221) (actual time=260305.406..260305.406 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: key_uniqueness_constraint
Tuples Inserted: 0
Conflicting Tuples: 4000000
Buffers: shared hit=72766787 read=560868 dirtied=431829 written=265784
-> Seq Scan on all_data (cost=0.00..189033.00 rows=4000000 width=221) (actual time=44.326..15560.603 rows=4000000 loops=1)
Buffers: shared hit=4000008 read=129036 dirtied=8
Planning Time: 88.202 ms
Execution Time: 260312.309 ms
Upsert performance on the table was bad within the public schema; after shifting the table to a brand-new schema within the same database, performance significantly increased.
Upvotes: 1
Views: 663
Reputation: 246788
If changing the fillfactor helped, which is what I read out of your comment, then it must be a PostgreSQL “write amplification” problem:
- every update writes a new version of the row
- consequently, every index on the table must get a new entry pointing to the new row as well, which causes more writes and worse performance than in other databases
The remedy is to

- lower the fillfactor and
- make sure that no modified column is indexed
Then PostgreSQL can use HOT (“heap only tuple”) updates, which “redirect” from the original item pointer to the new row version and so avoid modifying the index at all.
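A sketch of what that remedy looks like for the table from the question (fillfactor 70 is just an example value; the pg_stat_user_tables query merely checks whether updates actually take the HOT path):

-- Leave ~30% of each heap page free so new row versions can stay on
-- the same page, a precondition for HOT updates.
ALTER TABLE test_table SET (fillfactor = 70);

-- Rewrite the table so existing pages pick up the new fillfactor.
VACUUM FULL test_table;

-- Compare total updates with heap-only (HOT) updates.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test_table';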
Upvotes: 2