Papa_Helix

Reputation: 716

Different performance of query run against two tables with the same data

BACKGROUND

Currently I have a postgres database with around 100 tables in the public schema. The performance of one of the large tables (around 4 million rows) was quite slow when upserting data into it (INSERT ... ON CONFLICT). When I created a like-for-like copy of the table (same table schema) in a different schema within the same database, the upsert performance improved significantly (from around 6 minutes to under 3).

CODE USED

To create the tables (key is a basic hash):

CREATE TABLE test_table(
    id SERIAL,
    ...,
    key CHARACTER VARYING,
    CONSTRAINT primary_key_constraint PRIMARY KEY (id),
    CONSTRAINT key_uniqueness_constraint UNIQUE (key)
);
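
The question does not show how the like-for-like copy was made; a minimal sketch, assuming the new schema is called test_schema (the name is a placeholder), could be:

-- create the new schema and copy the table definition:
-- columns, defaults, CHECK/NOT NULL constraints, and indexes (PK and UNIQUE included)
CREATE SCHEMA test_schema;
CREATE TABLE test_schema.test_table (LIKE public.test_table INCLUDING ALL);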

To load the data into a dummy table (same schema as test_table) that serves as the upsert source:

COPY dummy_table(
    ...,
    key
)
FROM '<path-to-csv>' DELIMITER ',' CSV HEADER;
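
The staging table is assumed to mirror test_table's columns but carry no unique constraint, so the bulk COPY is not slowed down by index maintenance; a sketch, with a hypothetical payload column standing in for the elided ones:

CREATE TABLE dummy_table (
    id SERIAL,
    payload TEXT,             -- hypothetical stand-in for the elided columns
    key CHARACTER VARYING
);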

To upsert data:

INSERT INTO test_table(
    ...,
    key
)
SELECT
    ...,
    key
FROM
    dummy_table
ON CONFLICT(key)
DO UPDATE SET
    ... = 'some-change';
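
Spelled out with the same hypothetical payload column, and pulling the incoming value from the EXCLUDED pseudo-row rather than a literal, the upsert could look like this (a sketch, not the exact statement used):

INSERT INTO test_table (payload, key)
SELECT payload, key
FROM dummy_table
ON CONFLICT (key)
DO UPDATE SET payload = EXCLUDED.payload;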

To actually test an upsert that hits conflicts, I insert half the data from dummy_table into test_table, followed by the rest once the first query has finished.
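
A sketch of that two-stage load, with the row counts and the payload column again hypothetical:

-- stage 1: first half of the staging rows
INSERT INTO test_table (payload, key)
SELECT payload, key FROM dummy_table ORDER BY id LIMIT 2000000
ON CONFLICT (key) DO UPDATE SET payload = EXCLUDED.payload;

-- stage 2: the remaining rows, once the first statement has finished
INSERT INTO test_table (payload, key)
SELECT payload, key FROM dummy_table ORDER BY id OFFSET 2000000
ON CONFLICT (key) DO UPDATE SET payload = EXCLUDED.payload;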

CONSIDERATIONS

NOTES:

  • all tables in the database, bar the test ones in the new schema, are within the public schema

  • [UPDATE] I realise now that the existing table was heavily upserted into before the tests were carried out; the data in both tables was still exactly the same

EXPLAIN (ANALYZE, BUFFERS) OUTPUT

Insert on test_table  (cost=0.00..189096.00 rows=4000000 width=221) (actual time=1228323.809..1228323.809 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: key_uniqueness_constraint
  Tuples Inserted: 0
  Conflicting Tuples: 4000000
  Buffers: shared hit=69229375 read=3940270 dirtied=3756690 written=3290015
  ->  Seq Scan on all_data  (cost=0.00..189096.00 rows=4000000 width=221) (actual time=76.951..15397.431 rows=4000000 loops=1)
        Buffers: shared hit=4000008 read=129100 dirtied=19 written=3
Planning Time: 157.181 ms
Execution Time: 1228335.801 ms

Insert on test_table  (cost=0.00..189033.00 rows=4000000 width=221) (actual time=260305.406..260305.406 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: key_uniqueness_constraint
  Tuples Inserted: 0
  Conflicting Tuples: 4000000
  Buffers: shared hit=72766787 read=560868 dirtied=431829 written=265784
  ->  Seq Scan on all_data  (cost=0.00..189033.00 rows=4000000 width=221) (actual time=44.326..15560.603 rows=4000000 loops=1)
        Buffers: shared hit=4000008 read=129036 dirtied=8
Planning Time: 88.202 ms
Execution Time: 260312.309 ms

TL;DR

Upsert performance against the table was poor within the public schema; after moving the table to a brand-new schema within the same database, performance increased significantly.

Upvotes: 1

Views: 663

Answers (1)

Laurenz Albe

Reputation: 246788

If changing the fillfactor helped, which is what I read out of your comment, then it must be PostgreSQL's “write amplification” problem:

  • every update writes a new version of the row

  • consequently, every index on the table must get a new entry pointing to the new row as well, which causes more writes and worse performance than in other databases

The remedy is to

  • lower the fillfactor and

  • make sure that no modified column is indexed

Then PostgreSQL can use HOT (“heap only tuple”) updates, which “redirect” from the original item pointer to the new row version and so avoid modifying the index at all.
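
A sketch of both remedies; the fillfactor value is only an example, and pg_stat_user_tables shows how many updates were HOT afterwards:

-- leave free space in each heap page so new row versions can stay on the same page
ALTER TABLE test_table SET (fillfactor = 70);

-- rewrite existing pages so they respect the new fillfactor
VACUUM FULL test_table;

-- verify that updates are now HOT (n_tup_hot_upd should track n_tup_upd)
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test_table';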

Upvotes: 2
