Reputation: 163
I am trying to load many millions of data records, from multiple distinct sources, to a postgresql table with the following design:
CREATE TABLE public.variant_fact (
    variant_id bigint NOT NULL,
    ref_allele text NOT NULL,
    allele text NOT NULL,
    variant_name text NOT NULL,
    start bigint,
    stop bigint,
    variant_attributes jsonb
);

ALTER TABLE public.variant_fact
    ADD CONSTRAINT variant_fact_unique UNIQUE (variant_name, start, stop, allele, ref_allele)
    INCLUDE (ref_allele, allele, variant_name, start, stop);
Where "start" and "stop" are foreign keys and "variant_id" is an auto-incrementing primary key. I am running into issues with the loading speed because in order to perform the UPSERT, I need to check the table to see whether an element exists for each element I upload. I am performing the operation in python using psycopg2 using the execute_values method.
insert_query = """
INSERT INTO variant_fact AS v (variant_id, ref_allele, allele, variant_name, start, stop, variant_attributes)
VALUES %s
ON CONFLICT ON CONSTRAINT variant_fact_unique DO UPDATE
SET variant_attributes = excluded.variant_attributes || v.variant_attributes
RETURNING variant_id;
"""
inserted = psycopg2.extras.execute_values(cur=cursor, sql=insert_query, argslist=argslist, template=None, page_size=50000, fetch=True)
In my case, argslist is a list of tuples to insert into the database. I have tried to squeeze as much speed as possible out of this Python script, but this UPSERT block is not very performant. Short of a different schema (perhaps without atomic element records), are there any ways to boost upload performance? I have already turned off WAL for the table and removed the foreign key constraints on "start" and "stop". Am I missing anything obvious here?
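For reference, "turning off WAL" here means I switched the table to UNLOGGED and dropped the two foreign keys, roughly like this (the DSN and the foreign key constraint names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Make the table UNLOGGED so writes bypass the WAL.
    cur.execute("ALTER TABLE variant_fact SET UNLOGGED;")
    # Drop the foreign keys on "start" and "stop" (constraint names are placeholders).
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT IF EXISTS variant_fact_start_fkey;")
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT IF EXISTS variant_fact_stop_fkey;")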
Upvotes: 0
Views: 1609
Reputation: 44423
Sorting argslist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups hit already-cached pages. Having the table also be clustered on that index would help make sure the table pages are accessed in a cache-friendly way as well (although it won't stay well clustered in the face of new data).
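A minimal sketch of that pre-sort, assuming each tuple's layout matches the INSERT column list in the question (so variant_name is at index 3 and start at index 4):

import psycopg2.extras

# argslist, cursor and insert_query are the same objects as in the question.
# Sort the batch in (variant_name, start) order so that consecutive index
# lookups tend to hit pages that are already cached.  NULL values of start
# are mapped to 0 only so the Python sort does not fail on None.
argslist.sort(key=lambda row: (row[3], row[4] if row[4] is not None else 0))

inserted = psycopg2.extras.execute_values(
    cur=cursor, sql=insert_query, argslist=argslist,
    template=None, page_size=50000, fetch=True)

# Optional, one-time: physically cluster the table on the unique index
# (the clustering degrades again as new rows arrive).
# cursor.execute("CLUSTER variant_fact USING variant_fact_unique;")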
Also, your index is gratuitously double the size it needs to be. There is no point in an INCLUDE for a column that is already part of the key portion of the index. The extra columns cost you CPU and IO to format and write the data (and the WAL), and they also reduce the amount of data that fits in cache.
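For example, something along these lines should keep the same uniqueness guarantee with roughly half the index size (assuming conn is an open psycopg2 connection, and run once outside the load):

# Rebuild the unique constraint without the redundant INCLUDE columns.
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT variant_fact_unique;")
    cur.execute("""
        ALTER TABLE variant_fact
            ADD CONSTRAINT variant_fact_unique
            UNIQUE (variant_name, start, stop, allele, ref_allele);
    """)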
Upvotes: 3
Reputation: 248295
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only ways to speed it up are:

- drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall (a sketch follows this list)
- make sure you have fast storage and enough RAM
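A rough sketch of that drop-and-recreate pattern (index names and definitions are placeholders for whatever secondary indexes actually exist; conn is an open psycopg2 connection):

# Hypothetical example only: the unique constraint that backs ON CONFLICT
# (variant_fact_unique) must stay in place for the upsert to work.
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS variant_fact_start_idx;")  # placeholder name

# ... run the execute_values() upsert batches here ...

with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX variant_fact_start_idx ON variant_fact (start);")  # placeholder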
Upvotes: 1