Reputation: 163
I am trying to load many millions of data records, from multiple distinct sources, to a postgresql table with the following design:
CREATE TABLE public.variant_fact (
    variant_id bigint NOT NULL,
    ref_allele text NOT NULL,
    allele text NOT NULL,
    variant_name text NOT NULL,
    start bigint,
    stop bigint,
    variant_attributes jsonb
);

ALTER TABLE public.variant_fact
    ADD CONSTRAINT variant_fact_unique UNIQUE (variant_name, start, stop, allele, ref_allele)
    INCLUDE (ref_allele, allele, variant_name, start, stop);
Where "start" and "stop" are foreign keys and "variant_id" is an auto-incrementing primary key. I am running into issues with the loading speed because in order to perform the UPSERT, I need to check the table to see whether an element exists for each element I upload. I am performing the operation in python using psycopg2 using the execute_values method.
insert_query = """
INSERT INTO variant_fact AS v (variant_id, ref_allele, allele, variant_name, start, stop, variant_attributes)
VALUES %s
ON CONFLICT ON CONSTRAINT variant_fact_unique DO UPDATE
SET variant_attributes = excluded.variant_attributes || v.variant_attributes
RETURNING variant_id;
"""
inserted = psycopg2.extras.execute_values(cur=cursor, sql=insert_query, argslist=argslist, template=None, page_size=50000, fetch=True)
In my case, argslist is a list of tuples to insert into the database. I have tried to squeeze as much speed as possible out of this Python script, but this UPSERT block is not very performant. Short of a different schema (perhaps without atomic element records), are there any ways to boost upload performance? I have already turned off WAL for the table and removed the foreign key constraints on "start" and "stop". Am I missing anything obvious here?
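For reference, "turning off WAL" here means I switched the table to UNLOGGED and dropped the two foreign keys, roughly like this (the DSN and the foreign key constraint names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Make the table UNLOGGED so writes bypass the WAL.
    cur.execute("ALTER TABLE variant_fact SET UNLOGGED;")
    # Drop the foreign keys on "start" and "stop" (constraint names are placeholders).
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT IF EXISTS variant_fact_start_fkey;")
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT IF EXISTS variant_fact_stop_fkey;")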
Upvotes: 0
Views: 1609
Reputation: 44423
Sorting argslist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups hit already-cached pages. Having the table also be clustered on that index would help make sure the table pages are accessed in a cache-friendly way as well (although it won't stay well clustered in the face of new data).
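A minimal sketch of that pre-sort, assuming each tuple's layout matches the INSERT column list in the question (so variant_name is at index 3 and start at index 4):

import psycopg2.extras

# argslist, cursor and insert_query are the same objects as in the question.
# Sort the batch in (variant_name, start) order so that consecutive index
# lookups tend to hit pages that are already cached.  NULL values of start
# are mapped to 0 only so the Python sort does not fail on None.
argslist.sort(key=lambda row: (row[3], row[4] if row[4] is not None else 0))

inserted = psycopg2.extras.execute_values(
    cur=cursor, sql=insert_query, argslist=argslist,
    template=None, page_size=50000, fetch=True)

# Optional, one-time: physically cluster the table on the unique index
# (the clustering degrades again as new rows arrive).
# cursor.execute("CLUSTER variant_fact USING variant_fact_unique;")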
Also, your index is gratuitously double the size it needs to be. There is no point in an INCLUDE for a column that is already part of the key portion of the index. The extra columns cost you CPU and IO to format and write the data (and the WAL), and they also reduce the amount of data that fits in cache.
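For example, something along these lines should keep the same uniqueness guarantee with roughly half the index size (assuming conn is an open psycopg2 connection, and run once outside the load):

# Rebuild the unique constraint without the redundant INCLUDE columns.
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE variant_fact DROP CONSTRAINT variant_fact_unique;")
    cur.execute("""
        ALTER TABLE variant_fact
            ADD CONSTRAINT variant_fact_unique
            UNIQUE (variant_name, start, stop, allele, ref_allele);
    """)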
Upvotes: 3
Reputation: 248295
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only ways to speed it up are:

- drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall (a sketch follows this list)
- make sure you have fast storage and enough RAM
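A rough sketch of that drop-and-recreate pattern (index names and definitions are placeholders for whatever secondary indexes actually exist; conn is an open psycopg2 connection):

# Hypothetical example only: the unique constraint that backs ON CONFLICT
# (variant_fact_unique) must stay in place for the upsert to work.
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS variant_fact_start_idx;")  # placeholder name

# ... run the execute_values() upsert batches here ...

with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX variant_fact_start_idx ON variant_fact (start);")  # placeholder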
Upvotes: 1