Ben Ben Sasson

Reputation: 103

What is the best approach for upserting a large number of rows into a single table?

I'm working on a product that involves a large number of upsert operations into a single table.

We are dealing with time-based data and use TimescaleDB hypertables with a 7-day chunk interval. We have concurrent tasks that upsert data into a single table, and in extreme cases there can be 40 concurrent tasks, each upserting around 250k rows, all into the same table.

Initially we went with the approach of deleting all the old rows and then inserting the updated ones with a COPY FROM statement, but when we tested the system at scale these COPYs took a long time to finish, eventually driving the database's CPU usage to 100% and making it unresponsive. We also noticed that the table's index size grew dramatically and filled the disk to 100%, and SELECT statements took extremely long to execute (over 10 minutes). We concluded that the cause was the large number of DELETEs, which led to index fragmentation, so we decided to try another approach.

Following the answers on this post, we decided to copy all the data into a temporary table and then upsert it into the actual table with a single INSERT ... ON CONFLICT statement:

INSERT INTO table SELECT * FROM temp_table ON CONFLICT DO UPDATE...; 
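
For reference, the full flow looks roughly like this (table name, column names, and the conflict target are simplified placeholders, not our real schema):

-- placeholder names; the real target table, conflict target and columns differ
CREATE TEMP TABLE temp_table (LIKE measurements INCLUDING DEFAULTS);

COPY temp_table FROM STDIN WITH (FORMAT csv);

INSERT INTO measurements
SELECT * FROM temp_table
ON CONFLICT (time, device_id)
DO UPDATE SET value = EXCLUDED.value;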

Our tests show that this helped with the index fragmentation issues, but large upsert operations of ~250k rows still take over 4 minutes to execute, and during the upsert SELECT statements take too long to finish, which is unacceptable for us.

I'm wondering what the best approach is to perform this upsert with as little impact on SELECT performance as possible. The only thing that comes to mind right now is to split the insert into smaller chunks:

INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 0 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 50000 ON CONFLICT DO UPDATE ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 100000 ON CONFLICT DO UPDATE ...;
...

But if we batch the inserts, is there any advantage to first copying all the data into a temporary table? Will it perform better than a simple multi-row INSERT statement? How do I decide what the best chunk size is when splitting up the upsert? Does using a temporary table and upserting the rows directly from it allow for bigger chunk sizes?

Is there a better approach to achieve this? Any suggestion would be appreciated.

Upvotes: 1

Views: 1517

Answers (1)

Laurenz Albe

Reputation: 247950

There are a handful of things that you can do to speed up data loading:

  • Have no index or foreign key on the table while you load data (check constraints are fine). I am not suggesting that you drop all your constraints, but you could for example use partitioning, load the data into a new table, then create indexes and constraints and attach the table as a new partition (a rough sketch follows after this list).

  • Load the data with COPY. If you cannot use COPY, use a prepared statement for the INSERT to save on parsing time (also sketched below).

  • Load many rows in a single transaction.

  • Set max_wal_size high so that you get no more checkpoints than necessary.

  • Get fast local disks.
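
A rough sketch of the load-then-attach recipe, using plain declarative range partitioning and made-up table, column, and constraint names (with a TimescaleDB hypertable the chunks are managed for you, so treat this only as an illustration of the general idea):

-- names and the CSV path are placeholders
CREATE TABLE measurements_2024_w01
    (LIKE measurements INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

COPY measurements_2024_w01 FROM '/path/to/week01.csv' WITH (FORMAT csv);

-- build indexes only after the data is in
CREATE INDEX ON measurements_2024_w01 (device_id, time);

-- a CHECK constraint matching the partition bounds lets ATTACH skip the validation scan
ALTER TABLE measurements_2024_w01
    ADD CONSTRAINT w01_bounds CHECK (time >= '2024-01-01' AND time < '2024-01-08');

ALTER TABLE measurements
    ATTACH PARTITION measurements_2024_w01
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-08');

And a minimal sketch of the loading itself (COPY, or a prepared INSERT as a fallback, all in one transaction); the column types, sample row, and max_wal_size value are examples only, not recommendations:

BEGIN;

-- preferred: COPY (in psql the data rows follow on stdin, terminated by \.)
COPY measurements_2024_w01 FROM STDIN WITH (FORMAT csv);

-- fallback if COPY is not an option: parse the INSERT once, execute it many times
PREPARE ins (timestamptz, int, double precision) AS
    INSERT INTO measurements_2024_w01 (time, device_id, value)
    VALUES ($1, $2, $3);
EXECUTE ins ('2024-01-01 00:00:00+00', 42, 1.5);

COMMIT;

-- fewer forced checkpoints during the bulk load (value is only an example)
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();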

Upvotes: 4
