blueberry12

Reputation: 11

best way to upsert 300 million entries into postgres?

I have a new CSV file every day with 400 million+ entries, which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table, need to be inserted.

I tried loading each day's CSV into a temp table (temptable) and then running:

INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen)
SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id, temptable.datecreated, temptable.lastdayobservedopen
FROM temptable
INNER JOIN typesoffood ON typesoffood.food_type = temptable.food_type
INNER JOIN location ON location.street_name = temptable.street_name
ON CONFLICT ON CONSTRAINT restaurants_pk
DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen

But it takes over 6 hrs.

Is it possible to make this faster?

Edit:

Some more details. There are 3 tables:

- restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id, street_id)
- typesoffood(food_id, food_type) with pk (food_id) and an index on food_type
- location(street_id, street_name) with pk (street_id) and an index on street_name

As for the CSV file, I don't know which entries are new and which are old, but I do know that the majority are already in the database, which means I need to update their lastdayobservedopen date. The rest are to be inserted with lastdayobservedopen set to today. This is supposed to help distinguish restaurants that are no longer in operation (whose lastdayobservedopen column would stop being updated) from currently operating restaurants, whose date in that column should always match today's date. Open to more efficient schema suggestions as well. Thanks to all!
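
For reference, a minimal DDL sketch of the schema described above (column types are assumptions; only the keys, indexes, and the restaurants_pk constraint name come from the question):

-- Schema sketch; integer/text/date types are assumed, not stated in the question.
CREATE TABLE typesoffood (
    food_id   integer PRIMARY KEY,
    food_type text
);
CREATE INDEX ON typesoffood (food_type);

CREATE TABLE location (
    street_id   integer PRIMARY KEY,
    street_name text
);
CREATE INDEX ON location (street_name);

CREATE TABLE restaurants (
    name                text,
    food_id             integer REFERENCES typesoffood (food_id),
    street_id           integer REFERENCES location (street_id),
    datecreated         date,
    lastdayobservedopen date,
    CONSTRAINT restaurants_pk PRIMARY KEY (name, street_id)
);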

Upvotes: 0

Views: 585

Answers (2)

Jonathan Porter

Reputation: 1556

If you can change your Postgres settings, you could take advantage of parallelism in Postgres. Otherwise you could at least speed up the CSV upload using Postgres's bulk-load mechanism, the COPY command.

Without more details it's hard to give better advice.
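
A minimal sketch of the COPY-based load, assuming the staging-table workflow and column names from the question; the UNLOGGED option, the file path, and the worker setting are illustrative assumptions, not requirements:

-- Staging table; UNLOGGED skips WAL for a faster load (data is not crash-safe).
CREATE UNLOGGED TABLE temptable (
    name                text,
    food_type           text,
    street_name         text,
    datecreated         date,
    lastdayobservedopen date
);

-- Bulk-load the daily CSV in one pass instead of row-by-row inserts.
-- (Use \copy from psql instead if the file lives on the client machine.)
COPY temptable (name, food_type, street_name, datecreated, lastdayobservedopen)
FROM '/path/to/daily.csv' WITH (FORMAT csv, HEADER true);

ANALYZE temptable;

-- Session-level parallelism setting; value is illustrative. Note that Postgres
-- restricts parallel plans for data-modifying statements, so this mainly helps
-- the read-only parts of the workload.
SET max_parallel_workers_per_gather = 4;

After the load, the INSERT ... ON CONFLICT from the question runs against temptable as before.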

Upvotes: 0

Alex Zhong

Reputation: 21

There is a command in T-SQL (SQL Server) called BULK INSERT that can handle large volumes of data:

BULK INSERT #temp
FROM 'file location path'

Upvotes: 1
