Reputation: 1
I have about 200,000,000 rows and I am trying to update one of the columns. This query seems particularly slow, so I am not sure whether something is actually wrong or if it is just slow.
UPDATE table1 p
SET location = a.location
FROM table2 a
WHERE p.idv = a.idv;
I currently have an index on idv for both of the tables. Is there some way to make this faster?
Upvotes: 0
Views: 1387
Reputation: 101
I love Postgres, but my impression is that, as an RDBMS with row-based storage, it is just not optimized for such operations. I had to use Postgres for an OLAP data warehouse with relatively small tables (the largest have ~20M rows and ~400-500 columns), and appending new columns is pretty terrible. I haven't found a way around the fact that Postgres needs to find the end of every row, fetch the respective page, and append the new value. This is a tremendous amount of disk I/O, and I am not sure it is possible to avoid.
My current solution is to pool my transformations into a few large data ingestion steps, and then use the "append by rewriting the full table" solution proposed above. It feels very suboptimal, but in my particular application it kind of does the job. If your DB requires appending lots of new columns with relatively few, or even no, new rows, maybe consider using something like Apache Kafka. However, I would be very happy if somebody proposes a solution in Postgres.
In particular, tuning the configuration, e.g. a larger WAL, might improve things, but it won't remove the huge amount of disk I/O required. Partitioning the table, if there is a suitable attribute to partition along, should also greatly improve the operation, given that the different partitions are small enough.
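For the configuration tuning mentioned above, a minimal sketch of raising the WAL ceiling so that large bulk updates checkpoint less often (the values are placeholders, not recommendations):
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf(); -- apply the new settings without a restart
As noted, this can cut checkpoint overhead, but it does not remove the underlying disk I/O.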
Upvotes: 0
Reputation: 10163
Creating a new table can be faster than updating the existing data, so you can try the following:
CREATE TABLE new_table AS
SELECT
    a.*, -- replace with an explicit list of the columns you need, excluding location
    COALESCE(b.location, a.location) AS location -- take the location from table2 when it exists
FROM table1 a
LEFT JOIN table2 b ON b.idv = a.idv;
After creation you can drop the old table and rename the new one.
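A minimal sketch of that last step, assuming there are no dependent views or foreign keys; any indexes and constraints from the old table have to be recreated on the new one:
BEGIN;
DROP TABLE table1;
ALTER TABLE new_table RENAME TO table1;
CREATE INDEX ON table1 (idv); -- recreate the index used by the join
COMMIT;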
Upvotes: 1
Reputation: 90457
I encountered the same problem several weeks ago and finally used the following strategies to drastically improve the speed. I guess it is not the best approach, but it is here for your reference.
Write a simple function which accepts a range of IDs. The function executes the update SQL, but only for that range of IDs (a sketch of such a function is at the end of this answer).
Also add 'location != a.location' to the WHERE clause. I heard that it can help reduce table bloat, which would otherwise hurt query performance and require a VACUUM to restore it.
I execute the function continuously using about 30 threads, which intuitively should reduce the total time required by approximately 30 times. You can use an even higher number of threads if you are ambitious enough.
So it executes something like the below concurrently:
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 1 and 100000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 100001 and 200000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 200001 and 300000;
.....
.....
Another advantage is that I can see the update progress and estimate the remaining time by printing some simple timing statistics in each function call.
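A minimal sketch of such a batching function, assuming an integer key column id on table1 and the table and column names from the question (the function name and parameter names are hypothetical; IS DISTINCT FROM is used instead of != so that NULL locations are handled too):
CREATE OR REPLACE FUNCTION update_location_range(from_id bigint, to_id bigint)
RETURNS void AS $$
BEGIN
    UPDATE table1 p
    SET location = a.location
    FROM table2 a
    WHERE p.idv = a.idv
      AND p.location IS DISTINCT FROM a.location -- skip rows that would not change
      AND p.id BETWEEN from_id AND to_id;
    RAISE NOTICE 'finished range % - % at %', from_id, to_id, clock_timestamp(); -- simple progress/timing output
END;
$$ LANGUAGE plpgsql;
Each worker then calls the function for a different range, e.g. SELECT update_location_range(1, 100000); while another runs SELECT update_location_range(100001, 200000); and so on.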
Upvotes: 1