Dashing Boy

Reputation: 499

Update a very large table in PostgreSQL without locking

I have a very large table with 100M rows in which I want to set a column to a value based on another column. An example query showing what I want to do is given below:

UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null

This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What would be the most effective way to do it? I'm thinking of writing a procedure that updates rows in batches of 1000 or 10000 rows using something like LIMIT, but I'm not quite sure how to do it, as I'm not that familiar with Postgres and its pitfalls. Also, neither of the two columns has an index, though the table has other columns that do.
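Something roughly like this is what I have in mind, though I honestly don't know whether it is correct or safe (just a rough sketch, assuming the table has an integer id primary key and that COMMIT is allowed inside a DO block, which I believe needs PostgreSQL 11+):

-- very rough idea: walk the primary key in chunks of 10000
DO $$
DECLARE
    last_id integer := 0;
    max_id  integer;
BEGIN
    SELECT max(id) INTO max_id FROM mytable;
    WHILE last_id < max_id LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        WHERE col1 IS NOT NULL
          AND id > last_id
          AND id <= last_id + 10000;
        last_id := last_id + 10000;
        COMMIT;   -- no idea if committing per batch like this is the right approach
    END LOOP;
END $$;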

I would appreciate a sample procedure code.

Thanks.

Upvotes: 1

Views: 9902

Answers (2)

Belayer

Reputation: 14934

Just an off-the-wall, out-of-the-box idea. Since both col1 and col2 must be null to qualify, which precludes using an index, perhaps building a pseudo-index might be an option. This 'index' would of course be a regular table, but it would only exist for a short period. Additionally, this relieves the lock-time worry.

create table indexer (mytable_id integer  primary key);

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null
   and col2 is null;

The above creates our 'index', which contains only the qualifying rows. Now wrap an update/delete statement in an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.

create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
    -- update one batch of qualifying rows and remove them from the 'index'
    with idx as
       ( update mytable
            set col2 = 'ABCD'
          where col2 is null
            and mytable_id in (select mytable_id
                                 from indexer
                                limit rows_to_process_in
                               )
         returning mytable_id
       )
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    -- an SQL function returns the result of its last statement:
    -- the number of rows still left in the 'index'
    select count(*) from indexer;
$$;

When the function returns 0, all the rows initially selected have been processed. At that point, repeat the entire process to pick up any rows, added or updated in the meantime, which the initial selection didn't identify. That should be a small number, and the process remains available if it's needed later.
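For example, from psql you could simply call it repeatedly with whatever batch size suits you (10000 here is arbitrary) until it reports 0 rows remaining:

select set_mytable_col2(10000);  -- repeat until this returns 0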
Like I said, just an off-the-wall idea.

Edited: I must have read something into it that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting the value in the 'index': no, the 'index' contains a single column - the primary key of the big table (and of itself).
Yes, you would need to run it multiple times unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs. Note that the COMMIT inside the block requires PostgreSQL 11 or later.

Do $$
declare 
    rows_remaining bigint;
begin    
loop
    rows_remaining = set_mytable_col2(200000);
    commit;
    exit when rows_remaining = 0;
end loop;
end; $$; 

Upvotes: 2

Laurenz Albe

Reputation: 247980

There is no update without locking, but you can strive to keep the row locks few and short.

You could simply run batches of this:

UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
      FROM mytable
      WHERE col1 IS NOT NULL
        AND col2 IS DISTINCT FROM 'ABCD'
      LIMIT 10000) AS part
WHERE mytable.id = part.id;

Just keep repeating that statement until it modifies fewer than 10000 rows; then you are done.
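If you want to automate the repetition, a loop along these lines could drive it (just a sketch, assuming PostgreSQL 11 or later so that COMMIT is allowed inside a DO block run outside an explicit transaction, and assuming id is the primary key as above):

DO $$
DECLARE
    n bigint;
BEGIN
    LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        FROM (SELECT id
              FROM mytable
              WHERE col1 IS NOT NULL
                AND col2 IS DISTINCT FROM 'ABCD'
              LIMIT 10000) AS part
        WHERE mytable.id = part.id;

        GET DIAGNOSTICS n = ROW_COUNT;  -- rows modified by this batch
        COMMIT;                         -- release the row locks after each batch
        EXIT WHEN n < 10000;            -- a short batch means we are done
    END LOOP;
END $$;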

Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.

To make that performant, an index like this would help:

CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;

Upvotes: 8
