Update a very large table in PostgreSQL without locking

Question

I have a very large table with 100M rows in which I want to update a column with a value on the basis of another column. The example query to show what I want to do is given below:

UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null

This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What would be the most effective way to do it? I'm thinking of writing a procedure that updates rows in batches of 1,000 or 10,000 using something like LIMIT, but I'm not quite sure how to do it, as I'm not that familiar with Postgres and its pitfalls. Oh, and neither column has an index, but the table has other columns that do.

I would appreciate a sample procedure code.

Thanks.

Belayer · Accepted Answer

Just an off-the-wall, out-of-the-box idea. Because the qualifying condition (both col1 and col2 null) precludes using an index, building a pseudo-index might be an option. This 'index' would of course be a regular table, but it would exist only for a short period. It also relieves the lock-time worry, since the update can then be applied in small batches.

create table indexer (mytable_id integer  primary key);

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null
   and col2 is null;

The above creates our 'index', which contains only the qualifying rows. Now wrap an update/delete statement in an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.

create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
    -- update one batch of queued rows and capture the ids actually updated
    with idx as
       ( update mytable
            set col2 = 'ABCD'
          where col2 is null
            and mytable_id in (select mytable_id
                                 from indexer
                                limit rows_to_process_in
                               )
         returning mytable_id
       )
    -- remove the updated ids from the queue
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    -- report how many ids are still waiting
    select count(*) from indexer;
$$;
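
To make the return value concrete, a single batch could be run by hand as sketched here. The batch size of 10,000 and the column alias rows_remaining are illustrative choices, not part of the answer above.

```sql
-- Process one batch of up to 10,000 queued ids.
-- The result is the number of ids still left in indexer after this batch.
select set_mytable_col2(10000) as rows_remaining;
```

Run outside an explicit transaction block, this call commits on its own, so row locks on mytable are held only for the duration of that one batch.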

When the function returns 0, all rows initially selected have been processed. At that point, repeat the entire process to pick up any rows added or updated since the initial selection. That should be a small number, and the process remains available if it is needed later.
Like I said, just an off-the-wall idea.

Edited: I must have read something into the question that wasn't there concerning col1. The idea remains the same, however: just change the INSERT statement for 'indexer' to meet your requirements. As for setting it in the 'index': no, the 'index' contains a single column, the primary key of the big table (and of itself).
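
As a rough sketch of that change, assuming the goal is the question's rule (set col2 wherever col1 is not null), the 'index' could be filled as follows. The extra "is distinct from" filter is an added assumption that skips rows already holding the target value. Note that the function's own "col2 is null" predicate would then have to go as well: otherwise rows with some other non-null col2 are never updated and never removed from indexer, and the remaining count never reaches 0.

```sql
-- Assumption: qualify rows by the question's condition rather than
-- the original "col1 and col2 are both null" reading.
insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is not null
   and col2 is distinct from 'ABCD';   -- skip rows that already hold the value

-- Same function as in the answer, minus the "col2 is null" predicate,
-- so every queued row is updated and then removed from the queue.
create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
    with idx as
       ( update mytable
            set col2 = 'ABCD'
          where mytable_id in (select mytable_id
                                 from indexer
                                limit rows_to_process_in
                               )
         returning mytable_id
       )
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    select count(*) from indexer;
$$;
```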
Yes, you would need to run it multiple times unless you pass the total number of rows to process as the parameter. The DO block below would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs.

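do $$
declare
    rows_remaining bigint;
begin
    loop
        -- process one batch, then commit so locks are released between passes
        rows_remaining := set_mytable_col2(200000);
        commit;  -- needs PostgreSQL 11+, and the DO must not run inside an explicit transaction
        exit when rows_remaining = 0;
    end loop;
end; $$;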
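
Since the question asked for a procedure, the same loop could also be wrapped in one, with a short pause between batches to ease the load on the live system and its replicas. This is only a sketch: the procedure name run_set_mytable_col2, the pause_seconds parameter and the values passed to CALL are hypothetical, and COMMIT inside a procedure needs PostgreSQL 11 or later with the CALL issued outside an explicit transaction block.

```sql
-- Hypothetical wrapper around set_mytable_col2 (names are illustrative).
create or replace procedure run_set_mytable_col2(
    batch_size    integer,
    pause_seconds double precision default 0.5)
language plpgsql
as $$
declare
    rows_remaining bigint;
begin
    loop
        rows_remaining := set_mytable_col2(batch_size);
        commit;                              -- release locks after each batch
        raise notice '% rows remaining', rows_remaining;
        exit when rows_remaining = 0;
        perform pg_sleep(pause_seconds);     -- throttle between batches
    end loop;
end;
$$;

-- Example invocation: batches of 10,000 with a half-second pause.
call run_set_mytable_col2(10000, 0.5);

-- Cleanup once the helper objects are no longer needed.
drop procedure if exists run_set_mytable_col2(integer, double precision);
drop function if exists set_mytable_col2(integer);
drop table if exists indexer;
```

Smaller batches with a pause trade total run time for shorter lock hold times and smaller bursts of WAL for the replicas to replay; the right values depend on the workload.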
