julx

Reputation: 9091

Bulk/batch update/upsert in PostgreSQL

I'm writing a Django ORM enhancement that attempts to cache models and postpone saving them until the end of the transaction. It's almost all done, but I came across an unexpected difficulty in SQL syntax.

I'm not much of a DBA, but from what I understand, databases don't handle many small queries efficiently; a few bigger queries are much better. For example, it's better to use one large batch insert (say, 100 rows at once) instead of 100 one-liners.

Now, from what I can see, SQL doesn't really supply any statement to perform a batch update on a table. The term seems to be confusing, so I'll explain what I mean by it: I have an array of arbitrary data, each entry describing a single row in a table. I'd like to update certain rows in the table, each using the data from its corresponding entry in the array. The idea is very similar to a batch insert.

For example: my table could have two columns, "id" and "some_col". The array describing the data for a batch update consists of three entries: (1, 'first updated'), (2, 'second updated'), and (3, 'third updated'). Before the update the table contains the rows (1, 'first'), (2, 'second'), (3, 'third'); afterwards it should contain (1, 'first updated'), (2, 'second updated'), (3, 'third updated').

I came across this post:

Why are batch inserts/updates faster? How do batch updates work?

which seems to do what I want, but I can't really figure out the syntax at the end.

I could also delete all the rows that require updating and reinsert them using a batch insert, but I find it hard to believe that this would actually perform any better.

I work with PostgreSQL 8.4, so stored procedures are also possible here. However, as I plan to open-source the project eventually, more portable ideas or ways to do the same thing in a different RDBMS are most welcome.

Follow up question: How to do a batch "insert-or-update"/"upsert" statement?

Test results

I've performed 100 runs of 10 insert operations spread over 4 different tables (so 1000 inserts in total). I tested on Django 1.3 with a PostgreSQL 8.4 backend.

These are the results:

Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces a substantial overhead.

Disclaimer: I didn't introduce any indices apart from default primary key indices, so insert operations could possibly run faster because of that.
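For illustration, this is roughly the kind of call I mean (table name and data are made up; psycopg2 accepts several semicolon-separated statements in a single execute()):

from django.db import connection

# Hypothetical table and data, just to show the shape of the batched call.
rows = [(1, 'first'), (2, 'second'), (3, 'third')]

statements = []
params = []
for pk, value in rows:
    statements.append("INSERT INTO mytable (id, some_col) VALUES (%s, %s)")
    params.extend([pk, value])

cursor = connection.cursor()
# One round trip for all rows instead of one execute() per row.
cursor.execute("; ".join(statements), params)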

Upvotes: 110

Views: 222428

Answers (7)

blindguy

Reputation: 1009

You can also use ON CONFLICT. This lets you have one statement for both adding and updating.

INSERT INTO my_table
(
    id,
    task_id,  -- assumed to be the column carrying the unique constraint
    created_by_id,
    created_at,
    last_modified_by_id,
    last_modified_at
)
VALUES (
    gen_random_uuid(),
    :task/id,
    :user/id,
    now(),
    :user/id,
    now()
)
ON CONFLICT (task_id) DO UPDATE
    SET last_modified_at = now()

Or, if you simply want to update, this avoids the NOT NULL constraint issues you can hit with the insert:

update my_table t
set
  type = s.type,
  modified = s.modified
from (
    values (1, 'type1', now())
  ) s (id, type, modified)
where
  t.id = s.id;
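For the batch flavour of this, a sketch with psycopg2's execute_values helper (connection, table, and column names are made up to match the question's example): many VALUES rows go into a single INSERT, and EXCLUDED carries the incoming values into the UPDATE branch.

import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection and table: (id, some_col) pairs are upserted in one statement.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

rows = [(1, 'first updated'), (2, 'second updated'), (3, 'third updated')]

execute_values(cur, """
    INSERT INTO mytable (id, some_col)
    VALUES %s
    ON CONFLICT (id) DO UPDATE SET some_col = EXCLUDED.some_col
""", rows)

conn.commit()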

Upvotes: 1

nogus

Reputation: 1580

It is pretty fast to unpack JSON into a recordset (PostgreSQL 9.3+):

import json

# cursor is assumed to be an open DB-API cursor (e.g. psycopg2).
big_list_of_tuples = [
    (1, "123.45"),
    ...
    (100000, "678.90"),
]

cursor.execute("""
    UPDATE mytable
    SET myvalue = Q.myvalue
    FROM (
        SELECT (value->>0)::integer AS id, (value->>1)::decimal AS myvalue
        FROM json_array_elements(%s)
    ) Q
    WHERE mytable.id = Q.id
    """,
    [json.dumps(big_list_of_tuples)]
)

Upvotes: 25

Gajus

Reputation: 73808

You can use a DO block to batch these updates, e.g.:

DO $$
DECLARE
  page_size int := 100;
  min_id integer;
  max_id integer;
BEGIN
  SELECT
    min(id),
    max(id)
  INTO
    min_id,
    max_id
  FROM user_account;
  
  FOR index IN min_id..max_id BY page_size LOOP
    UPDATE user_account
    SET user_feature_nids = get_user_group_user_features(user_group_nids)
    WHERE id >= index AND id < index+page_size;
    COMMIT;
    RAISE NOTICE '%', index;
  END LOOP;
END;
$$;

This is a real-world example of iterating through every record in the user_account table and updating the user_feature_nids value using the result of another function (get_user_group_user_features).

Note the COMMIT that follows the UPDATE query: it ensures that your changes are committed in batches rather than as one huge commit. Transaction control inside a DO block requires PostgreSQL 11 or later.

In contrast to driving the loop from a standalone script, this approach avoids a client/server round trip per batch, so you can afford to work in smaller batches.

Upvotes: 4

hagello

Reputation: 3255

Bulk insert

You can modify the bulk insert of three columns by @Ketema:

INSERT INTO "table" (col1, col2, col3)
  VALUES (11, 12, 13) , (21, 22, 23) , (31, 32, 33);

It becomes:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(array[11,21,31]), 
          unnest(array[12,22,32]), 
          unnest(array[13,23,33]))

Replacing the values with placeholders:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(?), unnest(?), unnest(?))

You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hassles and dangers: SQL injection and quoting hell).
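For example, with psycopg2 the ? placeholders become %s and Python lists are adapted to PostgreSQL arrays, so the call could look roughly like this (reusing the values above; cursor is assumed to be an open psycopg2 cursor):

# Python lists are sent as PostgreSQL arrays, so no string concatenation is needed.
cursor.execute(
    'INSERT INTO "table" (col1, col2, col3) '
    'VALUES (unnest(%s), unnest(%s), unnest(%s))',
    ([11, 21, 31], [12, 22, 32], [13, 23, 33])
)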

Bulk update

PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:

update "table" 
  set value = data_table.new_value
  from 
    (select unnest(?) as key, unnest(?) as new_value) as data_table
  where "table".key = data_table.key;

The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:

create table tmp
(
  id serial not null primary key,
  name text,
  age integer
);

insert into tmp (name,age) 
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);

update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name, 
        unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;
 

There are also other posts on StackExchange explaining UPDATE ... FROM using a VALUES clause instead of a subquery. They might be easier to read, but are restricted to a fixed number of rows.
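A rough sketch of that VALUES-based form, reusing the tmp table from above (cur is assumed to be an open psycopg2 cursor, matching the style used elsewhere on this page):

# A fixed number of rows supplied inline via VALUES.
cur.execute("""
    UPDATE tmp SET age = data_table.age
    FROM (VALUES
        ('keith',  44),
        ('leslie', 50),
        ('bexley', 10),
        ('casey',  12)
    ) AS data_table (name, age)
    WHERE tmp.name = data_table.name
""")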

Upvotes: 150

Ketema

Reputation: 6548

Bulk inserts can be done like this:

INSERT INTO "table" ( col1, col2, col3)
  VALUES ( 1, 2, 3 ) , ( 3, 4, 5 ) , ( 6, 7, 8 );

This will insert 3 rows.

Updating multiple columns from a single row-valued expression is defined by the SQL standard, but it is not implemented in PostgreSQL (as of the 9.0 documentation referenced below).

Quote:

"According to the standard, the column-list syntax should allow a list of columns to be assigned from a single row-valued expression, such as a sub-select:

UPDATE accounts SET (contact_last_name, contact_first_name) = (SELECT last_name, first_name FROM salesmen WHERE salesmen.id = accounts.sales_id);"

Reference: http://www.postgresql.org/docs/9.0/static/sql-update.html

Upvotes: 15

aliasmrchips

Reputation: 949

Turn off autocommit and just do one commit at the end. In plain SQL, this means issuing BEGIN at the start and COMMIT at the end. You would need to create a function in order to do an actual upsert.
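A minimal sketch of the autocommit part with psycopg2 (the table and connection string are made up; with Django the ORM manages the transaction, but the idea is the same):

import psycopg2

# psycopg2 opens a transaction implicitly (autocommit is off by default),
# so all updates below are sent to the server and committed only once.
conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

for pk, value in [(1, 'first updated'), (2, 'second updated'), (3, 'third updated')]:
    cur.execute("UPDATE mytable SET some_col = %s WHERE id = %s", (value, pk))

conn.commit()  # one commit at the end instead of one per statement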

Upvotes: 1

atrain

Reputation: 9255

I've used 3 strategies for batch transactional work:

  1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
  2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
  3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.

Upvotes: 26
