INSERT new row if value does not exist and get id either way

Question

I would like to insert a record into a table and if the record is already present get its id, otherwise run the insert and get the new record's id.

I will be inserting millions of records and have no idea how to do this in an efficient manner. What I am doing now is to run a select to check if the record is already present, and if not, insert it and get the inserted record's id. As the table is growing I imagine that SELECT is going to kill me.

What I am doing now in python with psycopg2 looks like this:

select = ("SELECT id FROM ... WHERE ...", [...])
cur.execute(*select)
if not cur.rowcount:
    insert = ("INSERT INTO ... VALUES ... RETURNING id", [...])
    cur.execute(*insert)
rid = cur.fetchone()[0]

Is it maybe possible to do something in a stored procedure like this:

BEGIN
    EXECUTE sql_insert;
    RETURN id;
    EXCEPTION WHEN unique_violation THEN
        -- return id of already existing record
        -- from the exception info ?
END;

Any ideas of how optimize a case like this?

Erwin Brandstetter · Accepted Answer

First off, this is obviously not an UPSERT as UPDATE was never mentioned. Similar concurrency issues apply, though.

There will always be a race condition for this kind of task, but you can minimize it to an extremely tiny time slot, while at the same time querying for the ID only once with a data-modifying CTE (introduced with PostgreSQL 9.1):

Given a table tbl:

CREATE TABLE tbl(tbl_id serial PRIMARY KEY, some_col text UNIQUE);

Use this query:

WITH x AS (SELECT 'baz'::text AS some_col) -- enter value(s) once

   , y AS (
   SELECT x.some_col
        , (SELECT t.tbl_id FROM tbl t WHERE t.some_col = x.some_col) AS tbl_id
   FROM   x    
   )

   , z AS (
   INSERT INTO tbl(some_col)
   SELECT y.some_col
   FROM   y
   WHERE  y.tbl_id IS NULL
   RETURNING tbl_id
)

SELECT COALESCE(
         (SELECT tbl_id FROM z)
        ,(SELECT tbl_id FROM y)
       );

CTE x is only for convenience: enter values once.
CTE y retrieves tbl_id - if it already exists.
CTE z inserts the new row - if it doesn't.
The final SELECT avoids running another query on the table with the COALESCE construct.

Now, this can still fail if a concurrent transaction commits a new row with some_col = 'foo' exactly between CTE y and z, but that's extremely unlikely. If it happens you get a duplicate key violation and have to retry. Nothing lost. If you don't face concurrent writes, you can just forget about this.

You can put this into a plpgsql function and rerun the query on duplicate key error automatically.

Goes without saying that you need two indexes in this setup (like displayed in my CREATE TABLE statement above):

a UNIQUE or PRIMARY KEY constraint on tbl_id (which is of serial type!)
another UNIQUE or PRIMARY KEY constraint on some_col

Both implement an index automatically.

INSERT new row if value does not exist and get id either way

Answers (1)

Related Questions