Reputation: 1554
I would like to insert a record into a table and if the record is already present get its id, otherwise run the insert and get the new record's id.
I will be inserting millions of records and have no idea how to do this in an efficient manner. What I am doing now is to run a select to check if the record is already present, and if not, insert it and get the inserted record's id. As the table is growing I imagine that SELECT
is going to kill me.
What I am doing now in python with psycopg2 looks like this:
select = ("SELECT id FROM ... WHERE ...", [...])
cur.execute(*select)
if not cur.rowcount:
insert = ("INSERT INTO ... VALUES ... RETURNING id", [...])
cur.execute(*insert)
rid = cur.fetchone()[0]
Is it maybe possible to do something in a stored procedure like this:
BEGIN
EXECUTE sql_insert;
RETURN id;
EXCEPTION WHEN unique_violation THEN
-- return id of already existing record
-- from the exception info ?
END;
Any ideas of how optimize a case like this?
Upvotes: 3
Views: 1766
Reputation: 656441
First off, this is obviously not an UPSERT
as UPDATE
was never mentioned. Similar concurrency issues apply, though.
There will always be a race condition for this kind of task, but you can minimize it to an extremely tiny time slot, while at the same time querying for the ID only once with a data-modifying CTE (introduced with PostgreSQL 9.1):
Given a table tbl
:
CREATE TABLE tbl(tbl_id serial PRIMARY KEY, some_col text UNIQUE);
Use this query:
WITH x AS (SELECT 'baz'::text AS some_col) -- enter value(s) once
, y AS (
SELECT x.some_col
, (SELECT t.tbl_id FROM tbl t WHERE t.some_col = x.some_col) AS tbl_id
FROM x
)
, z AS (
INSERT INTO tbl(some_col)
SELECT y.some_col
FROM y
WHERE y.tbl_id IS NULL
RETURNING tbl_id
)
SELECT COALESCE(
(SELECT tbl_id FROM z)
,(SELECT tbl_id FROM y)
);
x
is only for convenience: enter values once.y
retrieves tbl_id - if it already exists.z
inserts the new row - if it doesn't.SELECT
avoids running another query on the table with the COALESCE
construct.Now, this can still fail if a concurrent transaction commits a new row with some_col = 'foo' exactly between CTE y
and z
, but that's extremely unlikely. If it happens you get a duplicate key violation and have to retry. Nothing lost. If you don't face concurrent writes, you can just forget about this.
You can put this into a plpgsql function and rerun the query on duplicate key error automatically.
Goes without saying that you need two indexes in this setup (like displayed in my CREATE TABLE
statement above):
UNIQUE
or PRIMARY KEY
constraint on tbl_id
(which is of serial
type!)UNIQUE
or PRIMARY KEY
constraint on some_col
Both implement an index automatically.
Upvotes: 2