Reputation: 997
I am trying to speed up inserts into a schema. I have seen some questions about this already, but they do not deal with my issue; these are the most relevant:
The tables within my schema all include auto-generated serial keys, so I am using RETURNING id at the end of the inserts. The reason I am not using copy is that I need to check for constraints. Specifically, some fields have unique indices, so I check whether a specific row already exists and, if it does, select its id (for the sake of mapping onto the auto-generated keys).
I do this:
sql = "INSERT INTO %s.%s(%s) VALUES (%s) ON CONFLICT DO NOTHING RETURNING %s;"
sql = sql % (schema_name,table_name,','.join(head[1:]),'%s',head[0])
where head contains the names of the fields in the table. Then I do this:
try:
    # get the auto generated id back
    return_id = pdb.cur.fetchone()[0]
except TypeError:
    # nothing came back, so the row already exists; select its id instead
    sql_exist_id = "SELECT %s FROM %s.%s WHERE %s = '%s'" % (head[0], schema_name, table_name, unique_field, row[1])
    pdb.cur.execute(sql_exist_id)
    return_id = pdb.cur.fetchone()[0]
My question is: Is there a way to do a batch insert (like psycopg2: insert multiple rows with one query) while still checking for constraints and returning keys?
Upvotes: 0
Views: 538
Reputation: 5065
You can get pretty close...
The RETURNING clause will work on multi-value inserts, like so:
insert into some_table (col1, col2) values
(val1, val2), (val3, val4),
(val5, val6), (val7, val8)
returning id
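Driven from psycopg2, that multi-value insert can go over the wire as one statement and the ids can be collected in one fetch. Here is a minimal sketch using psycopg2.extras.execute_values (it assumes psycopg2 2.8+ for the fetch argument; the table, columns, and data are only illustrative, and conn is an open connection as in the snippet further down):
from psycopg2.extras import execute_values

rows = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]  # illustrative data

with conn.cursor() as cur:
    # execute_values expands the single %s into the whole VALUES list,
    # so the batch goes in one round trip; fetch=True collects the RETURNING rows
    ids = execute_values(
        cur,
        'insert into some_table (col1, col2) values %s returning id',
        rows,
        fetch=True)

# ids is a list of one-element tuples, one per inserted row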
Assuming some_table has a serial id, you will get 4 results back from that statement. It will also work with insert into ... select:
insert into some_table (col1, col2)
select col1, col2
from some_other_source returning id
In both cases, though, there is a catch around conflicts: without on conflict do nothing, one bad record makes the entire statement fail and kills the transaction; with it, the conflicting rows are silently skipped, so the ids you get back no longer map one-to-one onto the rows you sent in.
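For example, if col1 has a unique index and 'a' is already in the table, a statement like this (made-up values) returns only a single id:
insert into some_table (col1, col2) values
('a', 1),  -- conflicts with the existing row, silently skipped
('b', 2)   -- actually inserted
on conflict do nothing
returning id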
But note that returning also works on deletes:
delete from some_table where col1 = 'bob' returning id
Putting it together with the suggestion from @JacobH: bulk load into a temp table, filter out the conflicting rows, then insert the remaining, non-conflicting rows. You mentioned copy, so I'm assuming there is already a .csv that roughly matches the destination. Something like this:
with conn.cursor() as stmt:
    # create a constraint-free temp table based on your destination table;
    # the `where 0 = 1` ensures it's created empty
    stmt.execute('select * into temp tmp_some_table from some_table where 0 = 1')
    # bulk load into the temp table:
    with open('some.csv') as infile:
        stmt.copy_from(infile, 'tmp_some_table', sep=',', columns=['col1', 'col2'])
    # strip out your conflicts. note that we are returning col1 (assumed to be a unique key)
    # you can do as many passes as needed...
    stmt.execute('delete from tmp_some_table tst using some_table st where tst.col1 = st.col1 returning tst.col1')
    dupe_keys = stmt.fetchall()  # all your deleted, conflicting keys
    # bulk insert the non-conflicting values
    stmt.execute('insert into some_table (col1, col2) select col1, col2 from tmp_some_table returning id')
    inserted_ids = stmt.fetchall()  # all the ids created by the insert
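If you also need the ids of the rows that already existed (what your per-row fallback SELECT was doing), one more pass inside the same with block will fetch them in bulk; this sketch again assumes col1 is the unique key and id is the serial column:
    if dupe_keys:
        # look up the existing ids for the conflicting keys in one statement;
        # psycopg2 adapts the Python list to a Postgres array for = any(...)
        stmt.execute('select id, col1 from some_table where col1 = any(%s)',
                     ([k[0] for k in dupe_keys],))
        existing_ids = stmt.fetchall()  # (id, col1) pairs for rows that already existed
And if autocommit is off (psycopg2's default), finish with conn.commit() so the inserts stick.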
A bonus is that this should be very fast compared to iterating and calling execute: in essence you are doing everything in four statements, and the bulk of the network I/O is that initial csv import, versus N round trips for N single-record inserts. Everything else happens inside the database. The temp table will probably stay in memory if it's reasonably small.
I realize I'm late to the party on this answer, but I hope it helps.
Upvotes: 2