AnarKi

Reputation: 997

How to speed up inserts while checking for constraints and returning values with Python or Postgres

I am trying to speed up inserts into a schema. I have already seen some questions about this, but they do not deal with my issue.

The tables in my schema all use auto-generated SERIAL keys, so I am using RETURNING id at the end of the inserts. The reason I am not using COPY is that I need to check for constraints. Specifically, some fields have unique indexes, so I check whether a specific row already exists and, if it does, select its id (for the sake of mapping onto the auto-generated keys).

I do this:

sql = "INSERT INTO %s.%s(%s) VALUES (%s) ON CONFLICT DO NOTHING RETURNING %s;"
sql = sql % (schema_name, table_name, ','.join(head[1:]), '%s', head[0])

where head contains the names of the fields in the table. After executing that statement for a row, I do this:

    try:
        # get the auto-generated id back
        return_id = pdb.cur.fetchone()[0]
    except TypeError:
        # fetchone() returned None: ON CONFLICT DO NOTHING skipped the
        # insert, so look up the id of the row that already exists
        sql_exist_id = "SELECT %s FROM %s.%s WHERE %s = '%s'" % (head[0], schema_name, table_name, unique_field, row[1])

My question is: is there a way to do a batch insert (like in psycopg2: insert multiple rows with one query), all while checking for constraints and returning keys?
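For reference, this is roughly the shape I am hoping for, using psycopg2's execute_values helper (psycopg2 >= 2.8) with fetch=True to collect the RETURNING output for the whole batch. A minimal sketch; some_table, its columns, and the connection string are illustrative:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

rows = [('a', 1), ('b', 2), ('c', 3)]
with conn, conn.cursor() as cur:
    # execute_values expands the single VALUES %s placeholder into one
    # multi-row VALUES list and executes it in batches
    ids = execute_values(
        cur,
        "INSERT INTO some_table (col1, col2) VALUES %s"
        " ON CONFLICT DO NOTHING RETURNING id",
        rows,
        fetch=True,  # collect the RETURNING rows
    )
# ids holds one (id,) tuple per row actually inserted; rows skipped by
# ON CONFLICT DO NOTHING return nothing, so len(ids) can be < len(rows)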

Upvotes: 0

Views: 538

Answers (1)

bimsapi

Reputation: 5065

You can get pretty close...

The RETURNING clause will work on multi-value inserts, like so:

insert into some_table (col1, col2) values 
    (val1, val2), (val3, val4), 
    (val5, val6), (val7, val8) 
    returning id 

Assuming some_table has a serial column, you will get 4 rows back from that statement. It will also work with insert into ... select:

insert into some_table (col1, col2) 
    select col1, col2 
      from some_other_source returning id

One caveat in both cases: if you do not specify on conflict do nothing, a single conflicting row aborts the entire statement, so one bad record kills the transaction. If you do specify it, the statement succeeds, but conflicting rows are silently skipped, and returning only yields ids for the rows that were actually inserted.
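To make that concrete, here is a hedged sketch (table, columns, and connection string are illustrative, not from the original answer) of a two-row insert where the first row collides with a unique index on col1:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO some_table (col1, col2) VALUES (%s, %s), (%s, %s)"
        " ON CONFLICT DO NOTHING RETURNING id",
        ('dup', 1, 'new', 2),
    )
    ids = cur.fetchall()
    # if 'dup' already exists, ids is e.g. [(42,)]: the statement
    # succeeds, but the skipped row yields no id to map against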

Note that returning also works on deletes:

delete from some_table where col1 = 'bob' returning id

Putting this together with the suggestion from @JacobH: bulk load into a temp table, filter out the conflicting rows, then insert the remaining non-conflicting rows. You mentioned copy, so I'm assuming there is already a .csv that roughly matches the destination. Something like this:

#conn is assumed to be an open psycopg2 connection
with conn.cursor() as stmt:
    #create a constraint-free temp table based on your destination table
    #the `where 0 = 1` ensures it's empty
    stmt.execute('select * into temp tmp_some_table from some_table where 0 = 1') 

    #bulk load into the temp table:
    with open('some.csv') as infile:
        stmt.copy_from(infile, 'tmp_some_table', sep=',', columns=['col1', 'col2'])

    #strip out your conflicts. note that we are returning col1 (assumed to be a unique key)
    #you can do as many passes as needed...
    stmt.execute('delete from tmp_some_table tst using some_table st where tst.col1 = st.col1 returning col1')
    dupe_keys = stmt.fetchall() #all your deleted, conflicting keys

    #bulk insert non-conflicting values
    stmt.execute('insert into some_table (col1, col2) select col1, col2 from tmp_some_table returning id')
    inserted_ids = stmt.fetchall() #all the ids created from the insert
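One addition the original answer doesn't show, as a hedged continuation sketch (the id and col1 names are assumptions carried over from above): if you also need the ids of the rows that already existed, you can look them up with the conflicting keys collected earlier, still inside the same cursor block:

    #look up ids for the pre-existing rows using the conflicting keys;
    #psycopg2 adapts the Python list to a Postgres array for any()
    stmt.execute('select col1, id from some_table where col1 = any(%s)',
                 ([k[0] for k in dupe_keys],))
    existing_ids = stmt.fetchall() #(key, id) pairs for the skipped rows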

A bonus is that this should be very fast compared to iterating and calling execute: in essence you are doing everything in 4 statements, and the bulk of the network I/O is the initial csv import rather than N round trips for N single-record inserts. Everything else happens inside the database. The temp table will probably stay in memory if it's reasonably small.

Realize I'm late to the party on this answer, but I hope it helps.

Upvotes: 2
