Jacob George

Reputation: 2627

Skipping records that raise errors during commit

I have a PostgreSQL database that I am updating with around 100,000 records. I use session.merge() to insert/update each record, and I commit after every 1,000 records.

i=0
for record in records:
    i+=1
    session.merge(record)
    if i%1000 == 0:
        session.commit()

This code works fine. My database has a table with a UNIQUE field, and some of the records I insert are duplicates, so an error is thrown when this happens, saying the field is not unique. Since I am inserting 1,000 records at a time, a rollback will not help me skip those records. Is there any way I can skip session.merge() for the duplicate records (other than scanning through all the records to find the duplicates first, of course)?

Upvotes: 2

Views: 241

Answers (4)

Jacob George

Reputation: 2627

This is the option that works best for me, because the number of records with duplicate unique keys is minimal.

def update_exception(records, i, failed_records):
    # Remember the pk of the record that violated the UNIQUE constraint,
    # roll back the failed batch, then re-merge it without the bad records.
    failed_records.append(records[i]['pk'])
    session.rollback()
    start_range = (i // 1000) * 1000  # round down to the start of the current batch
    for index in range(start_range, i + 1):
        if records[index]['pk'] not in failed_records:
            ins_obj = Model(**records[index])  # rebuild the object from the record's data
            try:
                session.merge(ins_obj)
            except Exception:
                failed_records.append(records[index]['pk'])
    session.commit()  # recommit the batch, now minus the failed records

Say I hit an error at record 2375: I store the primary key 'pk' of that record in failed_records and then recommit the batch from 2000 to 2375. It seems much faster than committing record by record.
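For context, here is a minimal sketch of how this helper might be wired into the commit loop from the question (the call site isn't shown above, so session, Model, and the dict-shaped records are assumptions carried over from the snippet):

from sqlalchemy.exc import IntegrityError

failed_records = []  # pks of records that violated the UNIQUE constraint
for i in range(len(records)):
    try:
        session.merge(Model(**records[i]))  # an autoflush may raise mid-batch
        if (i + 1) % 1000 == 0:
            session.commit()
    except IntegrityError:
        # rolls back the current batch and re-merges it minus the bad pks
        update_exception(records, i, failed_records)
session.commit()  # commit whatever remains of the last partial batch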

Upvotes: 0

zzzeek

Reputation: 75137

You can get a "partial rollback" using SAVEPOINT, which SQLAlchemy exposes via begin_nested(). You could do it like this:

for i, record in enumerate(records, 1):
    try:
        with session.begin_nested():  # SAVEPOINT; rolled back if the merge fails
            session.merge(record)
    except Exception:
        print("Skipped record %s" % record)
    if not i % 1000:
        session.commit()
session.commit()  # commit the final partial batch

Notes for the above:

  1. In Python, we never do the "i = i + 1" thing; use enumerate().
  2. with session.begin_nested(): is the same as calling begin_nested(), then commit() if no exception was raised, or rollback() if one was (sketched below).
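For illustration, that equivalence expands to roughly this (a sketch of the semantics, not SQLAlchemy's literal internals):

nested = session.begin_nested()  # emits SAVEPOINT
try:
    session.merge(record)
    nested.commit()              # emits RELEASE SAVEPOINT
except Exception:
    nested.rollback()            # emits ROLLBACK TO SAVEPOINT
    raise                        # re-raise so the caller can skip the record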

Upvotes: 2

kgrittn

Reputation: 19471

You might want to consider writing a function along the lines of this example from the PostgreSQL documentation.
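The example in question (presumably the PL/pgSQL UPSERT example in the "Trapping Errors" section) loops: try an UPDATE first, and if no row matched, attempt the INSERT, retrying if a concurrent insert causes a unique violation. The same idea, sketched in Python with SQLAlchemy (the table and column names here are hypothetical):

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError

def merge_row(session, pk, data):
    # Loop until either the UPDATE finds the row or the INSERT wins the race.
    while True:
        updated = session.execute(
            text("UPDATE mytable SET data = :data WHERE pk = :pk"),
            {"pk": pk, "data": data},
        )
        if updated.rowcount > 0:
            return
        try:
            with session.begin_nested():  # SAVEPOINT so a failed INSERT is recoverable
                session.execute(
                    text("INSERT INTO mytable (pk, data) VALUES (:pk, :data)"),
                    {"pk": pk, "data": data},
                )
            return
        except IntegrityError:
            continue  # another transaction inserted the row first; retry the UPDATE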

Upvotes: 1

JosefAssad

Reputation: 4128

I think you already know this, but let's start with a piece of dogma: you specified that the field needs to be unique, so you have to let the database check for uniqueness, or deal with the errors that come from not letting it.

Checking for uniqueness:

# pseudocode made concrete: Model is a hypothetical mapped class
# with a UNIQUE column named "value"
if session.query(Model).filter_by(value=value).first() is None:
    session.add(Model(value=value))
session.commit()

Not checking for uniqueness and catching the exception:

from sqlalchemy.exc import IntegrityError

try:
    session.add(Model(value=value))
    session.commit()
except IntegrityError:
    session.rollback()

The first one has a race condition: another transaction can insert the same value between your check and your commit. I tend to use the second pattern.

Now, bringing this back to your actual issue: if you want to assure uniqueness on a column in the database, then you're going to have to either verify each loaded value's uniqueness yourself before committing it, or let the database give you an error and handle it.

That's obviously a lot slower than adding 100k objects to the session and just committing them all, but that's how databases work.

You might want to consider massaging the data you are loading OUTSIDE the database and BEFORE attempting to load it, to ensure uniqueness. That way you can drop the uniqueness check during the load. This is pretty easy to do with command-line tools if, for example, you're loading from CSV or text files.
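For instance, a minimal pre-load dedupe sketch in Python (assuming, as in the accepted answer, that each record is a dict carrying its unique key under 'pk'):

seen = set()
unique_records = []
for record in records:
    if record['pk'] not in seen:  # keep only the first occurrence of each key
        seen.add(record['pk'])
        unique_records.append(record)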

Upvotes: 3
