Reputation: 68366
I am inserting several tens of thousands of records into a database with referential-integrity rules. Unfortunately, some of the rows are duplicates (in that they already exist in the database). It would be too expensive to check the existence of every row in the database before inserting it, so I intend to proceed by handling the IntegrityError exceptions thrown by SQLAlchemy, logging the error and then continuing.
My code will look something like this:
from sqlalchemy.exc import IntegrityError

# establish connection to db etc.
tbl = obtain_binding_to_sqlalchemy_orm()
datarows = load_rows_to_import()

try:
    conn.execute(tbl.insert(), datarows)
except IntegrityError as ie:
    pass  # eat the error, log it and keep going
except Exception as e:
    pass  # do something else
The (implicit) assumption I am making above is that SQLAlchemy is not rolling the multiple inserts into ONE transaction. If my assumption is wrong, then an IntegrityError would abort the rest of the insert. Can anyone confirm whether the pseudocode "pattern" above will work as expected, or will I end up losing data as a result of the thrown IntegrityError exceptions?
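For what it is worth, the fallback I can think of is to run each insert in its own short transaction, so that a duplicate only skips that single row (a rough sketch reusing the placeholder helpers above, and assuming engine is the Engine created during the connection setup):

from sqlalchemy.exc import IntegrityError

tbl = obtain_binding_to_sqlalchemy_orm()
datarows = load_rows_to_import()

for row in datarows:
    try:
        # each row gets its own transaction, so a duplicate
        # aborts only this one insert
        with engine.begin() as conn:
            conn.execute(tbl.insert(), row)
    except IntegrityError:
        pass  # duplicate row: log it and keep going

I suspect this would be slow for tens of thousands of rows, which is why I would prefer the bulk version above if it is safe.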
Also, if anyone has a better idea of doing this, I will be interested to hear it.
Upvotes: 11
Views: 3719
Reputation: 10401
There is almost no way to tell the SQL engine to do a bulk insert that ignores duplicates, but we can try a fallback solution on the Python end. If your duplicates are not distributed in a very bad way (by "very bad" I mean every batch containing at least one duplicate), this approach pretty much gets the benefits of both worlds.
from sqlalchemy import insert
from sqlalchemy.exc import IntegrityError, SQLAlchemyError

try:
    # fast path: insert the whole batch in one go
    session.bulk_insert_mappings(mapper, items)
    session.commit()
except IntegrityError:
    session.rollback()
    logger.info("bulk inserting rows failed, falling back to one-by-one inserts")
    for item in items:
        try:
            session.execute(insert(mapper).values(**item))
            session.commit()
        except SQLAlchemyError:
            session.rollback()
            logger.exception("Error inserting item: %s", item)
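If your database supports it, you can also push the "ignore duplicates" part down to the engine with a dialect-specific insert. A minimal sketch for PostgreSQL (assuming a Core Table object called my_table; SQLite has a similar ON CONFLICT clause and MySQL has INSERT IGNORE):

from sqlalchemy.dialects.postgresql import insert as pg_insert

# rows is a list of dicts; any row that violates a unique constraint
# is silently skipped by the database itself
stmt = pg_insert(my_table).on_conflict_do_nothing()
with engine.begin() as conn:
    conn.execute(stmt, rows)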
Upvotes: 2
Reputation: 1923
I also encountered this problem when I was parsing ASCII data files to import into a table. The problem is that I instinctively wanted SQLAlchemy to skip the duplicate rows while letting the unique data through. Alternatively, a random error may be thrown for a single row by the current SQL engine, for example because unicode strings are not allowed.
However, this behavior is outside the scope of the SQL interface. SQL APIs, and hence SQLAlchemy, only understand transactions and commits, and do not account for this kind of selective behavior. Moreover, it sounds dangerous to depend on the autocommit feature, since the insertion halts after the exception and the rest of the data is left unimported.
My solution (which I am not sure is the most elegant one) is to process every line in a loop, catch and log the exceptions, and commit the changes at the very end.
Assume that you have somehow acquired the data as a list of rows, where each row is a list of column values. You then process every row in a loop:
# Python 3.5
from sqlalchemy import Table, create_engine
import logging

# Create the engine
# Create the table
# Parse the data file and save the data in `rows`

conn = engine.connect()
trans = conn.begin()  # disables autocommit
exceptions = {}
totalRows = 0
importedRows = 0
ins = table.insert()

for currentRowIdx, cols in enumerate(rows):
    try:
        conn.execute(ins.values(cols))  # try to insert the column values
        importedRows += 1
    except Exception as e:
        exc_name = type(e).__name__  # save the exception name
        if exc_name not in exceptions:
            exceptions[exc_name] = []
        exceptions[exc_name].append(currentRowIdx)
    totalRows += 1

for key, val in exceptions.items():
    logging.warning("%d out of %d lines were not imported due to %s.", len(val), totalRows, key)
logging.info("%d rows were imported.", importedRows)

trans.commit()  # commit at the very end
conn.close()
In order to maximize the speed of this operation, you should disable autocommit. I am using this code with SQLite, and it is still 3-5 times slower than my older version using only sqlite3, even with autocommit disabled. (The reason I ported to SQLAlchemy was to be able to use it with MySQL.)
It is not the most elegant solution, in the sense that it is not as fast as a direct interface to SQLite. If I profile the code and find the bottleneck in the near future, I will update this answer with the solution.
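For comparison, this is the kind of shortcut the direct sqlite3 interface allows (only a sketch with made-up table and column names, since my older script is not shown here): an executemany with SQLite's INSERT OR IGNORE, which makes the engine itself skip the duplicates.

import sqlite3

conn = sqlite3.connect("data.db")
# rows is a list of tuples matching the column order below
conn.executemany("INSERT OR IGNORE INTO mytable (col1, col2) VALUES (?, ?)", rows)
conn.commit()
conn.close()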
Upvotes: 0
Reputation: 69012
It may work like this if you didn't start any transaction before, since in that case SQLAlchemy's autocommit feature will kick in. But you should set it explicitly, as described in the link.
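One way to make that explicit (a sketch with a placeholder URL; whether the link describes this exact option or the older library-level autocommit flag depends on the SQLAlchemy version) is to request the driver-level AUTOCOMMIT isolation level when creating the engine:

from sqlalchemy import create_engine

# placeholder URL; with AUTOCOMMIT every execute() commits on its own,
# so nothing is held in a surrounding transaction
engine = create_engine("postgresql://user:pass@localhost/mydb",
                       isolation_level="AUTOCOMMIT")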
Upvotes: 2