Reputation: 97
I am using psycopg2 in python, but my question is DBMS agnostic (as long as the DBMS supports transactions):
I am writing a python program that inserts records into a database table. The number of records to be inserted is more than a million. When I wrote my code so that it ran a commit on each insert statement, my program was too slow. Hence, I altered my code to run a commit every 5000 records and the difference in speed was tremendous.
My problem is that at some point an exception occurs when inserting records (some integrity check fails) and I wish to commit my changes up to that point, except of course for the last command that caused the exception to happen, and continue with the rest of my insert statements.
I haven't found a way to achieve this; the only thing I've achieved was to capture the exception, rollback my transaction and keep on from that point, where I loose my pending insert statements. Moreover, I tried (deep)copying the cursor object and the connection object without any luck, either.
Is there a way to achieve this functionality, either directly or indirectly, without having to rollback and recreate/re-run my statements?
Thank you all in advance,
George.
Upvotes: 1
Views: 1351
Reputation: 21
If you are committing your transactions after every 5000 record interval, it seems like you could do a little bit of preprocessing of your input data and actually break it out into a list of 5000 record chunks, i.e. [[[row1_data],[row2_data]...[row4999_data]],[[row5000_data],[row5001_data],...],[[....[row1000000_data]]]
Then run your inserts, and keep track of which chunk you are processing as well as which record you are currently inserting. When you get the error, you rerun the chunk, but skip the the offending record.
Upvotes: 2
Reputation: 324731
I doubt you'll find a fast cross-database way to do this. You just have to optimize the balance between the speed gains from batch size and the speed costs of repeating work when an entry causes a batch to fail.
Some DBs can continue with a transaction after an error, but PostgreSQL can't. However, it does allow you to create subtransactions with the SAVEPOINT
command. These are far from free, but they're lower cost than a full transaction. So what you can do is every (say) 100 rows, issue a SAVEPOINT
and then release the prior savepoint. If you hit an error, ROLLBACK TO SAVEPOINT
, commit, then pick up where you left off.
Upvotes: 3