Reputation: 14692
Using Postgres and SQLAlchemy.
I have a job that scans a large table and, for each row, does some calculation and updates some related tables. I was told that I should issue periodic commits inside the loop so as not to keep a large amount of data in memory. I wonder whether such commits have a performance penalty, e.g. restarting a transaction, perhaps taking a database snapshot, etc.
Would using flush() be better in this case?
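For concreteness, a minimal sketch of the kind of loop I mean, written against the SQLAlchemy 2.x ORM; the model names, compute() and the connection URL are placeholders, not my actual code:

```python
# Sketch of the batch job in question; model names, compute() and the
# connection URL are hypothetical placeholders.
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

from myapp.models import BigTable, RelatedTable   # hypothetical ORM models

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")


def compute(row):
    """Stand-in for the real per-row calculation."""
    return row.amount * 2          # hypothetical column


with Session(engine) as session:
    for i, row in enumerate(session.scalars(select(BigTable)), start=1):
        related = session.get(RelatedTable, row.related_id)
        related.value = compute(row)

        if i % 1000 == 0:
            # flush() sends the pending UPDATEs to the server but keeps the
            # current transaction (with its locks and snapshot) open.
            session.flush()
            # commit() would instead end the transaction; a new one begins
            # implicitly on the next statement:
            # session.commit()

    session.commit()
```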
Upvotes: 0
Views: 80
Reputation: 247535
An open transaction won't keep a lot of data in memory.
The advice you got was probably from somebody who is used to Oracle, where large transactions cause problems with UNDO.
The question is how you scan the large table:
If you snarf the whole large table to the client and then update the related tables, it won't matter much whether you commit in between or not.
If you use a cursor to scan the large table (which is normally better), you'd have to create a WITH HOLD cursor if you want it to keep working across transactions. Such a cursor is materialized on the database server and therefore uses more resources there (see the first sketch after this list).
The alternative would be to use queries that fetch only part of the large table and chunk the operation that way (see the second sketch below).
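As a rough illustration of the WITH HOLD approach, here is a sketch using psycopg2's named (server-side) cursors directly; withhold=True makes PostgreSQL declare the cursor WITH HOLD, so it survives the intermediate commits. The table and column names and the len(payload) "calculation" are invented for the example:

```python
# Sketch: server-side WITH HOLD cursor with periodic commits (psycopg2).
import psycopg2

conn = psycopg2.connect("dbname=mydb")            # placeholder DSN

read_cur = conn.cursor(name="big_scan", withhold=True)  # WITH HOLD cursor
read_cur.itersize = 1000          # rows fetched per round trip to the server
read_cur.execute("SELECT id, payload FROM big_table")

write_cur = conn.cursor()
for i, (row_id, payload) in enumerate(read_cur, start=1):
    write_cur.execute(
        "UPDATE related_table SET value = %s WHERE big_id = %s",
        (len(payload), row_id),   # stand-in for the real calculation
    )
    if i % 1000 == 0:
        conn.commit()             # the WITH HOLD cursor stays usable

conn.commit()
read_cur.close()                  # release the server-side cursor explicitly
conn.close()
```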
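And a sketch of the chunked alternative, here done with keyset pagination on the primary key so that each chunk runs in its own short transaction; again, table and column names and the "calculation" are placeholders:

```python
# Sketch: process the big table in chunks, one transaction per chunk.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

CHUNK = 1000
last_id = 0
while True:
    with engine.begin() as conn:          # commits at the end of the block
        rows = conn.execute(
            text("SELECT id, payload FROM big_table "
                 "WHERE id > :last ORDER BY id LIMIT :n"),
            {"last": last_id, "n": CHUNK},
        ).all()
        if not rows:
            break
        for row_id, payload in rows:
            conn.execute(
                text("UPDATE related_table SET value = :v WHERE big_id = :id"),
                {"v": len(payload), "id": row_id},  # stand-in calculation
            )
        last_id = rows[-1][0]             # keyset: resume after the last id
```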
That said, there are reasons why one big transaction might be better or worse than many smaller ones:
With one big transaction, you can use a normal cursor to scan the big table and don't have to bother with WITH HOLD cursors or the chunked alternative described above.
One big transaction gives you transactional guarantees for the whole operation; for example, after an error and a rollback you can simply restart the operation from the beginning.
Shorter transactions reduce the risk of deadlocks.
Shorter transactions allow autovacuum to clean up the effects of previous batches while later batches are being processed. This is a notable advantage if there is a lot of data churn due to the updates, as it will help keep table bloat small.
The best choice depends on the actual situation.
Upvotes: 2