Reputation: 3802
I have a PostgreSQL table with millions of records. I need to process every row, and to track progress I use a column in that table named 'isProcessed': it defaults to false, and when a row is processed I set it to true.
Now the problem is that there are too many records, and because of exceptions the code skips some of them, leaving them with isProcessed=false. That makes the execution really slow.
I was thinking of adding an index, but an index on a boolean column does not help.
Please suggest an optimization technique or a better practice.
UPDATE:
I don't have the code; it's just a problem my colleagues were asking my opinion about.
Upvotes: 0
Views: 50
Reputation: 15861
Here is the approach I use. You should be able to store the processing state, including errors. It can be one column with the values PENDING, PROCESSED, ERROR, or two columns is_processed, is_error.
This lets you skip records which couldn't be successfully processed and which, if not skipped, slow down the processing of good tasks. You may try to reprocess them later, or give DevOps the possibility to move tasks from the ERROR state back to PENDING if the reason for failure was, for example, a temporarily unavailable resource.
Then you create a partial index on the table that includes only PENDING tasks.
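A minimal sketch of that setup in SQL, assuming a tasks table with an id primary key (the table and column names here are illustrative, not from the question):
-- One status column instead of a plain boolean
ALTER TABLE tasks
    ADD COLUMN status text NOT NULL DEFAULT 'PENDING'
    CHECK (status IN ('PENDING', 'PROCESSED', 'ERROR'));

-- Partial index: only PENDING rows appear in it, so it stays
-- small and finding the next batch of work stays fast
CREATE INDEX tasks_pending_idx ON tasks (id) WHERE status = 'PENDING';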
Processing is done using the following algorithm (using Spring: transaction and nestedTransaction are Spring transaction templates):
while (true) {
    List<Element> batch = getNextBatch();
    if (batch.isEmpty()) break;
    transaction.execute((TransactionStatus status) -> {
        for (Element element : batch) {
            try {
                // Nested transaction: a failure rolls back this element only
                nestedTransaction.execute((TransactionStatus nested) -> {
                    processElement(element);
                    markAsProcessed(element);
                    return null;
                });
            } catch (Exception e) {
                // The outer transaction survives; record the failure instead
                markAsFailed(element);
            }
        }
        return null;
    });
}
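The query behind getNextBatch can lean directly on that partial index. A minimal sketch, assuming the status column and index from above, with FOR UPDATE SKIP LOCKED (available since PostgreSQL 9.5) added so that several workers can pull batches concurrently without blocking each other:
-- Fetch the next batch of PENDING tasks; rows already claimed
-- by another worker are skipped instead of waited on
SELECT id
FROM tasks
WHERE status = 'PENDING'
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;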
An important note: if processElement is a rather simple update of the element, then the whole batch may be updated via a single UPDATE statement.
Upvotes: 0
Reputation: 306
Normally an index on a boolean isn't a good idea, but in PostgreSQL you can create an index that contains entries for only one of the values by using a partial index: http://www.postgresql.org/docs/9.3/interactive/indexes-partial.html. It ends up being a queue of things for you to process; items drop off once done.
CREATE INDEX "yourtable_isProcessed_idx" ON "public"."yourtable"
USING btree ("isProcessed")
WHERE ("isProcessed" IS NOT TRUE);
This will make life easier when it is looking for the next item to process. Ideally you should be processing more than one at a time, particularly if you can do it in a single query, though doing millions at once may be prohibitive. In that situation, you might be able to do
update yourtable
set ....
where id in (select id from yourtable where "isProcessed" = false limit 100)
If you have to do things one at a time, I'd still limit what you retrieve, so potentially retrieve
select id from yourtable where "isProcessed" = false limit 1
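A variation worth mentioning (my addition, not part of the original answer): with UPDATE ... RETURNING you can claim the row and fetch its id in a single statement, so nothing can go wrong between retrieving a row and flagging it. If processing may still fail afterwards, you would flip the flag back, or better, record an error state as the other answer suggests:
-- Claim the next unprocessed row and return its id in one round trip
UPDATE yourtable
SET "isProcessed" = true
WHERE id = (SELECT id FROM yourtable
            WHERE "isProcessed" = false
            LIMIT 1)
RETURNING id;
With several concurrent workers, add FOR UPDATE SKIP LOCKED to the subquery so two workers never claim the same row.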
Upvotes: 1
Reputation: 1952
Without seeing your code, it is tough to say what is really going on, but processing rows one at a time, which is what this sounds like, is going to take a VERY long time.
Generally speaking, the best way to work with data is in sets. At the end of your process, you're ultimately going to have a set of records where isProcessed needs to be true (where the operation was successful), and a set where isProcessed needs to be false (where the operation failed). As you process the data, keep track of which records could be updated successfully, as well as which could not be updated. You could do this by making a list or array of the primary key or whatever other data you use to identify the rows. Then, after you're done processing your data, do one update to flag the records that were successful, and one to update the records that were not successful. This will be a bit more code, but updating each row individually after you process it is going to be awfully slow.
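A sketch of those two final updates, borrowing the table and column names from the other answers (ANY with an id array keeps it to exactly two round trips; :succeeded_ids and :failed_ids are placeholder parameters for the arrays you collected while processing):
-- One statement for every row that processed successfully
UPDATE yourtable
SET "isProcessed" = true
WHERE id = ANY(:succeeded_ids);

-- One statement for the failures, assuming an error flag
-- along the lines suggested in the first answer
UPDATE yourtable
SET is_error = true
WHERE id = ANY(:failed_ids);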
Again, seeing code would help, but if you're updating each record after you process it, I suspect this is what's slowing you down.
Upvotes: 0