Sachin Verma

Reputation: 3802

Optimization for fetching from a bulky table

I have a PostgreSQL table with millions of records. I need to process every row, and to track progress I use a column named 'isProcessed': it defaults to false, and I set it to true once a row has been processed.

Now the problem is that there are too many records, and because of exceptions the code skips some of them, leaving them with isProcessed = false, which makes the execution really slow.

I was thinking of using an index, but an index on a plain boolean column doesn't help.

Please suggest an optimization technique or a better practice.

UPDATE:

I don't have the code; it's just a problem my colleagues were asking my opinion about.

Upvotes: 0

Views: 50

Answers (3)

Here is the approach I use. You should be able to store the processing state, including errors. It can be a single column with the values PENDING, PROCESSED, ERROR, or two columns, is_processed and is_error.

This lets you skip records that could not be processed successfully and that, if not skipped, slow down the processing of the good tasks. You can try to reprocess them later, or give DevOps the ability to move tasks from the ERROR state back to PENDING, for example if the failure was caused by a temporarily unavailable resource.

Then you create a partial index on the table that covers only the PENDING tasks.
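For example, with a single status column, a partial index that covers only the pending rows might look like the sketch below (the table and column names here are only an illustration):

-- index only the rows that still need work; it shrinks as tasks get processed
CREATE INDEX tasks_pending_idx ON tasks (id)
WHERE status = 'PENDING';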

Processing is done using the following algorithm (using Spring: transaction and nestedTransaction are Spring transaction templates):

List<Element> batch = getNextBatch();
while (!batch.isEmpty()) {
    final List<Element> currentBatch = batch;
    // the outer transaction spans the whole batch
    transaction.execute((TransactionStatus status) -> {
        for (Element element : currentBatch) {
            try {
                // the nested transaction is backed by a SAVEPOINT,
                // so a failure rolls back only this element
                nestedTransaction.execute((TransactionStatus nestedStatus) -> {
                    processElement(element);
                    markAsProcessed(element);
                    return null;
                });
            } catch (Exception e) {
                // record the failure inside the outer transaction
                markAsFailed(element);
            }
        }
        return null;
    });
    batch = getNextBatch();
}

Several important notes:

  1. Records are fetched in batches; this at least saves round trips to the database and is quicker than retrieving them one by one.
  2. Each element is processed in a nested transaction (implemented with PostgreSQL SAVEPOINTs). This is quicker than processing each element in its own transaction, but has the benefit that a failure while processing one element does not lose the results of the other elements in the batch.
  3. This approach is good when the processing is complex enough that it cannot be done in SQL with a single query over the batch. If processElement is just a simple update of the element, the whole batch can be updated with a single UPDATE statement (see the sketch after this list).
  4. Processing of the elements within a batch may be done in parallel. This requires propagating the transaction to the worker threads.
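As an illustration of note 3, the single-statement batch update could look like the sketch below; the table name, column names, and batch size are hypothetical, and the actual SET clause depends on what processElement computes:

-- process one batch of pending tasks entirely in SQL
UPDATE tasks
SET status = 'PROCESSED'
WHERE id IN (
    SELECT id FROM tasks
    WHERE status = 'PENDING'
    LIMIT 1000
);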

Upvotes: 0

DerekCate

Reputation: 306

Normally an index on a boolean isn't a good idea, but in PostgreSQL you can create an index that contains entries for only one of the values by using a partial index: http://www.postgresql.org/docs/9.3/interactive/indexes-partial.html. It ends up acting as a queue of things for you to process; items drop out of the index once they are done.

CREATE INDEX "yourtable_isProcessed_idx" ON "public"."yourtable"
USING btree ("isProcessed")
WHERE ("isProcessed" IS NOT TRUE);

This will make life easier when it is looking for the next item to process. Ideally you should be processing more than one at a time, particularly if you can do it in a single query, though doing millions at once may be prohibitive. In that situation, you might be able to do

update yourtable
set ....
where id in (select id from yourtable where "isProcessed" = false limit 100)

If you have to do things one at a time, I'd still limit what you retrieve, so potentially retrieve

select id from yourtable where "isProcessed" = false limit 1

Upvotes: 1

Patrick Tucci

Reputation: 1952

Without seeing your code, it's tough to say what is really going on. But any row-by-row processing, which is what this sounds like, is going to take a VERY long time.

Generally speaking, the best way to work with data is in sets. At the end of your process you will ultimately have a set of records where isProcessed needs to become true (the operation succeeded) and a set where it stays false (the operation failed). As you process the data, keep track of which records were updated successfully and which were not, for example by building a list or array of the primary keys (or whatever other data you use to identify the rows).

Then, after you're done processing, run one update to flag the records that were successful and one to update the records that were not. This is a bit more code, but updating each row individually after you process it is going to be awfully slow.
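A minimal sketch of those final set-based updates, assuming the successful and failed ids were collected into lists while processing (the id values and the separate error column are hypothetical):

-- one set-based update for the rows that succeeded
UPDATE yourtable SET "isProcessed" = true WHERE id IN (101, 102, 103);

-- and one for the rows that failed, here assuming a separate error flag
UPDATE yourtable SET is_error = true WHERE id IN (104, 105);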

Again, seeing code would help, but if you're updating each record after you process it, I suspect this is what's slowing you down.

Upvotes: 0
