Fernando Briano

Reputation: 7767

Database: various connections vs. one

We have a PHP application which selects a row from the database, works on it (calls an external API, which in turn uses a web service), and then inserts a new record based on the work done. An AJAX display informs the user of how many records have been processed.

The data is mostly text, so the rows are rather heavy.

The process runs on thousands of records at a time, and the user can choose how many records to start working on. The data comes from a single table, where processed rows are marked as "done". There is no WHERE condition, except for an optional WHERE date BETWEEN date1 AND date2.

We had an argument over which approach is better:

  1. Select one record at a time, opening (and closing) a database connection for each record.
  2. Select all the records in a single query, work on them, and then write all the results back.

Which approach do you consider the most efficient one for a web environment with PHP and PostgreSQL? Why?

Upvotes: 0

Views: 112

Answers (2)

gahooa

Reputation: 137352

It really depends on how much you care about your data (seriously):

Does reliability matter in this case? If the process dies, can you just re-process everything? Or can't you?

Typically when calling a remote web service, you don't want to be calling it twice for the same data item. Perhaps there are side effects (like credit card charges), or maybe it is not a free API...

Anyway, if you don't care about potential duplicate processing, then take the batch approach. It's easy, simple, and fast.

But if you do care about duplicate processing, then do this (a rough code sketch follows the steps):

  1. SELECT 1 record from the table FOR UPDATE (i.e. lock it in a transaction)
  2. UPDATE that record with a status of "Processing"
  3. Commit that transaction

And then

  1. Process the record
  2. Update the record contents, AND
  3. SET the status to "Complete", or to "Error" if something went wrong.
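
A minimal sketch of this claim-then-process pattern in PHP with PDO, assuming a hypothetical work_items table with id, payload, and status columns; the table name, connection details, and the call_external_api() helper are illustrative, not from the question:

    <?php
    // Sketch of the claim-then-process pattern with PDO and PostgreSQL.
    // work_items (id, payload, status) and call_external_api() are
    // hypothetical stand-ins for the question's actual schema and API.
    $db = new PDO('pgsql:host=localhost;dbname=app', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Steps 1-3: claim one pending record inside a short transaction.
    $db->beginTransaction();
    $row = $db->query(
        "SELECT id, payload FROM work_items
         WHERE status = 'Pending'
         LIMIT 1
         FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        $db->rollBack();   // nothing left to process
        exit;
    }

    $claim = $db->prepare("UPDATE work_items SET status = 'Processing' WHERE id = ?");
    $claim->execute([$row['id']]);
    $db->commit();         // lock released; the row is now marked as claimed

    // Steps 4-6: do the slow work outside the transaction, then record the outcome.
    try {
        $result = call_external_api($row['payload']);   // hypothetical helper
        $done = $db->prepare(
            "UPDATE work_items SET payload = ?, status = 'Complete' WHERE id = ?"
        );
        $done->execute([$result, $row['id']]);
    } catch (Exception $e) {
        $err = $db->prepare("UPDATE work_items SET status = 'Error' WHERE id = ?");
        $err->execute([$row['id']]);
    }

One note on concurrency: during the short claim transaction, other workers queue behind the row lock. On PostgreSQL 9.5 and later you could add SKIP LOCKED to the SELECT so idle workers grab the next unclaimed row instead of waiting.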

You can run this code concurrently without fear of it stepping on itself, and you can be confident that the same record will not be processed twice.

You will also be able to see any records that "didn't make it": their status will be stuck at "Processing", or at "Error" if the processing failed.

Upvotes: 1

Kyle Rosendo

Reputation: 25277

If the data is heavy and so is the load, and considering the application is not real-time dependent, the best approach is most definitely to get all the needed data, work on all of it, and then put it back.

Efficiency-wise, regardless of language: if you are fetching single items and working on them individually, you are probably closing the database connection each time. That means that with thousands of items, you will open and close thousands of connections. The overhead of that far outweighs the overhead of returning all of the items in one query and working on them together.
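
A minimal sketch of that batch approach over a single connection, reusing the same hypothetical work_items table and call_external_api() helper as above (both illustrative):

    <?php
    // Sketch of the batch approach: one connection, one SELECT, one pass of work.
    // work_items and call_external_api() are hypothetical, as in the other answer.
    $db = new PDO('pgsql:host=localhost;dbname=app', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Fetch everything to process in a single round trip.
    $rows = $db->query(
        "SELECT id, payload FROM work_items WHERE status = 'Pending'"
    )->fetchAll(PDO::FETCH_ASSOC);

    // Work on every row, writing results back over the same open connection.
    $update = $db->prepare(
        "UPDATE work_items SET payload = ?, status = 'Complete' WHERE id = ?"
    );
    foreach ($rows as $row) {
        $result = call_external_api($row['payload']);   // hypothetical helper
        $update->execute([$result, $row['id']]);
    }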

Upvotes: 0
