Horse
Horse

Reputation: 3063

mysql locking rows in a select query for update

So I am writing a simple website crawler for maintenance of in house sites. It will go through each link, adding new links as it finds them, noting down title and h1 tags etc.

Occasionally it duplicates titles and H1 tags, when there is only one in the source when I check it manually.

The reason this is happening is because the crawl script is running via cron and it appears to be overlapping, so processing the same page twice.

The script will basically grab a page that has been uncrawled, then if the http response is 200 it will mark it as crawled, and process what it needs to.

So somewhere between the SELECT and the UPDATE, another thread of the script is running on the same row that was SELECTed.

Is there a way to either SELECT and UPDATE in the same query, or lock the row returned in the SELECT so it cannot be returned again in another query in another thread until I am finished with it?

Have had a look at - http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html and general SELECT FOR UPDATE stuff, but I am still unsure.

Edit

I am using something like this

START TRANSACTION;
SELECT .. FOR UPDATE;
UPDATE .... ;
COMMIT;

But its not liking it. Am def using InnoDB on that table. I am thinking this may not be the way forward, as it simply puts off the processing of the row until after the commit, when I want it to physically not be able to SELECT the row again.

I have covered this off by doing the SELECT, and then afterwards doing an UPDATE to flag a field as crawled before it processes it, but the fact that this is not seamless seems to be causing the problem. I need a way to seamlessly SELECT and UPDATE the field, or SELECT and stop it being SELECTed again until I say so.

Upvotes: 5

Views: 10100

Answers (2)

snap
snap

Reputation: 2792

You answered the question yourself :). SELECT FOR UPDATE is exactly what you need if I understand your question correctly. Remember to turn off autocommit, start a transaction before select and commit the transaction after update.

Update:

I think this will do what you want:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
SELECT .. FOR UPDATE;
UPDATE .... ;
COMMIT TRANSACTION;

Upvotes: 3

Maxim Krizhanovsky
Maxim Krizhanovsky

Reputation: 26699

When you lock the row (via select for update), the other transaction will wait for lock to be released, instead of skipping the row and selecting the next one. Better strategy will be to have a flag colum in the table (none, processing, completed), maybe with a timestamp. The cron grabs row, sets flag to processing and starts processing the page. When another instance of the script is running, it selects rows that are not in the 'processing' state. When cron finishs, it updates the record once again to 'completed'

Upvotes: 3

Related Questions