Bogdan Dubyk

Reputation: 5530

mysql deadlock while updating

I need to select, manipulate, and update a large amount of data in under 3 minutes. To do this, we decided to create a locking mechanism so that separate processes can run in parallel, with each process locking, selecting, and updating its own rows.

To make this possible, we decided to add a worker_id column to the table.

Table structure:

CREATE TABLE offers
(
    id int(10) unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
    offer_id int(11) NOT NULL,
    offer_sid varchar(255) NOT NULL,
    offer_name varchar(255),
    account_name varchar(255),
    worker_id varchar(255)
);
CREATE UNIQUE INDEX offers_offer_id_offer_sid_unique ON offers (offer_id, offer_sid);
CREATE INDEX offers_offer_id_index ON offers (offer_id);
CREATE INDEX offers_offer_sid_index ON offers (offer_sid);

We also decided to start with 5 parallel processes. To prevent different processes from selecting the same rows, each process claims rows using the formula offer_id % max_amount_of_processes = process_number (process numbers start from 0, so the first is 0 and the last is 4).
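The partitioning rule can be illustrated with a short Python sketch (the function name and the sample id range are illustrative, not from the original system):

```python
# Sketch of the partitioning rule: offer_id % max_processes = process_number.
# Every offer_id maps to exactly one of the 5 workers (numbered 0..4).
MAX_PROCESSES = 5

def process_for(offer_id: int) -> int:
    """Return the worker number (0-based) responsible for this offer_id."""
    return offer_id % MAX_PROCESSES

# Example: offers 10..19 spread over workers 0..4, two per worker.
assignments = {pid: [o for o in range(10, 20) if process_for(o) == pid]
               for pid in range(MAX_PROCESSES)}
print(assignments)  # {0: [10, 15], 1: [11, 16], 2: [12, 17], 3: [13, 18], 4: [14, 19]}
```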

Each process is following the steps:

  1. set worker_id to the current process id for the first 1000 rows using the query: update offers set worker_id = :process_id where worker_id is null and offer_id % 5 = :process_number order by offer_id asc limit 1000
  2. select those rows: select * from offers where worker_id = :process_id order by offer_id asc limit 1000
  3. manipulate the data, store the last offer_id in a variable, and collect the prepared data in another variable for a later update
  4. run the same query from step 1 to lock the next 1000 rows
  5. run the same query as in step 2 with an additional where clause, offer_id > :last_selected_id, to select the next 1000 rows
  6. repeat these steps in a loop until all rows are locked
  7. remove all locks: update offers set worker_id = null where worker_id = :process_id
  8. run the query to update all collected data

The other 4 processes follow the same steps.
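The claim-and-select loop above can be sketched as follows. This is a simulation against SQLite (which does not support MySQL's UPDATE ... ORDER BY ... LIMIT directly, so the claim uses a subquery); the batch size, worker id strings, and table contents are illustrative:

```python
import sqlite3

BATCH = 4            # 1000 in the real system; small here for illustration
MAX_PROCESSES = 5

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE offers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    offer_id INTEGER NOT NULL,
    offer_sid TEXT NOT NULL,
    worker_id TEXT)""")
conn.executemany("INSERT INTO offers (offer_id, offer_sid) VALUES (?, ?)",
                 [(i, f"sid-{i}") for i in range(1, 21)])

def claim_batch(process_number: int, process_id: str) -> list:
    """Step 1: mark the next BATCH unclaimed rows for this worker,
    then step 2: select them back."""
    conn.execute("""UPDATE offers SET worker_id = ?
                    WHERE id IN (SELECT id FROM offers
                                 WHERE worker_id IS NULL AND offer_id % ? = ?
                                 ORDER BY offer_id LIMIT ?)""",
                 (process_id, MAX_PROCESSES, process_number, BATCH))
    return conn.execute("SELECT offer_id FROM offers WHERE worker_id = ? "
                        "ORDER BY offer_id", (process_id,)).fetchall()

rows = claim_batch(0, "worker-0")   # claims offers 5, 10, 15, 20
# ... manipulate the data, loop for the next batch, then release (step 7):
conn.execute("UPDATE offers SET worker_id = NULL WHERE worker_id = ?",
             ("worker-0",))
```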

The issue is that I get a deadlock when all 5 processes simultaneously run the query from step 1 to lock rows (set worker_id), even though each process locks only its own rows according to the formula. I tried setting the transaction isolation level to READ COMMITTED, but the issue remains.

I'm a novice with locking mechanisms, and I need help preventing the deadlocks here or designing a better mechanism.

Upvotes: 1

Views: 2976

Answers (2)

Bill Karwin

Reputation: 562230

The expression offer_id%5 = :process_number cannot use an index, so it can only scan all the rows matched by the first condition, worker_id is null.

You can prove this with two windows:

mysql1> begin;
mysql1> set @p=1;
mysql1> update offers set worker_id = @p where worker_id is null and offer_id%5 = @p;

Don't commit the transaction in window 1 yet.

mysql2> set @p=2;
mysql2> update offers set worker_id = @p where worker_id is null and offer_id%5 = @p;
...waits for about 50 seconds (the value of innodb_lock_wait_timeout), then...
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

This demonstrates that each concurrent session locks overlapping sets of rows, not only the rows that match the modulus expression. So the sessions queue up against each other's locks.

This will get worse if you put all the steps into a transaction as @SloanThrasher suggests. Making the work of each worker take longer will make it hold its locks longer, further delaying the other processes waiting on those locks.


"I do not understand how the updated_at field can cause the issue as I'm still updating other fields"

I'm not sure because you haven't posted the InnoDB deadlock diagnostics from SHOW ENGINE INNODB STATUS.

I do notice that your table has a secondary UNIQUE KEY, which will also require locks. There are some cases of deadlocks that occur because of non-atomicity of the lock assignment.

Worker 1                               Worker 2            

UPDATE SET worker_id = 1
(acquires locks on PK)

                                        UPDATE SET worker_id = 2
                                        (waits for PK locks held by worker 1)

(waits for locks on UNIQUE KEY)

Both worker 1 and worker 2 can therefore be waiting on each other, and enter into a deadlock.

This is just a guess. Another possibility is that the ORM is doing a second UPDATE for the updated_at column, and this introduces another opportunity for a race condition. I haven't quite worked that out mentally, but I think it's possible.

Below is a recommendation for a different system that would avoid these problems:


There's another problem, that you're not really balancing the work over your processes to achieve the best completion time. There might not be an equal number of offers in each group when you split them by modulus. And each offer might not take the same amount of time to process anyway. So some of your workers could finish and have nothing to do, while the last worker is still processing its work.

You can solve both problems, the locking and the load-balancing:

Change the table columns in the following way:

ALTER TABLE offers
  CHANGE worker_id work_state ENUM('todo', 'in progress', 'done') NOT NULL DEFAULT 'todo',
  ADD INDEX (work_state),
  ADD COLUMN updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  ADD INDEX (updated_at);

Create ONE process that reads from the table periodically, and adds the primary key id values of offers in a 'todo' state to a message queue. All the offers, regardless of their offer_id value, get queued in the same way.

SELECT id FROM offers WHERE work_state = 'todo'
/* push each id onto the queue */

Then each of the workers can pull one id at a time from the message queue. The worker does the following steps with each id:

  1. UPDATE offers SET work_state = 'in progress' WHERE id = :id

  2. The worker performs the work for its one offer.

  3. UPDATE offers SET work_state = 'done' WHERE id = :id

These worker queries only reference one offer at a time, and they address the offers by primary key, which will use the PK index and only lock one row at a time.

Once it has finished one offer, then the worker pulls the next offer from the queue.

In this way, the workers will all finish at roughly the same time, and the work will be balanced over the workers better. You can also start or stop workers at any time, without caring which worker number they have, because your offers no longer need to be processed by a worker whose number matches the modulus of the offer_id.

When the workers finish all the offers, the message queue will be empty. Most message queues allow workers to do blocking reads, so while the queue is empty, the worker will just wait for the read to return. When you use a database, the workers have to poll frequently for new work.
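The dispatcher-plus-workers pattern described above can be sketched in Python. This is a minimal, single-process simulation: a dict stands in for the offers table's work_state column, queue.Queue stands in for the message queue, and the worker count and offer ids are illustrative:

```python
import queue
import threading

# A dict stands in for the offers table's work_state column here.
work_state = {offer_id: "todo" for offer_id in range(1, 11)}
state_lock = threading.Lock()
todo = queue.Queue()

# Dispatcher: queue every 'todo' offer, regardless of its offer_id.
for offer_id, state in work_state.items():
    if state == "todo":
        todo.put(offer_id)

def worker():
    while True:
        try:
            offer_id = todo.get(timeout=0.1)  # blocking read on the queue
        except queue.Empty:
            return                            # queue drained, worker exits
        with state_lock:
            work_state[offer_id] = "in progress"  # UPDATE ... WHERE id = :id
        # ... the real per-offer work happens here ...
        with state_lock:
            work_state[offer_id] = "done"         # UPDATE ... WHERE id = :id
        todo.task_done()

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(s == "done" for s in work_state.values()))  # True
```

Each worker only ever touches one offer at a time, which is the property that keeps the row locks from overlapping in the real database version.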

There's a chance a worker will fail during its work, and never mark an offer 'done'. You need to check periodically for orphaned offers. Assume they are not going to be completed, and mark their state 'todo'.

UPDATE offers SET work_state = 'todo' 
WHERE work_state = 'in progress' AND updated_at < NOW() - INTERVAL 5 MINUTE

Choose the interval length so that any worker would certainly have finished an offer within that time unless something had gone wrong. You would probably run this "reset" before the dispatcher queries for offers in the 'todo' state, so that forgotten offers are re-queued.
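The orphan-reset query can be demonstrated in miniature with SQLite (using SQLite's datetime() syntax in place of MySQL's NOW() - INTERVAL; the row ids and timestamps are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE offers (
    id INTEGER PRIMARY KEY,
    work_state TEXT NOT NULL DEFAULT 'todo',
    updated_at TEXT NOT NULL)""")
# One offer stuck 'in progress' for 10 minutes, one touched just now.
conn.execute("INSERT INTO offers VALUES (1, 'in progress', datetime('now', '-10 minutes'))")
conn.execute("INSERT INTO offers VALUES (2, 'in progress', datetime('now'))")

# The reset described above, in SQLite's date syntax:
conn.execute("""UPDATE offers SET work_state = 'todo'
                WHERE work_state = 'in progress'
                  AND updated_at < datetime('now', '-5 minutes')""")

states = dict(conn.execute("SELECT id, work_state FROM offers"))
print(states)  # {1: 'todo', 2: 'in progress'}
```

Only the stale row is re-queued; the row a worker touched recently is left alone.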

Upvotes: 1

Bogdan Dubyk

Reputation: 5530

I found the issue. My ORM was, by default, updating timestamp fields during update operations (to simplify the example above, I removed them from the table structure), and after I turned that off, the deadlocks disappeared. But I still do not understand how the updated_at field could cause the issue, since I'm updating other fields anyway.

Upvotes: 0
