Reputation: 5530
I need to select, make manipulation and update a lot of data for less than 3 minutes. And was decided to create some kind of locking mechanism to make the ability to run separate processes (in parallel) and each process should lock, select and update own rows.
To make it possible was decided to add the column worker_id
to the table.
Table structure:
CREATE TABLE offers
(
id int(10) unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
offer_id int(11) NOT NULL,
offer_sid varchar(255) NOT NULL,
offer_name varchar(255),
account_name varchar(255),
worker_id varchar(255),
);
CREATE UNIQUE INDEX offers_offer_id_offer_sid_unique ON offers (offer_id, offer_sid);
CREATE INDEX offers_offer_id_index ON offers (offer_id);
CREATE INDEX offers_offer_sid_index ON offers (offer_sid);
Also, we decided to start from 5 parallel processes and to not allow selection of the same row by different processes we are using the formula: offer_id % max_amount_of_processes = process_number
(process_number starting from 0, so first is 0 and last is 4)
Each process is following the steps:
worker_id
with current process id to the first 1000 rows using the query: update offers set worker_id =: process_id where worker_id is null and offer_id%5 =: process_number order by offer_id asc limit 1000
select * from offers where worker_id =: process_id
order by offer_id asc limit 1000
offer_id
to the variable and prepared data to another variable for further updateand offer_id > :last_selected_id
to select next 1000 rowsupdate offers set worker_id = null where worker_id =: process_id
and the same steps for other 4 processes
The issue here is that I'm getting a deadlock when all 5 processes simultaneously run the query from step 1 to lock rows (set worker_id
) but each process doing lock for own rows which depending on the formula. I tried to set transaction isolation level to READ COMMITED
but still the same issue.
I'm a novice in the locking mechanism and I need a help to prevent deadlocks here or to create the better mechanism
Upvotes: 1
Views: 2976
Reputation: 562230
The expression offer_id%5 = :process_number
cannot use an index, so it can only scan all the rows matched by the first condition, worker_id is null
.
You can prove this with two windows:
mysql1> begin;
mysql1> set @p=1;
mysql1> update offers set worker_id = @p where worker_id is null and offer_id%5 = @p;
Don't commit the transaction in window 1 yet.
mysql2> set @p=2;
mysql2> update offers set worker_id = @p where worker_id is null and offer_id%5 = @p;
...waits for about 50 seconds, or value of innodb_lock_wait_timeout, then...
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
This demonstrates that each concurrent session locks overlapping sets of rows, not only the rows that match the modulus expression. So the sessions queue up against each other's locks.
This will get worse if you put all the steps into a transaction like @SloanThrasher suggests. Making the work of each worker take longer will make them hold only their locks longer, and further delay the other processes waiting on those locks.
I do not understand how updated_at field can cause the issue as I'm still updating other fields
I'm not sure because you haven't posted the InnoDB deadlock diagnostics from SHOW ENGINE INNODB STATUS
.
I do notice that your table has a secondary UNIQUE KEY, which will also require locks. There are some cases of deadlocks that occur because of non-atomicity of the lock assignment.
Worker 1 Worker 2
UPDATE SET worker_id = 1
(acquires locks on PK)
UPDATE SET worker_id = 2
(waits for PK locks held by worker 1)
(waits for locks on UNIQUE KEY)
Both worker 1 and worker 2 can therefore be waiting on each other, and enter into a deadlock.
This is just a guess. Another possibility is that the ORM is doing a second UPDATE for the updated_at
column, and this introduces another opportunity for a race condition. I haven't quite worked that out mentally, but I think it's possible.
Below is a recommendation for a different system that would avoid these problems:
There's another problem, that you're not really balancing the work over your processes to achieve the best completion time. There might not be an equal number of offers in each group when you split them by modulus. And each offer might not take the same amount of time to process anyway. So some of your workers could finish and have nothing to do, while the last worker is still processing its work.
You can solve both problems, the locking and the load-balancing:
Change the table columns in the following way:
ALTER TABLE offers
CHANGE worker_id work_state ENUM('todo', 'in progress', 'done') NOT NULL DEFAULT 'todo',
ADD INDEX (work_state),
ADD COLUMN updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX (updated_at);
Create ONE process that reads from the table periodically, and adds the primary key id
values of offers in a 'todo' state to a message queue. All the offers, regardless of their offer_id value, get queued in the same way.
SELECT id FROM offers WHERE work_state = 'todo'
/* push each id onto the queue */
Then each of the workers can pull one id
at a time from the message queue. The worker does the following steps with each id:
UPDATE offers SET work_state = 'in progress' WHERE id = :id
The worker performs the work for its one offer.
UPDATE offers SET work_state = 'done' WHERE id = :id
These worker queries only reference one offer at a time, and they address the offers by primary key, which will use the PK index and only lock one row at a time.
Once it has finished one offer, then the worker pulls the next offer from the queue.
In this way, the workers will all finish at the same time, and the work will be balanced over the workers better. Also you can start or stop workers at any time, and you don't care about what worker number they are, because your offers don't need to be processed by a worker with the same number as the modulus of the offer_id.
When the workers finish all the offers, the message queue will be empty. Most message queues allow workers to do blocking reads, so while the queue is empty, the worker will just wait for the read to return. When you use a database, the workers have to poll frequently for new work.
There's a chance a worker will fail during its work, and never mark an offer 'done'. You need to check periodically for orphaned offers. Assume they are not going to be completed, and mark their state 'todo'.
UPDATE offers SET work_state = 'todo'
WHERE work_state = 'in progress' AND updated_at < NOW() - INTERVAL 5 MINUTE
Choose the interval length so it's certain that any worker would have finished it by that time unless something had gone wrong. You would probably do this "reset" before the dispatcher queries for current offers todo, so the offers that had been forgotten will be re-queued.
Upvotes: 1
Reputation: 5530
I found the issue. It was because my ORM is by default updating timestamp fields (to simplify the example above I removed them from table structure) while doing an update operation, and after I turn it off the deadlock disappeared. But still, I do not understand how updated_at
field can cause the issue as I'm still updating other fields
Upvotes: 0