Reputation: 15414
I have an Items table and a Jobs table.
Items start out as IN_PROGRESS; work is performed on them and then handed off to a worker to update. I have an updater process that updates Items with a new status as they come in. The approach I have taken so far has been (in pseudocode):
def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    val (jobs, items) = getParentJobAndAllItems(item)
    val newJobStatus = computeParentJobStatus(jobs, items)
    // do some stuff depending on newJobStatus
  }
}
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that COMPLETE is arrived at multiple times for a job, when I only want to run the on-COMPLETE logic once.
If I change my transaction isolation level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
So my questions are:
Edit: I have reopened this question because I was not satisfied with the previous answer's explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.
Upvotes: 13
Views: 1998
Reputation: 2846
If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes its own SELECT FOR UPDATE, until the first process commits its transaction.
If you use SERIALIZABLE, both processes can run concurrently (processing the same row), but at least one should be expected to fail by the time it does a COMMIT, since the database will detect the conflict. SERIALIZABLE might also fail if it conflicts with any other queries going on in the database at the same time which affect related rows. The real reason to use SERIALIZABLE is precisely if you are trying to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
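To make that concrete, here is a rough sketch of what such a transaction looks like; 42 stands in for an item's primary key, and the job_id/status column names are assumptions borrowed from the question's pseudocode:

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- updateItemWithNewStatus
UPDATE items SET status = 'COMPLETED' WHERE id = 42;
-- getParentJobAndAllItems / computeParentJobStatus read the related rows
SELECT * FROM jobs  WHERE id = (SELECT job_id FROM items WHERE id = 42);
SELECT * FROM items WHERE job_id = (SELECT job_id FROM items WHERE id = 42);
COMMIT; -- may fail with SQLSTATE 40001 (serialization_failure); retry from BEGIN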
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
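On PostgreSQL 9.5 and later this is built in as SKIP LOCKED. A minimal sketch, assuming the worker just wants to claim one unprocessed item:

BEGIN;
-- Grab one item nobody else has locked; rows locked by other workers are skipped
SELECT * FROM items
 WHERE status = 'IN_PROGRESS'
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
-- ... process the returned row and update its status ...
COMMIT;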
Another approach I see more often is to change your "status" column to have a 3rd temporary state which is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS' and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, it will be left in 'IN_PROGRESS' indefinitely.
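In query form that claim/complete cycle looks roughly like the following, where :job_id is just a placeholder for the job's primary key:

-- Claim: only the worker whose UPDATE actually matches the PENDING row wins
BEGIN;
UPDATE jobs
   SET status = 'IN_PROGRESS'
 WHERE id = :job_id
   AND status = 'PENDING';
-- if this reports "UPDATE 0", another worker already claimed the job: stop here
COMMIT;

-- ... do the actual processing outside of any long-running transaction ...

BEGIN;
UPDATE jobs SET status = 'COMPLETE' WHERE id = :job_id;
COMMIT;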
Upvotes: 3
Reputation: 32364
You can use a SELECT FOR UPDATE on items and jobs and work on the affected rows in both tables within a single transaction. That should be enough to enforce the integrity of the whole operation without the overhead of SERIALIZABLE or a table lock.

I would suggest you create a function that is called after an insert or update is made on the items table, passing the PK of the item:
CREATE FUNCTION process_item(item_id integer) RETURNS void AS $$
DECLARE
    item items%ROWTYPE;
    job  jobs%ROWTYPE;
BEGIN -- The function body runs inside the caller's transaction
    SELECT * INTO job FROM jobs
        WHERE id = (SELECT job_id FROM items WHERE id = item_id)
        FOR UPDATE; -- Lock the job row against other workers
    FOR item IN
        SELECT * FROM items WHERE job_id = job.id FOR UPDATE -- Lock this job's items
    LOOP
        -- Work on items individually
        UPDATE items
            SET status = 'COMPLETED'
            WHERE id = item.id;
    END LOOP;
    -- Do any work on the job itself
END; -- Locks are released when the surrounding transaction commits
$$ LANGUAGE plpgsql;
If some other process is already working on the job or any of its associated items, then execution will block until that other lock is released. This is different from SERIALIZABLE, which will do the work until it fails, and then you'd have to re-do all of the processing in a second try.
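To use it, the updater just calls the function with the item's primary key (42 here is only an example value); everything inside runs as one transaction:

BEGIN;
SELECT process_item(42); -- locks the parent job row and its items
COMMIT;                  -- locks released here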
Upvotes: 5