Reputation: 15414
I have an Items table and a Jobs table.
Items start out as IN_PROGRESS; work is performed on them and then handed off to a worker to update. I have an updater process that updates Items with a new status as they come in. The approach I have taken so far has been (in pseudocode):
def work(item: Item) = {
  insideTransaction {
    updateItemWithNewStatus(item)
    val (jobs, items) = getParentJobAndAllItems(item)
    val newJobStatus = computeParentJobStatus(jobs, items)
    // do some stuff depending on newJobStatus
  }
}
Does that make sense? I want this to work in a concurrent environment. The issue I have right now is that COMPLETE is arrived at multiple times for a job, when I only want to run the on-COMPLETE logic once.
If I change my transaction isolation level to SERIALIZABLE, I do get the "ERROR: could not serialize access due to read/write dependencies among transactions" error as described.
So my questions are:
Edit: I have reopened this question because I was not satisfied with the previous answer's explanation. Is anyone able to explain this for me? Specifically, I want some example queries for that pseudocode.
Upvotes: 13
Views: 1998
Reputation: 2846
If you want the jobs to be able to run concurrently, neither SERIALIZABLE nor SELECT FOR UPDATE will work directly.
If you lock the row using SELECT FOR UPDATE, then another process will simply block when it executes its own SELECT FOR UPDATE, until the first process commits its transaction.
If you use SERIALIZABLE, both processes can run concurrently (processing the same row), but at least one should be expected to fail by the time it does a COMMIT, since the database will detect the conflict. SERIALIZABLE might also fail if it conflicts with any other queries going on in the database at the same time which affect related rows. The real reason to use SERIALIZABLE is precisely if you are trying to protect against concurrent database updates made by other jobs, as opposed to blocking the same job from executing twice.
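To make that concrete, here is a rough sketch of what such a transaction looks like; 42 stands in for an item's primary key, and the job_id/status column names are assumptions borrowed from the question's pseudocode:

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- updateItemWithNewStatus
UPDATE items SET status = 'COMPLETED' WHERE id = 42;
-- getParentJobAndAllItems / computeParentJobStatus read the related rows
SELECT * FROM jobs  WHERE id = (SELECT job_id FROM items WHERE id = 42);
SELECT * FROM items WHERE job_id = (SELECT job_id FROM items WHERE id = 42);
COMMIT; -- may fail with SQLSTATE 40001 (serialization_failure); retry from BEGIN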
Note there are tricks to make SELECT FOR UPDATE skip locked rows. If you do that then you can have actual concurrency. See Select unlocked row in Postgresql.
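On PostgreSQL 9.5 and later this is built in as SKIP LOCKED. A minimal sketch, assuming the worker just wants to claim one unprocessed item:

BEGIN;
-- Grab one item nobody else has locked; rows locked by other workers are skipped
SELECT * FROM items
 WHERE status = 'IN_PROGRESS'
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
-- ... process the returned row and update its status ...
COMMIT;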
Another approach I see more often is to change your "status" column to have a 3rd temporary state which is used while a job is being processed. Typically one would have states like 'PENDING', 'IN_PROGRESS', 'COMPLETE'. When your process searches for work to do, it finds a 'PENDING' job, immediately moves it to 'IN_PROGRESS' and commits the transaction, then carries on with the work and finally moves it to 'COMPLETE'. The disadvantage is that if the process dies while processing a job, it will be left in 'IN_PROGRESS' indefinitely.
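In query form that claim/complete cycle looks roughly like the following, where :job_id is just a placeholder for the job's primary key:

-- Claim: only the worker whose UPDATE actually matches the PENDING row wins
BEGIN;
UPDATE jobs
   SET status = 'IN_PROGRESS'
 WHERE id = :job_id
   AND status = 'PENDING';
-- if this reports "UPDATE 0", another worker already claimed the job: stop here
COMMIT;

-- ... do the actual processing outside of any long-running transaction ...

BEGIN;
UPDATE jobs SET status = 'COMPLETE' WHERE id = :job_id;
COMMIT;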
Upvotes: 3
Reputation: 32364
You can use a SELECT FOR UPDATE on items and jobs and work on the affected rows in both tables within a single transaction. That should be enough to enforce the integrity of the whole operation without the overhead of SERIALIZABLE or a table lock.

I would suggest you create a function that is called after an insert or update is made on the items table, passing the PK of the item:
CREATE FUNCTION process_item(item_id integer) RETURNS void AS $$
DECLARE
    item items%ROWTYPE;
    job  jobs%ROWTYPE;
BEGIN -- The function body runs inside the caller's transaction
    SELECT * INTO job FROM jobs
        WHERE id = (SELECT job_id FROM items WHERE id = item_id)
        FOR UPDATE; -- Lock the job row against other workers
    FOR item IN
        SELECT * FROM items WHERE job_id = job.id FOR UPDATE -- Lock this job's items
    LOOP
        -- Work on items individually
        UPDATE items
            SET status = 'COMPLETED'
            WHERE id = item.id;
    END LOOP;
    -- Do any work on the job itself
END; -- Locks are released when the surrounding transaction commits
$$ LANGUAGE plpgsql;
If some other process is already working on the job or any of its associated items, then execution will block until that other lock is released. This is different from SERIALIZABLE, which will do the work until it fails, and then you'd have to re-do all of the processing in a second try.
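To use it, the updater just calls the function with the item's primary key (42 here is only an example value); everything inside runs as one transaction:

BEGIN;
SELECT process_item(42); -- locks the parent job row and its items
COMMIT;                  -- locks released here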
Upvotes: 5