Boo
Boo

Reputation: 1644

Select query skips records during concurrent updates

i have table that processed concurrently by N threads.

CREATE TABLE [dbo].[Jobs]
(
    [Id]                    BIGINT          NOT NULL    CONSTRAINT [PK_Jobs] PRIMARY KEY IDENTITY,
    [Data]                  VARBINARY(MAX)  NOT NULL,
    [CreationTimestamp]     DATETIME2(7)    NOT NULL,
    [Type]                  INT             NOT NULL,
    [ModificationTimestamp] DATETIME2(7)    NOT NULL,
    [State]                 INT             NOT NULL,
    [RowVersion]            ROWVERSION      NOT NULL,
    [Activity]              INT                 NULL,
    [Parent_Id]             BIGINT              NULL
)
GO

CREATE NONCLUSTERED INDEX [IX_Jobs_Type_State_RowVersion] ON [dbo].[Jobs]([Type], [State], [RowVersion] ASC) WHERE ([State] <> 100)
GO

CREATE NONCLUSTERED INDEX [IX_Jobs_Parent_Id_State] ON [dbo].[Jobs]([Parent_Id], [State] ASC)
GO

Job is adding to table with State=0 (New) — it can be consumed by any worker in this state. When worker gets this queue item, State changed to 50 (Processing) and job becomes unavailable for other consumers (workers call [dbo].[Jobs_GetFirstByType] with arguments: Type=any, @CurrentState=0, @NewState=50).

CREATE PROCEDURE [dbo].[Jobs_GetFirstByType]
    @Type           INT,
    @CurrentState   INT,
    @NewState       INT
AS
BEGIN

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    DECLARE @JobId BIGINT;

    BEGIN TRAN

        SELECT      TOP(1)
                    @JobId = Id
        FROM        [dbo].[Jobs] WITH (UPDLOCK, READPAST)
        WHERE       [Type] = @Type AND [State] = @CurrentState
        ORDER BY    [RowVersion];


        UPDATE  [dbo].[Jobs]

        SET     [State] = @NewState,
                [ModificationTimestamp] = SYSUTCDATETIME()

        OUTPUT  INSERTED.[Id]
                ,INSERTED.[RowVersion]
                ,INSERTED.[Data]
                ,INSERTED.[Type]
                ,INSERTED.[State]
                ,INSERTED.[Activity]

        WHERE   [Id] = @JobId;

    COMMIT TRAN

END

After processing, job State can be changed to 0 (New) again or it can be once set to 100 (Completed).

CREATE PROCEDURE [dbo].[Jobs_UpdateStatus]
    @Id         BIGINT,
    @State      INT,
    @Activity   INT
AS
BEGIN

    UPDATE  j

    SET     j.[State] = @State,
            j.[Activity] = @Activity,
            j.[ModificationTimestamp] = SYSUTCDATETIME()

    OUTPUT  INSERTED.[Id], INSERTED.[RowVersion]

    FROM    [dbo].[Jobs] j

    WHERE   j.[Id] = @Id;

END

Jobs has hierarchical structure, parent job gets State=100 (Completed) only when all childs are completed. Some workers call stored procedures ([dbo].[Jobs_GetCountWithExcludedState] with @ExcludedState=100) that returns number of incompleted jobs, when it returns 0, parent job State can be set to 100 (Completed).

CREATE PROCEDURE [dbo].[Jobs_GetCountWithExcludedState]
    @ParentId       INT,
    @ExcludedState  INT
AS
BEGIN

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    SELECT  COUNT(1)

    FROM    [dbo].[Jobs]

    WHERE   [Parent_Id] = @ParentId
    AND     [State] <> @ExcludedState

END

The main problem is strange behaviour of this stored procedure. Sometimes it returns 0 for parent job, but it exactly has incompleted jobs. I tryied turn on change data tracking and some debug information (including profiling) — child jobs 100% doesn't have State=100 when SP return 0. It seems that the SP skips records, that are not in 100 (Completed) state, but why it happen and how we can prevent this?

UPD: Calling [dbo].[Jobs_GetCountWithExcludedState] starts when parent job has childs. There сan be no situation when worker starts checking child jobs without their existence, because creating childs and setting to parent job checking activity wrapped in transaction:

using (var ts = new TransactionScope())
{
    _jobManager.AddChilds(parentJob);

    parentJob.State = 0;
    parentJob.Activity = 30; // in this activity worker starts checking child jobs

    ts.Complete();
}

Upvotes: 14

Views: 453

Answers (4)

Brian Pressler
Brian Pressler

Reputation: 6713

It would be very disturbing if in fact your procedure Jobs_GetCountWithExcludedState was returning a count of 0 records when there were in fact committed records matching your criteria. It's a pretty simple procedure. So there are two possibilities:

  • The query is failing due to an issue with SQL Server or data corruption.
  • There actually are no committed records matching the criteria at the time the procedure is run.

Corruption is an unlikely, but possible cause. You can check for corruption with DBCC CHECKDB.

Most likely there really are no committed job records that have a Parent_ID equal to the @ParentId parameter and are not in a state of 100 at the time it is run.

I emphasize committed because that's what the transaction will see.

You never really explain in your question how the Parent_ID gets set on the jobs. My first thought is that maybe you are checking for unprocessed child jobs and it finds none, but then another process adds it as the Parent_ID of another incomplete job. Is this a possibility?

I see you added an update to show that when you add a child job record that the update of the parent and child records are wrapped in a transaction. This is good, but not the question I was asking. This is the scenario that I am considering as a possibility:

  • A Job Record is inserted and committed for the parent.
  • Jobs_GetFirstByType grabs the parent job.
  • A worker thread processes it and calls Jobs_UpdateStatus and updates it's status to 100.
  • Something calls Jobs_GetCountWithExcludedState with the job and returns 0.
  • A child job is created and attached to the completed parent job record... which makes it now incomplete again.

I'm not saying that this is what is happening... I'm just asking if it's possible and what steps are you taking to prevent it? For example, in your code above in the update to your question you are selecting a ParentJob to attach the child to outside of the transaction. Could it be that you are selecting a parent job and then it gets completed before you run the transaction that adds the child to the parent? Or maybe the last child job of a parent job completes so the worker thread checks and marks the parent complete, but some other worker thread has already selected the job to be the parent for a new child job?

There are many different scenarios that could cause the symptom you are describing. I believe that the problem is to be found in some code that you have not shared with us yet particularly about how jobs are created and the code surrounding calls to Jobs_GetCountWithExcludedState. If you can give more information I think you will be more likely to find a usable answer, otherwise the best we can do is guess all the things that could happen in the code we can't see.

Upvotes: 2

AXMIM
AXMIM

Reputation: 2472

Your sql code looks fine. Therefore, the problem lies in how it is used.

Hypothesis #0
Procedure "Jobs_GetCountWithExcludedState" is called with a totally wrong ID. Because yes sometime problem are really just a little mistake. I doubt this is your case however.


Hypothesis #1
The code checking the field "Activity = 30" is doing it in "READ UNCOMMITED" isolation level. It would then call "Jobs_GetCountWithExcludedState" with an parentID that may not be really ready for it because the insertion transaction may not have ended yet or have been rollbacked.


Hypothesis #2
Procedure "Jobs_GetCountWithExcludedState" is called with an id that has no longer child. There could be many reason why this happen.
For example,

  • Transaction that inserted child job failed for whatever reason but this procedure was called anyway.
  • A single child job was deleted and was about to be replace.
  • etc

Hypothesis #3
Procedure "Jobs_GetCountWithExcludedState" is called before the childJob get their parentId assigned.


Conclusion
As you can see, we need more information on two things :
1. How "Jobs_GetCountWithExcludedState" is called.
2. How job are inserted. Is the parentId assigned at the insertion time or it is updated a bit later? Are they inserted in batch? Is there annexed code to it that do other stuff?

This is also where I recommend you to have a look to verify above hypothesis because the problem is most likely in the program.


Possible refactoring to invalidate all those hypothesis
Have the database tell the application which parents tasks are completed directly instead.

  • Just like "Jobs_GetFirstByType", there could be Jobs_GetFirstParentJobToComplete" which could return the next uncompleted parent job with completed childs if any. It could also be a view that return all of them. Either ways, usage of "Jobs_GetCountWithExcludedState" would then no longer be used thus invaliding all my hypothesis. The new procedure or view should be READ COMMITTED or above.

Upvotes: 0

cliffordheath
cliffordheath

Reputation: 2606

Your problem is almost certainly caused by your choice of "READ COMMITTED" isolation level. The behaviour of this depends on your configuration setting for READ_COMMITTED_SNAPSHOT, but either way it allows another transaction thread to modify records that would have been seen by your SELECT, between your SELECT and your UPDATE - so you have a race condition.

Try it again with isolation level "SERIALIZABLE" and see if that fixes your problem. For more information on isolation levels, the documentation is very helpful:

https://msdn.microsoft.com/en-AU/library/ms173763.aspx

Upvotes: 0

user2989845
user2989845

Reputation: 280

I have suggestion review client side and how you handle transaction and connection lifetime for each thread. Because all commands are running on client transaction.

Upvotes: -1

Related Questions