Reputation: 1644
i have table that processed concurrently by N threads.
CREATE TABLE [dbo].[Jobs]
(
[Id] BIGINT NOT NULL CONSTRAINT [PK_Jobs] PRIMARY KEY IDENTITY,
[Data] VARBINARY(MAX) NOT NULL,
[CreationTimestamp] DATETIME2(7) NOT NULL,
[Type] INT NOT NULL,
[ModificationTimestamp] DATETIME2(7) NOT NULL,
[State] INT NOT NULL,
[RowVersion] ROWVERSION NOT NULL,
[Activity] INT NULL,
[Parent_Id] BIGINT NULL
)
GO
CREATE NONCLUSTERED INDEX [IX_Jobs_Type_State_RowVersion] ON [dbo].[Jobs]([Type], [State], [RowVersion] ASC) WHERE ([State] <> 100)
GO
CREATE NONCLUSTERED INDEX [IX_Jobs_Parent_Id_State] ON [dbo].[Jobs]([Parent_Id], [State] ASC)
GO
Job is adding to table with State=0 (New)
— it can be consumed by any worker in this state. When worker gets this queue item, State
changed to 50 (Processing)
and job becomes unavailable for other consumers (workers call [dbo].[Jobs_GetFirstByType]
with arguments: Type=any, @CurrentState=0, @NewState=50
).
CREATE PROCEDURE [dbo].[Jobs_GetFirstByType]
@Type INT,
@CurrentState INT,
@NewState INT
AS
BEGIN
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
DECLARE @JobId BIGINT;
BEGIN TRAN
SELECT TOP(1)
@JobId = Id
FROM [dbo].[Jobs] WITH (UPDLOCK, READPAST)
WHERE [Type] = @Type AND [State] = @CurrentState
ORDER BY [RowVersion];
UPDATE [dbo].[Jobs]
SET [State] = @NewState,
[ModificationTimestamp] = SYSUTCDATETIME()
OUTPUT INSERTED.[Id]
,INSERTED.[RowVersion]
,INSERTED.[Data]
,INSERTED.[Type]
,INSERTED.[State]
,INSERTED.[Activity]
WHERE [Id] = @JobId;
COMMIT TRAN
END
After processing, job State
can be changed to 0 (New)
again or it can be once set to 100 (Completed)
.
CREATE PROCEDURE [dbo].[Jobs_UpdateStatus]
@Id BIGINT,
@State INT,
@Activity INT
AS
BEGIN
UPDATE j
SET j.[State] = @State,
j.[Activity] = @Activity,
j.[ModificationTimestamp] = SYSUTCDATETIME()
OUTPUT INSERTED.[Id], INSERTED.[RowVersion]
FROM [dbo].[Jobs] j
WHERE j.[Id] = @Id;
END
Jobs has hierarchical structure, parent job gets State=100 (Completed)
only when all childs are completed.
Some workers call stored procedures ([dbo].[Jobs_GetCountWithExcludedState]
with @ExcludedState=100
) that returns number of incompleted jobs, when it returns 0, parent job State
can be set to 100 (Completed)
.
CREATE PROCEDURE [dbo].[Jobs_GetCountWithExcludedState]
@ParentId INT,
@ExcludedState INT
AS
BEGIN
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT COUNT(1)
FROM [dbo].[Jobs]
WHERE [Parent_Id] = @ParentId
AND [State] <> @ExcludedState
END
The main problem is strange behaviour of this stored procedure. Sometimes it returns 0 for parent job, but it exactly has incompleted jobs. I tryied turn on change data tracking and some debug information (including profiling) — child jobs 100% doesn't have State=100
when SP return 0.
It seems that the SP skips records, that are not in 100 (Completed)
state, but why it happen and how we can prevent this?
UPD:
Calling [dbo].[Jobs_GetCountWithExcludedState]
starts when parent job has childs. There сan be no situation when worker starts checking child jobs without their existence, because creating childs and setting to parent job checking activity wrapped in transaction:
using (var ts = new TransactionScope())
{
_jobManager.AddChilds(parentJob);
parentJob.State = 0;
parentJob.Activity = 30; // in this activity worker starts checking child jobs
ts.Complete();
}
Upvotes: 14
Views: 453
Reputation: 6713
It would be very disturbing if in fact your procedure Jobs_GetCountWithExcludedState
was returning a count of 0 records when there were in fact committed records matching your criteria. It's a pretty simple procedure. So there are two possibilities:
Corruption is an unlikely, but possible cause. You can check for corruption with DBCC CHECKDB.
Most likely there really are no committed job records that have a Parent_ID
equal to the @ParentId
parameter and are not in a state of 100 at the time it is run.
I emphasize committed because that's what the transaction will see.
You never really explain in your question how the Parent_ID
gets set on the jobs. My first thought is that maybe you are checking for unprocessed child jobs and it finds none, but then another process adds it as the Parent_ID
of another incomplete job. Is this a possibility?
I see you added an update to show that when you add a child job record that the update of the parent and child records are wrapped in a transaction. This is good, but not the question I was asking. This is the scenario that I am considering as a possibility:
Jobs_GetFirstByType
grabs the parent job.Jobs_UpdateStatus
and updates it's status to 100.Jobs_GetCountWithExcludedState
with the job and returns 0.I'm not saying that this is what is happening... I'm just asking if it's possible and what steps are you taking to prevent it? For example, in your code above in the update to your question you are selecting a ParentJob
to attach the child to outside of the transaction. Could it be that you are selecting a parent job and then it gets completed before you run the transaction that adds the child to the parent? Or maybe the last child job of a parent job completes so the worker thread checks and marks the parent complete, but some other worker thread has already selected the job to be the parent for a new child job?
There are many different scenarios that could cause the symptom you are describing. I believe that the problem is to be found in some code that you have not shared with us yet particularly about how jobs are created and the code surrounding calls to Jobs_GetCountWithExcludedState
. If you can give more information I think you will be more likely to find a usable answer, otherwise the best we can do is guess all the things that could happen in the code we can't see.
Upvotes: 2
Reputation: 2472
Your sql code looks fine. Therefore, the problem lies in how it is used.
Hypothesis #0
Procedure "Jobs_GetCountWithExcludedState" is called with a totally wrong ID. Because yes sometime problem are really just a little mistake. I doubt this is your case however.
Hypothesis #1
The code checking the field "Activity = 30" is doing it in "READ UNCOMMITED" isolation level. It would then call "Jobs_GetCountWithExcludedState" with an parentID that may not be really ready for it because the insertion transaction may not have ended yet or have been rollbacked.
Hypothesis #2
Procedure "Jobs_GetCountWithExcludedState" is called with an id that has no longer child. There could be many reason why this happen.
For example,
Hypothesis #3
Procedure "Jobs_GetCountWithExcludedState" is called before the childJob get their parentId assigned.
Conclusion
As you can see, we need more information on two things :
1. How "Jobs_GetCountWithExcludedState" is called.
2. How job are inserted. Is the parentId assigned at the insertion time or it is updated a bit later? Are they inserted in batch? Is there annexed code to it that do other stuff?
This is also where I recommend you to have a look to verify above hypothesis because the problem is most likely in the program.
Possible refactoring to invalidate all those hypothesis
Have the database tell the application which parents tasks are completed directly instead.
Upvotes: 0
Reputation: 2606
Your problem is almost certainly caused by your choice of "READ COMMITTED" isolation level. The behaviour of this depends on your configuration setting for READ_COMMITTED_SNAPSHOT, but either way it allows another transaction thread to modify records that would have been seen by your SELECT, between your SELECT and your UPDATE - so you have a race condition.
Try it again with isolation level "SERIALIZABLE" and see if that fixes your problem. For more information on isolation levels, the documentation is very helpful:
https://msdn.microsoft.com/en-AU/library/ms173763.aspx
Upvotes: 0
Reputation: 280
I have suggestion review client side and how you handle transaction and connection lifetime for each thread. Because all commands are running on client transaction.
Upvotes: -1