nlawalker

Querying for changed documents in DocumentDb

Note: I asked a very similar question previously, but I was not clear enough about exactly what I was looking for, and I accepted an answer too hastily. I am looking for a confirmed yes/no on a specific point.

I want to build an automated job that performs offline processing on DocumentDb documents by querying the DocumentDb on a schedule, looking for documents that have changed since the last time the check was performed.

Given the metadata available in DocumentDb, it looks like the way to do this would be the following:

My question is: is this guaranteed to work? Is it guaranteed that this will not miss any documents? As far as I can tell, it comes down to the transactional semantics around _ts within DocumentDb's implementation, which are not documented to this level of detail. Specifically, suppose a query returns the most recently changed document in the collection. Is it guaranteed that no document can subsequently be updated with a _ts value lower than the largest _ts value that query returned?

EDIT, prompted by David's comment:

To be a little more precise, with a couple of specific scenarios:

  1. If updates for two documents, D0 and D1, are applied to the database at T0 and T1 (where T1 > T0, such that an arbitrary query may return D0 but not D1), is it possible that D0._ts > D1._ts? The use of strictly-greater-than is intentional, as my proposed implementation deals with multiple updates receiving the same _ts but only some of them being retrieved by a query.
  2. Assume I execute my implementation's query at time T0, and the query takes a long time to run and/or requires several ExecuteNextAsync() calls to pull multiple batches from the server. During that period, two different documents (D1 and D2) are updated, getting _ts values of T1 and T2 (where T1 < T2). Is it possible for D2 to appear in the result set? More importantly, if it does, is D1 guaranteed to be included?
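
To make the concern concrete, here is a minimal sketch of a high-water-mark poll over _ts. It is an illustration under stated assumptions, not DocumentDb SDK code: the collection is modeled as an in-memory list of dicts, `poll_changes` is a hypothetical helper name, and using `_etag` to tell apart documents that share the boundary `_ts` is my own assumption about how ties might be handled.

```python
from typing import Dict, List, Set, Tuple

Doc = Dict[str, object]
Key = Tuple[str, str]  # (id, _etag)


def poll_changes(
    docs: List[Doc], last_ts: int, seen_at_last_ts: Set[Key]
) -> Tuple[List[Doc], int, Set[Key]]:
    """Return (changed docs, new high-water mark, keys seen at that mark).

    Hypothetical helper: `docs` stands in for the result of a query such as
    "all documents with _ts >= last_ts" against the real collection.
    """
    # Use >= rather than > so that documents sharing the boundary _ts
    # are re-read on the next poll instead of being skipped.
    candidates = [d for d in docs if d["_ts"] >= last_ts]
    if not candidates:
        return [], last_ts, seen_at_last_ts

    # A candidate is new if it is past the mark, or sits on the mark
    # but was not returned by the previous poll.
    changed = [
        d
        for d in candidates
        if d["_ts"] > last_ts or (d["id"], d["_etag"]) not in seen_at_last_ts
    ]

    new_last_ts = max(d["_ts"] for d in candidates)
    new_seen = {
        (d["id"], d["_etag"]) for d in candidates if d["_ts"] == new_last_ts
    }
    return changed, new_last_ts, new_seen
```

With this shape, a second document that lands on the same _ts as the current mark is still picked up by the next poll. A document that becomes visible later with a _ts below the mark is never returned, which is exactly the case the two scenarios above ask about.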

Answers (1)

Larry Maccherone

With the default consistency level, this is not guaranteed to work, because a document with a lower _ts can show up later. However, if you can guarantee that your update requests are far enough apart (say, 60 seconds), then the risk is very low.

I don't think David's edge case is a worry so long as you treat every document with a higher _ts as new.

You might also want to consider an append-only approach using something like Richard Snodgrass' temporal model. That makes the idempotency semantics easier to reason about.
