Reputation: 27
Since Nov 08 2022, 16:00 UTC, we sometimes get the following DatastoreException with code UNAVAILABLE and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all Keys of entities of a certain kind. These are returned in batches, together with a new cursor. When I use the cursor to get the next batch, the error above occurs. I would not expect the query to time out so quickly. (It might take up to a few seconds until I request the next batch of Keys using the returned cursor, but this never used to be a problem in the past.)
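For reference, my batching roughly looks like the following (a minimal sketch using the Java client library; the kind name "ExampleKind" and the batch size of 500 are placeholders, not my real values):

```java
import com.google.cloud.datastore.Cursor;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.KeyQuery;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;

public class KeyBatchScan {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        Cursor cursor = null;
        while (true) {
            // Keys-only query; kind name and batch size are placeholders.
            KeyQuery.Builder builder = Query.newKeyQueryBuilder()
                    .setKind("ExampleKind")
                    .setLimit(500);
            if (cursor != null) {
                builder.setStartCursor(cursor);
            }
            QueryResults<Key> results = datastore.run(builder.build());
            int fetched = 0;
            while (results.hasNext()) {
                Key key = results.next();
                fetched++;
                // ... work with the key; processing a batch can take a few seconds ...
            }
            if (fetched == 0) {
                break; // no more keys of this kind
            }
            // Use the cursor after the last result to request the next batch.
            cursor = results.getCursorAfter();
        }
    }
}
```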
There was no problem before the automatic upgrade to Firestore. Counting the entities of a kind also often fails with a DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks Stefan
Upvotes: 1
Views: 1164
Reputation: 119
I just wanted to follow up here, since we were able to do a detailed analysis and come up with a workaround. I want to record our findings for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:

Kind: ExampleKind

Data:

| Key | lastUpdatedMillis |
|---|---|
| ExampleKind/1040 | 5 |
| ExampleKind/1052 | 0 |
| ExampleKind/1064 | 12 |
| ExampleKind/1065 | 100 |
| ExampleKind/1070 | 42 |
Datastore will automatically generate both an ASC and a DESC index on the lastUpdatedMillis field. The lastUpdatedMillis ASC index table would have the following logical entries:
| Index Key | Entity Key |
|---|---|
| 0 | ExampleKind/1052 |
| 5 | ExampleKind/1040 |
| 12 | ExampleKind/1064 |
| 42 | ExampleKind/1070 |
| 100 | ExampleKind/1065 |
In the workload you've described, there was an operation that did the following:
SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
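With the Java client library, such a scan looks roughly like the following (a sketch under the assumption that you use that client; the kind and property names come from the example schema above):

```java
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.EntityQuery;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;

public class FullRangeScan {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        long nowMillis = System.currentTimeMillis();

        // Scans the whole "lastUpdatedMillis <= now" range of the ASC index in one query.
        EntityQuery query = Query.newEntityQueryBuilder()
                .setKind("ExampleKind")
                .setFilter(PropertyFilter.le("lastUpdatedMillis", nowMillis))
                .build();

        QueryResults<Entity> results = datastore.run(query);
        while (results.hasNext()) {
            Entity entity = results.next();
            // ... process (e.g. update or delete) the matching entity ...
        }
    }
}
```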
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:
| Index Key | Entity Key |
|---|---|
| x | xxxx |
| x | xxxx |
| x | xxxx |
| 42 | ExampleKind/1070 |
| ... | and so on ... |
| x | xxxx |
When we repeat the query over this data, if the number of deleted rows is very large (on the order of 100,000 to 1,000,000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the garbage collection and compaction mechanisms will remove the deleted rows, and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range of the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break up the query into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60-minute ranges, but the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results; in both scenarios, however, they will complete within the RPC deadline.
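Concretely, a chunked scan of this sort could look like the following with the Java client library (a sketch only; the 60-minute chunk size, the 7-day lower bound, and the kind/property names are taken from the example above and should be adapted to your data):

```java
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.EntityQuery;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.CompositeFilter;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;
import java.util.concurrent.TimeUnit;

public class ChunkedRangeScan {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        long chunkMillis = TimeUnit.MINUTES.toMillis(60); // tune the chunk size to your data
        long now = System.currentTimeMillis();
        long oldest = now - TimeUnit.DAYS.toMillis(7);    // oldest lastUpdatedMillis you care about (placeholder)

        // Walk backwards over [oldest, now) in fixed-size chunks instead of one huge range scan.
        for (long upper = now; upper > oldest; upper -= chunkMillis) {
            long lower = Math.max(upper - chunkMillis, oldest);

            EntityQuery query = Query.newEntityQueryBuilder()
                    .setKind("ExampleKind")
                    .setFilter(CompositeFilter.and(
                            PropertyFilter.ge("lastUpdatedMillis", lower),
                            PropertyFilter.lt("lastUpdatedMillis", upper)))
                    .build();

            QueryResults<Entity> results = datastore.run(query);
            while (results.hasNext()) {
                Entity entity = results.next();
                // ... process the entity; an empty chunk simply returns no results ...
            }
        }
    }
}
```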
Thank you again for reaching out about this!
Upvotes: 2