Reputation: 1178
We are using Cosmos DB with Spring Data, with the following dependencies:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-spring-data-cosmos</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-cosmos</artifactId>
    <version>4.8.0</version>
</dependency>
For some requests (intermittently), we see errors such as:
1. Failed to upsert item; nested exception is CosmosException
2. Failed to find items; nested exception is CosmosException (the item the query was made for should be available).
3. DocumentProducer - Unexpected failure
{userAgent=azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291, error=null, resourceAddress='null', requestUri='null', statusCode=0, message=null, {"userAgent":"azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291","requestLatencyInMs":60073,"requestStartTimeUTC":"2021-06-25T05:31:17.150Z","requestEndTimeUTC":"2021-06-25T05:32:17.223Z",
"connectionMode":"GATEWAY","responseStatisticsList":[],"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{},"regionsContacted":[],"retryContext":{"retryCount":0,"statusAndSubStatusCodes":null,"retryLatency":0},"metadataDiagnosticsContext":{"metadataDiagnosticList":null},
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0}]},
"gatewayStatistics":{"sessionToken":null,"operationType":"Upsert","statusCode":0,"subStatusCode":10002,"requestCharge":null,"requestTimeline":null},"systemInformation":{"usedMemory":"264619 KB","availableMemory":"3789909 KB","systemCpuLoad":"(2021-06-25T05:31:50.445Z 2.0%), (2021-06-25T05:31:55.445Z 2.0%),
(2021-06-25T05:32:00.445Z 2.0%), (2021-06-25T05:32:05.445Z 2.0%), (2021-06-25T05:32:10.445Z 2.0%), (2021-06-25T05:32:15.445Z 2.0%)"},"clientCfgs":{"id":0,"numberOfClients":2,"connCfg":{"rntbd":null,"gw":"(cps:1000, rto:PT5S, icto:PT1M, p:false)","other":"(ed: true, cs: false)"},
"consistencyCfg":"(consistency: null, mm: true, prgns: [southcentralus,westus])"}},
causeInfo=[class: class io.netty.handler.timeout.ReadTimeoutException, message: null], responseHeaders={x-ms-substatus=10002}, requestHeaders=[Accept=application/json, x-ms-date=Fri, 25 Jun 2021 05:31:17 GMT, x-ms-documentdb-partitionkey=["xxxxx"], x-ms-documentdb-is-upsert=true, Content-Type=application/json]}
The stack trace does not show much information; I can attach it here if needed. In all these cases, we see only one response header in the log:
responseHeaders={x-ms-substatus=10002}
We have the following questions:
What does this status mean? Our Cosmos support team is asking for the activity ID, which we cannot find in the logs for these errors. How can we log the activityId in these cases as well? I can't use ResponseDiagnosticsProcessor because it requires another Spring Cosmos dependency.
What are the possible reasons for these errors?
Why does the stack trace for all of these errors also show the following?
io.netty.handler.timeout.ReadTimeoutException
How can we debug these errors?
Note: We are not using the Gremlin API, and we connect in Gateway mode. We see these errors very rarely, in perhaps fewer than 0.05% of all Cosmos requests.
Upvotes: 2
Views: 1824
Reputation: 15603
Substatus 10002 is related to HTTP timeouts. Since you are in Gateway mode, your operations execute over HTTP, so this is consistent with your setup.
When designing or coding a distributed application, you always need to account for timeouts. The SDK does retry on timeouts (see https://learn.microsoft.com/azure/cosmos-db/troubleshoot-java-sdk-v4-sql#retry-logic-), but applications should always have their own layer of retries for timeouts, since they can happen for various reasons (network blips, sporadic resource contention).
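As a rough illustration of what an application-level retry layer could look like, here is a minimal sketch in plain Java. The class `TimeoutRetry`, the method `callWithRetry`, and its parameters are hypothetical names made up for this example; they are not part of the Cosmos SDK or Spring Data. The caller supplies the predicate that decides which exceptions count as transient (for example, by inspecting the cause chain for a timeout), and the delay doubles after each failed attempt.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper (not an SDK class): retries a call when the failure looks transient. */
final class TimeoutRetry {

    private TimeoutRetry() {}

    /**
     * Runs the operation up to maxAttempts times.
     * Retries only when isRetryable accepts the exception; otherwise rethrows immediately.
     * The wait between attempts starts at initialDelay and doubles each time.
     */
    static <T> T callWithRetry(Callable<T> operation,
                               Predicate<Exception> isRetryable,
                               int maxAttempts,
                               Duration initialDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelay.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-transient failure.
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // simple exponential backoff, no jitter
            }
        }
    }
}
```

A repository call could then be wrapped as `TimeoutRetry.callWithRetry(() -> repository.save(item), e -> looksLikeTimeout(e), 3, Duration.ofMillis(200))`, where `repository`, `item`, and `looksLikeTimeout` are placeholders for your own code. Keep in mind that retrying a write after a timeout is only safe when the operation is idempotent, because the timed-out request may still have been applied on the service side.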
You mentioned that you see this less than 0.05% of total requests, which is not bad, but it's worth checking the areas in the guide for Java timeouts: https://learn.microsoft.com/azure/cosmos-db/troubleshoot-request-timeout-java-sdk-v4-sql
Timeouts can also be caused by a service-side latency spike (unlikely, but possible), and this is one area where the support team you mentioned might help. From the diagnostics, it looks like you have a request timeout of 5 seconds, so any Gateway request that takes longer than 5 seconds would generate a timeout.
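The 5-second figure comes from the `gw` entry in the diagnostics, `(cps:1000, rto:PT5S, icto:PT1M, p:false)`, where `rto` is an ISO-8601 duration. The sketch below shows one way to pull that value out of a logged diagnostics string so it can be compared with `requestLatencyInMs`. The class `GatewayConfigParser` is a hypothetical helper written for this example, and it assumes the `(key:value, key:value)` layout seen in the log above, which is not a documented format and may change between SDK versions.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses the "gw" summary string found in Cosmos diagnostics logs. */
final class GatewayConfigParser {

    private GatewayConfigParser() {}

    /** Splits "(k:v, k:v)" into an ordered map, splitting each pair on its first ':' only. */
    static Map<String, String> parse(String gw) {
        String body = gw.trim();
        if (body.startsWith("(") && body.endsWith(")")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                continue; // skip fragments that are not key:value
            }
            out.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return out;
    }

    /** Reads "rto" as an ISO-8601 duration; throws NullPointerException if the key is absent. */
    static Duration requestTimeout(String gw) {
        return Duration.parse(parse(gw).get("rto"));
    }
}
```

For the string from the question, `requestTimeout` yields a `Duration` of 5 seconds, which can be compared against `Duration.ofMillis(60073)`, the `requestLatencyInMs` value from the same log entry.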
Upvotes: 2