Vivek Vardhan

Reputation: 1178

Cosmos Java SDK throwing exceptions with x-ms-substatus=10002

We are using Cosmos DB with Spring Data:

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-spring-data-cosmos</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-cosmos</artifactId>
            <version>4.8.0</version>
        </dependency>

For some requests (intermittently), we see errors such as:

1. Failed to upsert item; nested exception is CosmosException

2. Failed to find items; nested exception is CosmosException : Item should be available for which the query was made.

3. DocumentProducer - Unexpected failure

{userAgent=azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291, error=null, resourceAddress='null', requestUri='null', statusCode=0, message=null, {"userAgent":"azsdk-java-cosmos/4.8.0 Linux/5.3.0-1035-azure JRE/1.8.0_291","requestLatencyInMs":60073,"requestStartTimeUTC":"2021-06-25T05:31:17.150Z","requestEndTimeUTC":"2021-06-25T05:32:17.223Z",
"connectionMode":"GATEWAY","responseStatisticsList":[],"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{},"regionsContacted":[],"retryContext":{"retryCount":0,"statusAndSubStatusCodes":null,"retryLatency":0},"metadataDiagnosticsContext":{"metadataDiagnosticList":null},
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"2021-06-25T05:31:17.150Z","endTimeUTC":"2021-06-25T05:31:17.150Z","durationInMicroSec":0}]},
"gatewayStatistics":{"sessionToken":null,"operationType":"Upsert","statusCode":0,"subStatusCode":10002,"requestCharge":null,"requestTimeline":null},"systemInformation":{"usedMemory":"264619 KB","availableMemory":"3789909 KB","systemCpuLoad":"(2021-06-25T05:31:50.445Z 2.0%), (2021-06-25T05:31:55.445Z 2.0%), 
(2021-06-25T05:32:00.445Z 2.0%), (2021-06-25T05:32:05.445Z 2.0%), (2021-06-25T05:32:10.445Z 2.0%), (2021-06-25T05:32:15.445Z 2.0%)"},"clientCfgs":{"id":0,"numberOfClients":2,"connCfg":{"rntbd":null,"gw":"(cps:1000, rto:PT5S, icto:PT1M, p:false)","other":"(ed: true, cs: false)"},
"consistencyCfg":"(consistency: null, mm: true, prgns: [southcentralus,westus])"}}, 

causeInfo=[class: class io.netty.handler.timeout.ReadTimeoutException, message: null], responseHeaders={x-ms-substatus=10002}, requestHeaders=[Accept=application/json, x-ms-date=Fri, 25 Jun 2021 05:31:17 GMT, x-ms-documentdb-partitionkey=["xxxxx"], x-ms-documentdb-is-upsert=true, Content-Type=application/json]}

The stack trace does not show much information; I can attach it here if needed. In all these cases, we see only one response header in the log:

responseHeaders={x-ms-substatus=10002}

We have the following questions on this:

Note: We are not using the Gremlin API, and we are connecting in Gateway mode. We see these errors very rarely, perhaps in less than 0.05% of total Cosmos requests.

Upvotes: 2

Views: 1824

Answers (1)

Matias Quaranta

Reputation: 15603

Substatus 10002 is related to HTTP timeouts. Since you are in Gateway mode, your operations execute over HTTP, so seeing this substatus makes sense.

When designing or coding a distributed application, you always need to account for timeouts. The SDK does retry on timeouts (see https://learn.microsoft.com/azure/cosmos-db/troubleshoot-java-sdk-v4-sql#retry-logic-), but applications should always have their own layer of timeout retries, since timeouts can happen for various reasons (network blips, sporadic resource contention).
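As a rough illustration of such an application-level retry layer, here is a minimal sketch in plain Java 8 with no Cosmos dependency. The class name TimeoutRetry, its methods, and the backoff values are hypothetical and not part of the SDK; the caller supplies the predicate that decides which failures count as retryable timeouts.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of azure-cosmos.
public final class TimeoutRetry {
    private TimeoutRetry() {}

    /**
     * Runs the operation up to maxAttempts times in total. A failed attempt
     * is retried only if isRetryable accepts the exception; the wait between
     * attempts starts at baseDelay and doubles each time.
     */
    public static <T> T execute(Callable<T> operation,
                                Predicate<Throwable> isRetryable,
                                int maxAttempts,
                                Duration baseDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = baseDelay.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-retryable failure.
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    /** Returns true if t or any exception in its cause chain is of the given type. */
    public static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }
}
```

With Spring Data, the CosmosException arrives as a nested exception, so a predicate would probably need to walk the cause chain (as hasCause does) and then check CosmosException.getSubStatusCode() for 10002. Retrying is safest for idempotent operations such as reads and upserts of the same document, because a write that timed out on the client may still have been applied by the service.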

You mentioned that you see this less than 0.05% of total requests, which is not bad, but it's worth checking the areas in the guide for Java timeouts: https://learn.microsoft.com/azure/cosmos-db/troubleshoot-request-timeout-java-sdk-v4-sql

Timeouts can also be caused by a service-side latency spike (unlikely, but it can happen), and that is one area where the support team you mentioned might help. From the diagnostics (rto:PT5S in the gw connection config), it looks like you have a request timeout of 5 seconds, so any Gateway request that takes longer than 5 seconds would generate one.
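The 5-second figure appears to come from the gw entry in the diagnostics, "(cps:1000, rto:PT5S, icto:PT1M, p:false)", whose values are ISO-8601 durations. The sketch below decodes that string with java.time.Duration. Reading rto as the request timeout and icto as the idle connection timeout is an assumption based on the answer above, and GatewayConfigString is a hypothetical helper, not an SDK class.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for reading the "gw" string from Cosmos diagnostics.
public final class GatewayConfigString {
    private GatewayConfigString() {}

    /** Splits "(k1:v1, k2:v2)" into an ordered key/value map. */
    public static Map<String, String> parse(String cfg) {
        String body = cfg.trim();
        if (body.startsWith("(") && body.endsWith(")")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            int colon = pair.indexOf(':');
            if (colon > 0) {
                out.put(pair.substring(0, colon).trim(),
                        pair.substring(colon + 1).trim());
            }
        }
        return out;
    }

    /** Parses the ISO-8601 duration stored under the given key, e.g. "rto". */
    public static Duration duration(String cfg, String key) {
        return Duration.parse(parse(cfg).get(key));
    }
}
```

For the string from the question, duration(cfg, "rto") yields 5 seconds and duration(cfg, "icto") yields 60 seconds.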

Upvotes: 2
