gregvonbautt
gregvonbautt

Reputation: 93

Extremely high latencies on Hazelcast distributed map get operations

We use a distributed Hazelcast map in our project, and have recently run into a problem with extremely high get latencies. We used IMap.get(...) calls, which in some cases took hours to complete. After this incident, we switched to IMap.getAsync(...) API with timeouts, which resolves the problem for us, but I am curious if anybody had a similar issue.

Our Hazelcast version is 3.9.0. During the incident we had hazelcast.operation.call.timeout.millis set to 5000, and async-backup-count="3" with read-backup-data="true". We also had sporadic CPU usage spikes (up to 100% for several minutes) on some hosts due to unrelated background processing, which could potentially affect Hazelcast.

The only suspicious thing that we found in the logs was that around the incident all hosts were complaing about one particular host, like this:

Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler
WARNING: [hostX]:5702 [dev] [3.9] No Invocation found for call timeout response with callId739863 sent from [hostY]:5702
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler
WARNING: [hostX]:5702 [dev] [3.9] No Invocation found for call timeout response with callId739864 sent from [hostY]:5702
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler
WARNING: [hostX]:5702 [dev] [3.9] No Invocation found for call timeout response with callId739852 sent from [hostY]:5702
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler
WARNING: [hostX]:5702 [dev] [3.9] No Invocation found for call timeout response with callId739870 sent from [hostY]:5702
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler
WARNING: [hostX]:5702 [dev] [3.9] No Invocation found for call timeout response with callId739874 sent from [hostY]:5702
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InboundResponseHandler

In the hostY's logs:

Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor
WARNING: [hostY]:5702 [dev] [3.9] MonitorInvocationsTask delayed 14294 ms
Oct 24, 2018 3:53:01 PM com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor
WARNING: [hostY]:5702 [dev] [3.9] BroadcastOperationControlTask delayed 13544 ms

Any ideas?

Upvotes: 3

Views: 3982

Answers (1)

mdogan
mdogan

Reputation: 1949

From the hostY's logs, it seems hostY suffering from GC pauses. MonitorInvocationsTask is scheduled to run every seconds but it says its execution is delayed for 14 seconds. And BroadcastOperationControlTask should be scheduled nearly every second due to your config (hazelcast.operation.call.timeout.millis / 4 = 1250 ms) but similarly it's delayed for 13 seconds.

You can verify this by enabling GC logs. Additionally Hazelcast should print HealthMonitor logs periodically when memory and/or CPU usage exceeds some threshold.

Upvotes: 3

Related Questions