Aditya Mavinasara

Reputation: 11

Apache Ignite 2.12 fails to query after a couple of days

In our application we are using Apache Ignite 2.12, deployed with Kubernetes. We use the binary object approach and everything works fine, but after a couple of days we are no longer able to query the cache; the errors/exceptions are below.

Any help is appreciated.

Errors in the pod:

Thread [name="query-#1741%ignite-service%", id=2004, state=TIMED_WAITING, blockCnt=0, waitCnt=29]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3f30bea2, ownerName=null, ownerId=-1]
        at java.base@.../jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@.../java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
        at java.base@.../java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
        at java.base@.../java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:458)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base@.../java.lang.Thread.run(Thread.java:829)

Thread [name="Connection evictor", id=2021, state=TIMED_WAITING, blockCnt=0, waitCnt=123]
        at java.base@.../java.lang.Thread.sleep(Native Method)
        at org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
        at java.base@.../java.lang.Thread.run(Thread.java:829)

Thread [name="sys-#1773%ignite-service%", id=2048, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4cd18e1b, ownerName=null, ownerId=-1]
        at java.base@.../jdk.internal.misc.Unsafe.park(Native Method)
        at java.base@.../java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
        at java.base@.../java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
        at java.base@.../java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:458)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
        at java.base@.../java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base@.../java.lang.Thread.run(Thread.java:829)

2022-11-15 04:22:20.783  WARN [app-service,,] 1 --- [otlist-service%] o.a.i.i.p.cache.CacheDiagnosticManager   : Page locks dump:

2022-11-15 04:22:29.834 ERROR [app-service,,] 1 --- [otlist-service%] o.apache.ignite.internal.util.typedef.G  : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=disco-event-worker, threadName=disco-event-worker-#44%ignite-service%, blockedFor=834s]
2022-11-15 04:22:29.837  WARN [app-service,,] 1 --- [otlist-service%]                                          : Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=ignite-service, finished=false, heartbeatTs=1668485315026]]]

org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=ignite-service, finished=false, heartbeatTs=1668485315026]
    at java.base@.../jdk.internal.misc.Unsafe.park(Native Method) ~[na:na]
    at java.base@.../java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) ~[na:na]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.IgniteKernal.resetLostPartitions(IgniteKernal.java:3797) ~[ignite-core-2.12.0.jar:2.12.0]
    at com.app.dao.ignite.IgniteProvider.lambda$resetLostPartition$3(IgniteProvider.java:267) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
    at com.app.dao.ignite.IgniteProvider$$Lambda$2152/0x0000000800eab440.accept(Unknown Source) ~[na:na]
    at java.base@.../java.lang.Iterable.forEach(Iterable.java:75) ~[na:na]
    at com.app.dao.ignite.IgniteProvider.resetLostPartition(IgniteProvider.java:264) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
    at com.app.dao.ignite.IgniteProvider.lambda$getOrStartIgniteNode$37431c07$1(IgniteProvider.java:103) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
    at com.app.dao.ignite.IgniteProvider$$Lambda$1640/0x0000000800bce840.apply(Unknown Source) ~[na:na]
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$UserListenerWrapper.onEvent(GridEventStorageManager.java:1492) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:3074) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3291) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:3094) ~[ignite-core-2.12.0.jar:2.12.0]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[ignite-core-2.12.0.jar:2.12.0]
    at java.base@.../java.lang.Thread.run(Thread.java:829) ~[na:na]

Restarting the Ignite pods fixes the issue, but that is not an option in higher environments.

Upvotes: 1

Views: 257

Answers (2)

Vladimir Pligin
Vladimir Pligin

Reputation: 1660

It looks like the stack trace of the "disco-event-worker" thread contains

org.apache.ignite.internal.IgniteKernal.resetLostPartitions(IgniteKernal.java:3797) ~[ignite-core-2.12.0.jar:2.12.0]

You are trying to reset lost partitions in your lambda com.app.dao.ignite.IgniteProvider.lambda$resetLostPartition$3, which is in turn invoked from a local listener registered via IgniteEvents#localListen.

This thread is dedicated to listening to discovery events like EVT_NODE_LEFT or EVT_NODE_FAILED. My assumption is that you are trying to reset lost partitions when you observe a node leaving a cluster.
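
In code, that pattern would look roughly like the sketch below (a hypothetical reconstruction, not your actual code; the class, method, and cache names are made up):

import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

class LostPartitionListener {
    // Hypothetical reconstruction of the pattern described above.
    static void register(Ignite ignite) {
        IgnitePredicate<DiscoveryEvent> lsnr = evt -> {
            // This body runs on the disco-event-worker thread; resetLostPartitions()
            // waits on a cluster-wide operation that itself goes through discovery.
            ignite.resetLostPartitions(Collections.singleton("myCache")); // cache name is a placeholder
            return true; // keep the listener registered
        };

        ignite.events().localListen(lsnr, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}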

This is not the right way to handle partition loss. You should configure a partition loss policy that suits your case and reset the lost partitions externally, via the control script or the API.
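
A minimal sketch of that approach (the cache name, the READ_WRITE_SAFE policy choice, and the helper class are assumptions, not taken from your setup):

import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

class PartitionLossSetup {
    // Declare up front how the cache should behave when partitions are lost.
    static CacheConfiguration<Integer, String> cacheConfig() {
        return new CacheConfiguration<Integer, String>("myCache") // cache name is a placeholder
            .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);
    }

    // Reset the lost state later from an admin path that is NOT a discovery
    // listener (a management endpoint, a scheduled job, an external client),
    // or use the control script: control.sh --cache reset_lost_partitions myCache
    static void resetAfterRecovery(Ignite ignite) {
        ignite.resetLostPartitions(Collections.singleton("myCache"));
    }
}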

There's no full thread dump, but I'm pretty confident you have a deadlock: resetting lost partitions requires a cluster-wide operation that goes through discovery, and discovery is blocked by this very request on the node in question.

Upvotes: 2

Stephen Darlington

Reputation: 52565

It's difficult to be sure what the underlying cause is, but Ignite is blocked for some reason (blockedFor=834s).

A few ideas:

  • If you see "Long JVM pause" messages in the logs, you likely need to tune your garbage collector
  • If it slows down as memory usage increases, it's likely that your memory configuration is incorrect, i.e., your in-memory database is starting to swap to disk (see the data-region sketch after this list)
  • If you also see the striped pool blocked and you have enabled persistence, your disks may not be fast enough
  • You may not be allocating enough resources to Ignite. Do you have resource requests in your stateful set configuration?
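
On the memory point, a minimal sketch of sizing the default data region explicitly (the sizes are placeholders, not a recommendation):

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

class MemorySetup {
    // Size the default data region explicitly so that, together with the JVM heap,
    // it fits inside the pod's memory limit (values are placeholders).
    static IgniteConfiguration withExplicitRegion() {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("Default_Region")
            .setInitialSize(1L * 1024 * 1024 * 1024)  // 1 GB
            .setMaxSize(4L * 1024 * 1024 * 1024);     // 4 GB off-heap

        return new IgniteConfiguration()
            .setDataStorageConfiguration(
                new DataStorageConfiguration().setDefaultDataRegionConfiguration(region));
    }
}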

Upvotes: 0
