Reputation: 11
In our application we are using the Apache ignite 2.12 version and the same is deployed using kubernetes. We use binary object approach and everything works fine but after couple of days we are not able to query the cache and below are the errors/exception.
Any help is appreciated.
Errors in the pod:
Thread [name="query-#1741%ignite-service%", id=2004, state=TIMED_WAITING, blockCnt=0, waitCnt=29]
Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3f30bea2, ownerName=null, ownerId=-1]
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
at [email protected]/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:458)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at [email protected]/java.lang.Thread.run(Thread.java:829)
Thread [name="Connection evictor", id=2021, state=TIMED_WAITING, blockCnt=0, waitCnt=123]
at [email protected]/java.lang.Thread.sleep(Native Method)
at org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
at [email protected]/java.lang.Thread.run(Thread.java:829)
Thread [name="sys-#1773%ignite-service%", id=2048, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4cd18e1b, ownerName=null, ownerId=-1]
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
at [email protected]/java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:458)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at [email protected]/java.lang.Thread.run(Thread.java:829)
2022-11-15 04:22:20.783 WARN [app-service,,] 1 --- [otlist-service%] o.a.i.i.p.cache.CacheDiagnosticManager : Page locks dump:
2022-11-15 04:22:29.834 ERROR [app-service,,] 1 --- [otlist-service%] o.apache.ignite.internal.util.typedef.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=disco-event-worker, threadName=disco-event-worker-#44%ignite-service%, blockedFor=834s]
2022-11-15 04:22:29.837 WARN [app-service,,] 1 --- [otlist-service%] : Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=ignite-service, finished=false, heartbeatTs=1668485315026]]]
org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker, igniteInstanceName=ignite-service, finished=false, heartbeatTs=1668485315026]
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method) ~[na:na]
at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) ~[na:na]
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.IgniteKernal.resetLostPartitions(IgniteKernal.java:3797) ~[ignite-core-2.12.0.jar:2.12.0]
at com.app.dao.ignite.IgniteProvider.lambda$resetLostPartition$3(IgniteProvider.java:267) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
at com.app.dao.ignite.IgniteProvider$$Lambda$2152/0x0000000800eab440.accept(Unknown Source) ~[na:na]
at [email protected]/java.lang.Iterable.forEach(Iterable.java:75) ~[na:na]
at com.app.dao.ignite.IgniteProvider.resetLostPartition(IgniteProvider.java:264) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
at com.app.dao.ignite.IgniteProvider.lambda$getOrStartIgniteNode$37431c07$1(IgniteProvider.java:103) ~[app-core-2022.4.2-RC1.jar:2022.4.2-RC1]
at com.app.dao.ignite.IgniteProvider$$Lambda$1640/0x0000000800bce840.apply(Unknown Source) ~[na:na]
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$UserListenerWrapper.onEvent(GridEventStorageManager.java:1492) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:3074) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3291) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:3094) ~[ignite-core-2.12.0.jar:2.12.0]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[ignite-core-2.12.0.jar:2.12.0]
at [email protected]/java.lang.Thread.run(Thread.java:829) ~[na:na]
Restarting the ignite pods is fixing the issue but the same cannot be done in higher environments.
Upvotes: 1
Views: 257
Reputation: 1660
It looks like the stack trace of the "disco-event-worker" thread contains
org.apache.ignite.internal.IgniteKernal.resetLostPartitions(IgniteKernal.java:3797) ~[ignite-core-2.12.0.jar:2.12.0]
You are trying to reset lost partitions in your lambda com.app.dao.ignite.IgniteProvider.lambda$resetLostPartition$3
which is in turn a local listener registered via IgniteEvents#localListen.
This thread is dedicated to listening to discovery events like EVT_NODE_LEFT
or EVT_NODE_FAILED
. My assumption is that you are trying to reset lost partitions when you observe a node leaving a cluster.
This is not the right way of handling a partition loss. You should configure a proper policy suiting your case and reset via external tool or API externally.
There's no full thread dump but I'm pretty confident you have a deadlock as resetting lost partitions requires a cluster wide operation though discovery which is blocked by the request on the node in question.
Upvotes: 2
Reputation: 52565
It's difficult to be sure what the underlying cause is, but Ignite is blocked for some reason (blockedFor=834s
).
A few ideas:
Upvotes: 0