Reputation: 543
It's similar to this question Albeit I'm running the executable jar locally (with java -jar command). I use 180G for the jvm heap max size and the exception was thrown at about 110G of usage after about 37 minutes
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id someId timed out.
In the question above it is mentioned "maybe the membory of JobManager is too small". How is this translated to local execution? Is it just memory that I configure with Xmx like I wrote above (180G)?
I read somewhere else that I can configure the heartbeat interval of the taskmanager? How do I do that in local execution?
(EDIT) Where can I find the logs of the task manager in local execution?
I'm using flink1.9 by the way
EDITED logs:
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.s.SlotManagerImpl - Received slot report from instance 899db08270ab48abff5e3111979c6f07: SlotReport{slotsStatus=[SlotStatus{slotID=0895b99a-27b3-43da-8226-16031db924de_0, resourceProfile=ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=165639}, allocationID=8b473e13eadfa5d9b8e34e9020536b5e, jobID=6109483ea411f1c099d0bb18a5995742}]}.
18:29:44.070 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:29:44.070 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - The heartbeat of JobManager with id 966e2f0da4aa05748d6c7921fb02819e timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.r.j.slotpool.SlotPoolImpl - Slot Pool Status:
status: connected to akka://flink/user/resourcemanager
registered TaskManagers: [0895b99a-27b3-43da-8226-16031db924de]
available slots: []
allocated slots: [[AllocatedSlot 8b473e13eadfa5d9b8e34e9020536b5e @ 0895b99a-27b3-43da-8226-16031db924de @ localhost (dataPort=-1) - 0]]
pending requests: []
}
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - Disconnect job manager afeaa9d79956984a8a830ef4007a4315@akka://flink/user/jobmanager_1 for job 6109483ea411f1c099d0bb18a5995742 from the resource manager.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO o.a.f.r.r.StandaloneResourceManager - Closing TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de because: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.s.SlotManagerImpl - Unregister TaskManager 899db08270ab48abff5e3111979c6f07 from the SlotManager.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Disconnect TaskExecutor 0895b99a-27b3-43da-8226-16031db924de because: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received slot report from TaskManager 0895b99a-27b3-43da-8226-16031db924de which is no longer registered.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Close ResourceManager connection 2ac3bc91e5eb2b7e41baf47d97764652.
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1144)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO o.a.f.r.e.ExecutionGraph - Map (1/1) (f6b46e7b79dc5c196c14072ff765354f) switched from RUNNING to FAILED.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - No open TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de. Ignoring close TaskExecutor connection. Closing reason was: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.072 [flink-akka.actor.default-dispatcher-536] INFO o.a.f.r.taskexecutor.TaskExecutor - Connecting to ResourceManager akka://flink/user/resourcemanager(bcfecd2d9083fd495b6f80b0a30e4507).
18:29:44.072 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.rpc.akka.AkkaRpcService - Try to connect to remote RPC endpoint with address akka://flink/user/resourcemanager. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO o.a.f.r.e.ExecutionGraph - Job Flink Streaming Job (6109483ea411f1c099d0bb18a5995742) switched from state RUNNING to FAILING.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Upvotes: 3
Views: 13252