Apache Spark, when run against a master, continuously throws errors

I am using Apache Spark in cluster mode with a master node and 3 slaves (all 4 machines are separate VMs on the same LAN). After configuring the cluster, I am able to see my master node and worker nodes in the Spark web UI.

I am using Python 2.7 and Spark 1.4.1.

[Screenshot: Spark web UI showing the master and worker nodes]

The real problem is that when I try to run Spark against the master (in this case using PySpark), it continuously logs errors in the Python console. I captured the log but could not find any clue in it.
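For what it's worth, the --master option to bin/pyspark should be equivalent to setting the MASTER environment variable used below (same master URL assumed):

bin/pyspark --master spark://192.168.1.81:7077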

I am pasting my logs here for reference:

ubuntu@MyCareerVM1:/usr/local/spark$ MASTER=spark://192.168.1.81:7077 bin/pyspark
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/03/29 09:16:48 INFO SparkContext: Running Spark version 1.4.1
16/03/29 09:16:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/29 09:16:49 INFO SecurityManager: Changing view acls to: ubuntu
16/03/29 09:16:49 INFO SecurityManager: Changing modify acls to: ubuntu
16/03/29 09:16:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/03/29 09:16:49 INFO Slf4jLogger: Slf4jLogger started
16/03/29 09:16:50 INFO Remoting: Starting remoting
16/03/29 09:16:50 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:34901]
16/03/29 09:16:50 INFO Utils: Successfully started service 'sparkDriver' on port 34901.
16/03/29 09:16:50 INFO SparkEnv: Registering MapOutputTracker
16/03/29 09:16:50 INFO SparkEnv: Registering BlockManagerMaster
16/03/29 09:16:50 INFO DiskBlockManager: Created local directory at /tmp/spark-a77016c9-a9ae-49c5-908f-fc540dc7d3ff/blockmgr-a9e868af-4253-4230-9227-948fbb8a0d91
16/03/29 09:16:50 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
16/03/29 09:16:50 INFO HttpFileServer: HTTP File server directory is /tmp/spark-a77016c9-a9ae-49c5-908f-fc540dc7d3ff/httpd-a78e633c-0ae7-46cf-81e8-776d8f7c3c46
16/03/29 09:16:50 INFO HttpServer: Starting HTTP Server
16/03/29 09:16:50 INFO Utils: Successfully started service 'HTTP file server' on port 34364.
16/03/29 09:16:50 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/29 09:16:50 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/29 09:16:50 INFO SparkUI: Started SparkUI at http://173.220.132.82:4040
16/03/29 09:16:50 INFO AppClient$ClientActor: Connecting to master akka.tcp://[email protected]:7077/user/Master...
16/03/29 09:16:51 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160329091651-0006
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor added: app-20160329091651-0006/0 on worker-20160329072744-192.168.1.84-45492 (192.168.1.84:45492) with 6 cores
16/03/29 09:16:51 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160329091651-0006/0 on hostPort 192.168.1.84:45492 with 6 cores, 512.0 MB RAM
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor added: app-20160329091651-0006/1 on worker-20160329072744-192.168.1.82-45482 (192.168.1.82:45482) with 6 cores
16/03/29 09:16:51 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160329091651-0006/1 on hostPort 192.168.1.82:45482 with 6 cores, 512.0 MB RAM
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor added: app-20160329091651-0006/2 on worker-20160329072746-192.168.1.83-38065 (192.168.1.83:38065) with 6 cores
16/03/29 09:16:51 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160329091651-0006/2 on hostPort 192.168.1.83:38065 with 6 cores, 512.0 MB RAM
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/2 is now LOADING
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/1 is now LOADING
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/0 is now LOADING
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/0 is now RUNNING
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/1 is now RUNNING
16/03/29 09:16:51 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/2 is now RUNNING
16/03/29 09:16:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42458.
16/03/29 09:16:51 INFO NettyBlockTransferService: Server created on 42458
16/03/29 09:16:51 INFO BlockManagerMaster: Trying to register BlockManager
16/03/29 09:16:51 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.81:42458 with 265.4 MB RAM, BlockManagerId(driver, 192.168.1.81, 42458)
16/03/29 09:16:51 INFO BlockManagerMaster: Registered BlockManager
16/03/29 09:16:51 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkContext available as sc, HiveContext available as sqlContext.
>>> 16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/0 is now EXITED (Command exited with code 1)
16/03/29 09:16:53 INFO SparkDeploySchedulerBackend: Executor app-20160329091651-0006/0 removed: Command exited with code 1
16/03/29 09:16:53 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor added: app-20160329091651-0006/3 on worker-20160329072744-192.168.1.84-45492 (192.168.1.84:45492) with 6 cores
16/03/29 09:16:53 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160329091651-0006/3 on hostPort 192.168.1.84:45492 with 6 cores, 512.0 MB RAM
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/2 is now EXITED (Command exited with code 1)
16/03/29 09:16:53 INFO SparkDeploySchedulerBackend: Executor app-20160329091651-0006/2 removed: Command exited with code 1
16/03/29 09:16:53 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor added: app-20160329091651-0006/4 on worker-20160329072746-192.168.1.83-38065 (192.168.1.83:38065) with 6 cores
16/03/29 09:16:53 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160329091651-0006/4 on hostPort 192.168.1.83:38065 with 6 cores, 512.0 MB RAM
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/3 is now LOADING
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/4 is now LOADING
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/3 is now RUNNING
16/03/29 09:16:53 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/4 is now RUNNING
16/03/29 09:16:54 INFO AppClient$ClientActor: Executor updated: app-20160329091651-0006/1 is now EXITED (Command exited with code 1)
16/03/29 09:16:54 INFO SparkDeploySchedulerBackend: Executor app-20160329091651-0006/1 removed: Command exited with code 1
16/03/29 09:16:54 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

I also checked the slave logs under /usr/local/spark/work/ (the log directory layout is sketched after the trace below) and found the following:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/30 01:50:18 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/30 01:50:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/30 01:50:19 INFO SecurityManager: Changing view acls to: ubuntu
16/03/30 01:50:19 INFO SecurityManager: Changing modify acls to: ubuntu
16/03/30 01:50:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/03/30 01:50:20 INFO Slf4jLogger: Slf4jLogger started
16/03/30 01:50:20 INFO Remoting: Starting remoting
16/03/30 01:50:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:38333]
16/03/30 01:50:20 INFO Utils: Successfully started service 'driverPropsFetcher' on port 38333.
16/03/30 01:50:20 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:34047] has failed, address is now gated for [5000] ms. Reason is: [Association failed$
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://[email protected]:34047/), Path(/user/CoarseGrainedScheduler)]
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
        at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
        at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
        at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
        at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
        at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
        at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
        at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
        at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
        at akka.actor.ActorCell.terminate(ActorCell.scala:369)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16/03/30 01:50:20 INFO Utils: Shutdown hook called
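
For reference, the standalone worker keeps one folder per application and executor under its work directory, so (assuming the default layout) the trace above came from files like:

/usr/local/spark/work/app-20160329091651-0006/0/stderr
/usr/local/spark/work/app-20160329091651-0006/0/stdout

A quick way to scan all of them on a slave:

tail -n 50 /usr/local/spark/work/app-*/*/stderr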

Any quick suggestions would be much appreciated.

Upvotes: 0

Views: 743

Answers (2)

I was finally able to solve this. It was happening because of iptables rules on the machines.

I solved it by running the command below to allow a specific port:

sudo iptables -I INPUT 1 -p tcp --dport 5678 -j ACCEPT

Since Spark picks random ports, you can instead accept traffic on all ports:

sudo iptables -I INPUT -j ACCEPT
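Accepting everything is a blunt instrument. A tighter alternative, sketched here with arbitrarily chosen port numbers, is to pin the ports Spark would otherwise randomize in conf/spark-defaults.conf on the driver machine and open only those (plus the master port) in iptables:

# conf/spark-defaults.conf (example ports; these spark.*.port settings exist in Spark 1.x)
spark.driver.port        51000
spark.fileserver.port    51100
spark.blockManager.port  51200

# allow only those ports plus the master port 7077
sudo iptables -I INPUT -p tcp --dport 7077 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 51000 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 51100 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 51200 -j ACCEPT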

Thanks, Pavan (http://pavanarya.wordpress.com)

Upvotes: 1

Abhishek Anand

I have faced the exact same issue in our Spark cluster.

This issue usually happens when you run a very resource-intensive app on Spark and then try to run another app after it.

The thing is, one of the executors becomes unresponsive after the heavy previous run, but Spark keeps trying to use it. I am not sure why Spark keeps doing this in standalone and cluster modes, because it does not happen in YARN mode. Ideally, Spark is built to be robust enough to keep a job running even if one executor/worker fails.

To fix this, restart the workers and executors that have become unresponsive, and your app will start running. If you cannot tell which executor it is, just restart all of them along with the master; see the sketch below.
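
Assuming the stock standalone scripts shipped with Spark (run from the master, with the workers listed in conf/slaves), a restart looks roughly like this:

cd /usr/local/spark
# stop the master and all workers, then bring them back up
sbin/stop-all.sh
sbin/start-all.sh

# or restart only the workers
sbin/stop-slaves.sh
sbin/start-slaves.sh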

Upvotes: 0
