Reputation: 106
I am trying to run a simple Spark job via spark-shell, and it looks like the BlockManager for the spark-shell listens on localhost instead of the configured IP address, which causes the job to fail. The exception thrown is "Failed to connect to localhost".
Here is my configuration:
Machine 1 (ubunt64): Spark Master [192.168.253.136]
Machine 2 (ubuntu64server): Spark Slave [192.168.253.137]
Machine 3 (ubuntu64server2): Spark Shell Client [192.168.253.138]
Spark version: spark-1.3.0-bin-hadoop2.4
Environment: Ubuntu 14.04
Source Code to be executed in Spark Shell:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Point the new context at the standalone master and advertise the
// driver's address so the workers can connect back to this machine.
val conf = new SparkConf().setMaster("spark://192.168.253.136:7077")
conf.set("spark.driver.host", "192.168.253.138")
conf.set("spark.local.ip", "192.168.253.138")

// Stop the context the shell created, then build a new one with the config above.
sc.stop()
val sc = new SparkContext(conf)

val textFile = sc.textFile("README.md")
textFile.count()
The above code works fine if I run it on Machine 2, where the slave is running, but it fails on Machine 1 (the master) and Machine 3 (the Spark shell client).
Not sure why the Spark shell listens on localhost instead of the configured IP address. I have set SPARK_LOCAL_IP on Machine 3 in spark-env.sh as well as in .bashrc (export SPARK_LOCAL_IP=192.168.253.138). I confirmed that the spark-shell Java process does listen on port 44015. Not sure why the Spark shell is broadcasting the localhost address.
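For reference, this is the relevant entry in conf/spark-env.sh on Machine 3 (the same export is also in .bashrc):
# conf/spark-env.sh on Machine 3 (ubuntu64server2)
export SPARK_LOCAL_IP=192.168.253.138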
Any help troubleshooting this issue would be highly appreciated. I am probably missing some configuration setting.
Logs:
scala> val textFile = sc.textFile("README.md")
15/04/22 18:15:22 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/04/22 18:15:22 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/04/22 18:15:22 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/04/22 18:15:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/04/22 18:15:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:44015 (size: 22.2 KB, free: 267.2 MB)
scala> textFile.count()
15/04/22 18:16:07 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (README.md MapPartitionsRDD[1] at textFile at <console>:25)
15/04/22 18:16:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/04/22 18:16:08 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ubuntu64server, PROCESS_LOCAL, 1326 bytes)
15/04/22 18:16:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ubuntu64server, PROCESS_LOCAL, 1326 bytes)
15/04/22 18:16:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ubuntu64server): java.io.IOException: Failed to connect to localhost/127.0.0.1:44015
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Upvotes: 3
Views: 7785
Reputation: 106
Found a workaround for this BlockManager localhost issue: provide the Spark master address at shell initiation (it can also be set in spark-defaults.conf).
./spark-shell --master spark://192.168.253.136:7077
This way I didn't have to stop the Spark context, and the original context was able to read files as well as data from Cassandra tables.
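For anyone who prefers the config-file route, here is a minimal sketch of the equivalent conf/spark-defaults.conf entry (assuming the same master address as above):
# conf/spark-defaults.conf
spark.master    spark://192.168.253.136:7077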
Here is the log line showing the BlockManager listening on localhost (after stopping the shell's context and creating one dynamically), which fails with the "Failed to connect" exception:
15/04/25 07:10:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40235 (size: 1966.0 B, free: 267.2 MB)
Compare that to the BlockManager listening on the actual server name (when the Spark master is provided on the command line), which works:
15/04/25 07:12:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ubuntu64server2:33301 (size: 1966.0 B, free: 267.2 MB)
This looks like a bug in the BlockManager code when the context is created dynamically in the shell.
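If you still need the driver-side properties from the question, they can also be passed at shell startup via --conf instead of stopping the context; a sketch I haven't verified on this cluster:
./spark-shell --master spark://192.168.253.136:7077 \
  --conf spark.driver.host=192.168.253.138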
Hope this helps someone.
Upvotes: 2