Krzysiek Setlak

Reputation: 311

Basic Spark example not working

I'm learning Spark and wanted to run the simplest possible cluster consisting of two physical machines. I've done all the basic setup and it seems to be fine. The output of the automatic start script looks as follows:

[username@localhost sbin]$ ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.master.Master-1-localhost.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.out
username@192.168.???.??: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out

So no errors here, and it seems that a Master node is running, as well as two Worker nodes. However, when I open the web UI at 192.168.???.??:8080, it only lists one worker: the local one. My issue is similar to the one described here: Spark Clusters: worker info doesn't show on web UI, but there's nothing unusual going on in my /etc/hosts file. All it contains is:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6 

What am I missing? Both machines are running Fedora Workstation x86_64.

Upvotes: 10

Views: 22035

Answers (3)

dsncode

Reputation: 2441

It seems that Spark is very picky about IP addresses and machine names: when starting, the master registers itself under the machine's hostname, and if that name is not resolvable from your workers, they will not be able to reach it.

A workaround is to start your master like this:

SPARK_MASTER_IP=YOUR_SPARK_MASTER_IP ${SPARK_HOME}/sbin/start-master.sh

Then you can connect your workers like this:

${SPARK_HOME}/sbin/start-slave.sh spark://YOUR_SPARK_MASTER_IP:PORT
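
For instance, assuming the master's LAN address is 192.168.1.10 (a hypothetical placeholder) and the default master port 7077, the two commands would look like this:

SPARK_MASTER_IP=192.168.1.10 ${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-slave.sh spark://192.168.1.10:7077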

and there you go!

Upvotes: 4

Salim

Reputation: 2178

I had a similar issue, which was resolved by setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh. spark-env.sh sets the SPARK_MASTER_IP environment variable, which tells start-master.sh which address to bind the Master to. Once the Master is bound to that address rather than to localhost, it becomes reachable from outside the box it is running on.
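
A minimal sketch of that configuration, assuming the master's LAN address is 192.168.1.10 (a hypothetical placeholder):

# $SPARK_HOME/conf/spark-env.sh
# Bind the standalone Master to the machine's LAN address
# instead of whatever the hostname happens to resolve to.
export SPARK_MASTER_IP=192.168.1.10

After editing the file, restart the master with sbin/stop-master.sh followed by sbin/start-master.sh so the new setting takes effect.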

Upvotes: 0

zero323

Reputation: 330063

Basically, the source of the problem is that the master's hostname resolves to localhost. This is visible both in the console output:

starting org.apache.spark.deploy.master.Master, logging to 
/home/.../spark-username-org.apache.spark.deploy.master.Master-1-localhost.out

where the last part corresponds to the hostname. You can see the same behavior in the master log:

16/02/17 11:13:54 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.128.224 instead (on interface eno1)

and in the remote worker logs:

16/02/17 11:13:58 WARN Worker: Failed to connect to master localhost:7077
java.io.IOException: Failed to connect to localhost/127.0.0.1:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:7077

It means that the remote worker tries to access the master on localhost and obviously fails. Even if the worker were able to connect to the master, it wouldn't work in the reverse direction for the same reason.

Some ways to solve this problem:

  • Provide a proper network configuration for both workers and master, so that the hostnames used by each machine resolve to the corresponding IP addresses (see the /etc/hosts sketch below).
  • Use SSH tunnels to forward all required ports between the remote workers and the master (see the tunnel sketch below).
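
For the first option, a minimal sketch of /etc/hosts entries that both machines could share, assuming hypothetical addresses 192.168.1.10 for the master and 192.168.1.11 for the worker:

192.168.1.10   spark-master
192.168.1.11   spark-worker1

With entries like these on both boxes (and matching hostnames set on each machine), the master registers under a name the worker can actually resolve. For the second option, a hypothetical tunnel forwarding only the master port 7077 from the worker box would look like this; note that in practice Spark also uses additional ports (RPC, web UI) that would need forwarding:

# On the worker machine: requests to localhost:7077 are forwarded
# over SSH to port 7077 on the master host.
ssh -N -L 7077:localhost:7077 username@spark-master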

Upvotes: 5
