Reputation: 113
I have deployed a 3-node AWS Elastic MapReduce (EMR) cluster bootstrapped with Apache Spark. From my local machine, I can access the master node over SSH:
ssh -i <key> hadoop@ec2-master-node-public-address
Once SSH'd into the master node, I can access PySpark via pyspark.
Additionally (although insecure), I have configured my master node's security group to accept TCP traffic on port 7077 specifically from my local machine's IP address.
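For reference, that ingress rule is equivalent to something like the following boto3 sketch (the group ID, region, and IP address below are placeholders, not my actual values):

import boto3

# Illustrative only: allow TCP 7077 from a single IP (all values are placeholders)
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the EMR master node's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 7077,
        "ToPort": 7077,
        "IpRanges": [{"CidrIp": "203.0.113.7/32"}],  # my local machine's public IP
    }],
)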
However, I am still unable to connect my local PySpark instance to my cluster:
MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark
The above command results in a number of exceptions, and PySpark fails to initialize a SparkContext object.
Does anyone know how to successfully create a remote connection like the one I am describing above?
Upvotes: 11
Views: 2321
Reputation: 140
I have done something similar, where I connected Spark installed on an EC2 machine to the master node of a Hadoop cluster.
Make sure that access from the EC2 machine to the Hadoop master node is properly configured.
import os

from pyspark.sql import SparkSession

# Point Spark at the cluster's Hadoop/YARN configuration directories
os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'
os.environ['YARN_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'

spark = (
    SparkSession.builder
    .appName("MySparkApp")
    .master("yarn")  # submit against the remote YARN resource manager
    # HDFS namenode on the master node (note the hdfs:// scheme)
    .config("spark.hadoop.fs.defaultFS", "hdfs://<master_ip>:9000")
    # YARN resource manager and scheduler addresses on the master node
    .config("spark.hadoop.yarn.resourcemanager.address", "<master_ip>:8040")
    .config("spark.hadoop.yarn.resourcemanager.scheduler.address", "<master_ip>:8030")
    .getOrCreate()
)
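If the session comes up, a quick smoke test (a minimal sketch, reusing the spark variable from above) is to run a trivial job and confirm it actually runs through YARN:

# Trivial job: if this completes, the driver reached the YARN cluster
df = spark.range(1000)
print(df.count())                 # expect 1000
print(spark.sparkContext.master)  # expect "yarn"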
Upvotes: 0
Reputation: 40360
Unless your local machine is the cluster's master node, you cannot do that; this is not possible with AWS EMR.
Upvotes: -1