Soubhik

Reputation: 113

How can I connect PySpark (local machine) to my EMR cluster?

I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:

ssh -i <key> <user>@ec2-master-node-public-address

Once SSH'd into the master node, I can access PySpark via pyspark. Additionally (although insecure), I have configured my master node's security group to accept TCP traffic from my local machine's IP address, specifically on port 7077.

However, I am still unable to connect my local PySpark instance to my cluster:

MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark

The above command results in a number of exceptions, and PySpark fails to initialize a SparkContext object.

Does anyone know how to successfully create a remote connection like the one I am describing above?

Upvotes: 11

Views: 2321

Answers (2)

visuman

Reputation: 140

I have done something similar, where I connected Spark installed on an EC2 machine to the master node of a Hadoop cluster.

Make sure that network access from the EC2 machine to the Hadoop master node is properly configured, then build the session against YARN as shown below.
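For example, to quickly confirm the ResourceManager port is reachable before building the session (a minimal sketch; the host placeholder and port are taken from the config below, adjust them to your cluster):

import socket

# Raises an exception if the ResourceManager RPC port cannot be reached
socket.create_connection(("<master_ip>", 8040), timeout=5).close()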

import os
from pyspark.sql import SparkSession

# Hadoop/YARN client configuration copied from the cluster's master node
os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'
os.environ['YARN_CONF_DIR'] = '/etc/hadoop/hadoop/etc/hadoop'

# Point the session at YARN; note that fs.defaultFS needs the hdfs:// scheme
spark = SparkSession.builder \
  .appName("MySparkApp") \
  .master("yarn") \
  .config("spark.hadoop.fs.defaultFS", "hdfs://<master_ip>:9000") \
  .config("spark.hadoop.yarn.resourcemanager.address", "<master_ip>:8040") \
  .config("spark.hadoop.yarn.resourcemanager.scheduler.address", "<master_ip>:8030") \
  .getOrCreate()
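As a quick sanity check (a minimal, hypothetical example assuming the session above comes up), run a small distributed job to confirm executors are actually launched on the cluster:

# The grouping and count run on YARN executors, not on the local machine
df = spark.range(1000)
print(df.groupBy((df.id % 10).alias("bucket")).count().collect())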

Upvotes: 0

eliasah

Reputation: 40360

Unless your local machine is the master node of your cluster, you cannot do that; it is not possible with AWS EMR.

Upvotes: -1
