Alex R.

Reputation: 1437

Setting up Jupyter Pyspark to work between EC2 and EMR

I have a spark cluster running in EMR. I also have a jupyter notebook running on a second EC2 machine. I would like to use spark on my EC2 instance through jupyter. I'm looking for references on how to configure spark to access the EMR cluster from EC2. Searching gives me only guides on how to setup spark on either EMR or EC2, but not how to access one from the other.

I saw a similar question here:

Sending Commands from Jupyter/IPython running on EC2 to EMR cluster

However, the setup there uses a bootstrap action to setup zeppelin, and I'm not sure how to edit my hadoop configuration on EC2.

Upvotes: 0

Views: 537

Answers (4)

Vishal

Reputation: 31

To configure a SparkSession in an EC2 Jupyter Notebook to connect to the EMR (6.x.x) Spark master, do the following:

  1. On the EMR cluster, start the Spark Connect server:
# The version assumed here is 3.5.0
sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0
# OR 
sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://<EMR-MASTER-NODE-PRIVATE-IP>:9083 
# Replace <EMR-MASTER-NODE-PRIVATE-IP> with your EMR master node's private IP address

# Check the pyspark version on EMR (log in to the EMR master node as the hadoop user)
pyspark --version
# The same version will be referenced in the next steps
  2. On the EC2 instance with Jupyter Notebook:

    • Download and extract Spark 3.5.0 [ https://spark.apache.org/downloads.html ]
    • Move the unpacked content to /opt/spark or another appropriate directory
    • Set environment variables
      export SPARK_HOME=/opt/spark
      export PATH=$SPARK_HOME/bin:$PATH
      export PATH=$SPARK_HOME/sbin:$PATH
      
    • Install required packages: pandas, pyarrow, grpcio, protobuf
    • Install the same pyspark version as the one noted above for EMR
      pip3 install pyspark==3.5.0
      
  3. In the EC2 Jupyter Notebook, use the code below to connect:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
 .remote("sc://<EMR-MASTER-NODE-PRIVATE-IP>:15002") \
 .config("spark.sql.catalogImplementation", "hive") \
 .config("spark.hadoop.fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem") \
 .config("spark.sql.session.localRelationCacheThreshold", "1000") \
 .getOrCreate()
# Replace <EMR-MASTER-NODE-PRIVATE-IP> with the EMR master node's private IP address.
  4. Verify the Spark configuration with the following in the EC2 Jupyter notebook:
print("spark master          - ",spark.conf.get("spark.master"))
print("spark executer memory - ",spark.conf.get("spark.executor.memory"))
print("spark driver memory   - ",spark.conf.get("spark.driver.memory"))
print("spark no of cores     - ",spark.conf.get("spark.executor.cores"))

Upvotes: 0

Mousam Singh

Reputation: 755

This is quite late but may help people looking for the solution in the future.

The solution here is to copy the Hadoop, Spark and Hive configuration files from the EMR cluster nodes to the EC2 machine and place them at the corresponding config locations for each (sample config files should already be present in locations similar to /etc/hadoop/conf). Your EC2 machine will then start using the EMR cluster as the master for all its jobs.

If you face any DNS resolution problem, replace all occurrences of the master node's DNS name with its actual IP, or add an entry for it in the /etc/hosts file so it can be resolved from the EC2 machine.

# Run from the EMR master node; -r is needed to copy the directories recursively
sudo scp -r -i sample.pem /etc/hadoop/conf/ ec2-user@some_ip:/home/ec2-user/spark/hadoop/conf
sudo scp -r -i sample.pem /etc/hive/conf/ ec2-user@some_ip:/home/ec2-user/spark/hive/conf
sudo scp -r -i sample.pem /etc/spark/conf/ ec2-user@some_ip:/home/ec2-user/spark/spark/conf

Now place them at the corresponding config locations on the EC2 machine using a sudo cp command.
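
Once those files are in place, here is a minimal sketch of how the EC2-side PySpark session could pick them up; the directory paths and app name are assumptions, so adjust them to wherever you actually placed the configs:

import os
from pyspark.sql import SparkSession

# Point Hadoop/YARN at the copied configuration directories
# (paths are assumptions based on the scp commands above)
os.environ["HADOOP_CONF_DIR"] = "/home/ec2-user/spark/hadoop/conf"
os.environ["YARN_CONF_DIR"] = "/home/ec2-user/spark/hadoop/conf"

# With the EMR configs visible, "yarn" resolves to the EMR cluster's ResourceManager
spark = (
    SparkSession.builder
    .appName("ec2-to-emr-test")  # illustrative app name
    .master("yarn")
    .getOrCreate()
)

print(spark.range(10).count())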

Upvotes: 1

Parag Chaudhari

Reputation: 348

You can use EMR Notebooks, which do exactly what you are looking for. They sit outside the cluster, and you can attach them to any EMR cluster of your choice.

More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html

You can also add any Python dependencies your PySpark job needs from within the notebook. Those will be available on the EMR cluster and isolated to your own notebook session.

More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
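
For example, inside a cell of an EMR notebook attached to a running cluster, the notebook-scoped library helpers described in the linked guide look roughly like this (the package name is illustrative):

# Runs in an EMR notebook cell with the PySpark kernel; sc is the SparkContext
# the notebook provides
sc.install_pypi_package("boto3")  # install a dependency for this notebook session only
sc.list_packages()                # list the packages visible to this session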

Upvotes: 0

Bikash Joshi

Reputation: 103

The right way to do it is to run your Jupyter on the master node (the EC2 instance assigned as the master) and submit your Spark application there.

Upvotes: 0
