Reputation: 1437
I have a Spark cluster running in EMR. I also have a Jupyter notebook running on a second EC2 machine. I would like to use Spark on my EC2 instance through Jupyter. I'm looking for references on how to configure Spark to access the EMR cluster from EC2. Searching gives me only guides on how to set up Spark on either EMR or EC2, but not how to access one from the other.
I saw a similar question here:
Sending Commands from Jupyter/IPython running on EC2 to EMR cluster
However, the setup there uses a bootstrap action to set up Zeppelin, and I'm not sure how to edit my Hadoop configuration on EC2.
Upvotes: 0
Views: 537
Reputation: 31
To configure a SparkSession in an EC2 Jupyter Notebook that connects to the EMR (6.x.x) Spark master, do the following.
On the EMR master node, start the Spark Connect server:
# The version assumed here is 3.5.0
sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0
# OR
sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://<EMR-MASTER-NODE-PRIVATE-IP>:9083
# Replace <EMR-MASTER-NODE-PRIVATE-IP> with your EMR master node's private IP address
# Check the pyspark version on EMR (log in to the EMR master node as the hadoop user)
pyspark --version
# The same version will be used in the next steps
On the EC2 instance with Jupyter Notebook:
# Set SPARK_HOME to /opt/spark (or the appropriate directory)
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
pip3 install pyspark==3.5.0
In the EC2 Jupyter Notebook, use the below code to connect:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.remote("sc://<EMR-MASTER-NODE-PRIVATE-IP>:15002") \
.config("spark.sql.catalogImplementation", "hive") \
.config("spark.hadoop.fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem") \
.config("spark.sql.session.localRelationCacheThreshold", "1000") \
.getOrCreate()
# Replace <EMR-MASTER-NODE-PRIVATE-IP> with the EMR master node's private IP address.
print("spark master - ",spark.conf.get("spark.master"))
print("spark executer memory - ",spark.conf.get("spark.executor.memory"))
print("spark driver memory - ",spark.conf.get("spark.driver.memory"))
print("spark no of cores - ",spark.conf.get("spark.executor.cores"))
Upvotes: 0
Reputation: 755
This is quite late, but it may help people looking for a solution in the future.
The solution here is to copy the Hadoop, Spark, and Hive configuration files from the EMR cluster nodes to the EC2 machine and place them at the corresponding config locations for each (a sample config file should already be present in a location similar to /etc/hadoop/conf). Your EC2 machine will then start using the EMR node as the master node for all its jobs.
If you face any DNS resolution problems, replace all occurrences of the master node's DNS name with its actual IP, or add an entry for it in the /etc/hosts file so it can be resolved from the EC2 machine.
sudo scp -r -i sample.pem /etc/hadoop/conf/ ec2-user@some_ip:/home/ec2-user/spark/hadoop/conf
sudo scp -r -i sample.pem /etc/hive/conf/ ec2-user@some_ip:/home/ec2-user/spark/hive/conf
sudo scp -r -i sample.pem /etc/spark/conf/ ec2-user@some_ip:/home/ec2-user/spark/spark/conf
Now place them at the corresponding locations on the EC2 machine using sudo cp.
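As a rough sketch of the notebook side once the configs are in place (the directory paths and settings below are assumptions, not part of the original answer), you can point PySpark at the copied Hadoop/YARN configs and build a session against YARN:
# Hypothetical paths: adjust to wherever you placed the copied EMR config directories
import os
os.environ["HADOOP_CONF_DIR"] = "/home/ec2-user/spark/hadoop/conf"
os.environ["YARN_CONF_DIR"] = "/home/ec2-user/spark/hadoop/conf"

from pyspark.sql import SparkSession

# With the EMR configs visible, "yarn" resolves to the EMR cluster's ResourceManager
spark = SparkSession.builder \
    .master("yarn") \
    .appName("jupyter-on-ec2") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

spark.range(5).show()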
Upvotes: 1
Reputation: 348
You can use EMR Notebooks, which do exactly what you are looking for. They sit outside the cluster, and you can attach them to any EMR cluster of your choice.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html
You can also add any Python dependencies your PySpark job needs from within the notebook. Those will be available on the EMR cluster and isolated to your own notebook session.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
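For instance, notebook-scoped libraries are installed from within the notebook itself; a small sketch (sc is the SparkContext provided by the EMR notebook's PySpark kernel, and the package pin is only an example):
# Install a package for this notebook session only (EMR 5.26.0+ with the PySpark kernel)
sc.install_pypi_package("pandas==0.25.1")

# List the packages visible to this session
sc.list_packages()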
Upvotes: 0
Reputation: 103
The right way to do it is to run Jupyter on the master node (the EC2 instance assigned as the master) and submit your Spark application there.
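A bare-bones sketch of that setup (assuming Jupyter is installed on the master node; the app name is arbitrary): since EMR already ships the Spark and YARN configs on the master, the notebook only needs to build a session against YARN.
from pyspark.sql import SparkSession

# On the EMR master node the installed configs already point at the cluster's ResourceManager
spark = SparkSession.builder.master("yarn").appName("jupyter-on-master").getOrCreate()
spark.range(3).show()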
Upvotes: 0