Reputation: 477
I am learning Spark fundamentals and, to test my PySpark application, I created an EMR cluster on AWS with Spark, YARN, Hadoop, and Oozie. I can successfully execute a simple PySpark application from the driver node using spark-submit. I have the default /etc/spark/conf/spark-defaults.conf file created by AWS, which uses the YARN resource manager. Everything runs fine and I can monitor the Tracking URL as well. But I am not able to tell whether the Spark job is running in 'client' mode or 'cluster' mode. How do I determine that?
Excerpts from /etc/spark/conf/spark-defaults.conf:
spark.master yarn
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory 4743M
spark.executor.cores 2
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory 2048M
Excerpts from my PySpark job:
import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from boto3.session import Session
conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
....# access S3 bucket
....
....
Is there a deployment mode called 'yarn-client' or is it just 'client' and 'cluster'? Also, why is "num-executors" not specified in the config file by AWS? Is that something I need to add?
Thanks
Upvotes: 3
Views: 4256
Reputation: 4698
By default, a Spark application runs in client mode, i.e. the driver runs on the node you submit the application from. Details about these deployment configurations can be found here. One easy way to verify it is to press Ctrl+C in the terminal after the job reaches the RUNNING state: in client mode the application dies with it, while in cluster mode it keeps running, because the driver is running on one of the worker nodes of the EMR cluster. A sample spark-submit command to run the job in cluster mode would be
spark-submit --master yarn \
--py-files my-dependencies.zip \
--num-executors 10 \
--executor-cores 2 \
--executor-memory 5g \
--name sample-pyspark \
--deploy-mode cluster \
package.pyspark.main
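If you would rather not kill a running job to find out, another option is to log the driver's hostname from inside the application. This is just a minimal sketch (not from the original answer), assuming your script is the driver program, as in the question's excerpt:
import socket

# In client mode this prints the hostname of the node you ran spark-submit on
# and appears in your terminal; in cluster mode it prints one of the EMR
# core/task nodes, and the output ends up in the YARN container logs instead
# (retrievable with: yarn logs -applicationId <application id>).
print("Driver host:", socket.gethostname())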
By default, the number of executors (spark.executor.instances) is 2. You can check the default values for all Spark configs here.
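Regarding the follow-up about num-executors: the corresponding property is spark.executor.instances, so instead of passing --num-executors on every submit you could set it in spark-defaults.conf or in the application's SparkConf. A minimal sketch building on the question's code, with 10 as a purely illustrative value:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# spark.executor.instances is the SparkConf / spark-defaults.conf equivalent
# of the --num-executors flag; 10 is just an example, size it to your cluster.
conf = (SparkConf()
        .setAppName('MyFirstPySparkApp')
        .set('spark.executor.instances', '10'))
spark = SparkSession.builder.config(conf=conf).getOrCreate()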
Upvotes: 2
Reputation: 13551
It is determined by the --deploy-mode option you pass when submitting the job; see the documentation.
Once you access the Spark History Server from the EMR console or through its web UI, you can find the spark.submit.deployMode property in the Environment tab. In my case, it is client mode.
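The same value can also be read from inside the application; a minimal sketch, assuming a SparkSession named spark as in the question's excerpt:
# Returns 'client' or 'cluster', matching what the Environment tab shows.
print(spark.sparkContext.getConf().get('spark.submit.deployMode'))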
Upvotes: 5