namrutha

Reputation: 183

Unable to run spark application on EMR using cluster mode

I have a Spark application that I am trying to run on Amazon EMR, but it either fails or stays in the running state and never finishes. The same code completes on my local machine in 2-3 minutes. I suspect an issue with the way I am creating the Spark session; my master configuration is below:

val spark = SparkSession.builder
  .master("local[2]")
  .appName("Graph Creation")
  .config("spark.sql.warehouse.dir", "warehouse")
  .config("spark.sql.shuffle.partitions", "1")
  .getOrCreate()

How can I build the Spark session so that it runs both on my local machine and on Amazon EMR without issues?

Upvotes: 0

Views: 2113

Answers (1)

Yuriy Bondaruk

Reputation: 4750

It's better not to use a local master URL on an EMR cluster, since you won't benefit from the worker nodes. local means Spark runs only on the machine where it is launched and won't try to use the other nodes in the cluster. The main purpose of local is local testing; whenever you want to run on a cluster, you should choose a resource manager (YARN, Mesos, Spark standalone, or a Kubernetes cluster; see here for more details).

You can provide the master URL as an argument to the spark-submit command: pass local when you run locally and yarn on an EMR cluster, for example. The session builder then omits the master call entirely:

val spark = SparkSession.builder
  .appName("Graph Creation")
  .config("spark.sql.warehouse.dir", "warehouse")
  .config("spark.sql.shuffle.partitions", "1")
  .getOrCreate()

And then locally:

./bin/spark-submit --master local[2] ...

On EMR:

./bin/spark-submit --master yarn ...
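If you still want a fallback so the same jar also runs from an IDE (where spark-submit hasn't supplied a master), you can set a local master only when none was provided. The sketch below models just the decision logic as plain Scala, with no Spark dependency; `resolveMaster` is an illustrative helper, not part of the Spark API, and in the real application you would call `.master(...)` on the builder only in the fallback branch:

```scala
// Hypothetical helper: pick the master URL supplied by spark-submit if
// there is one (e.g. read from the spark.master system property),
// otherwise fall back to local[2] for IDE/test runs.
def resolveMaster(submitted: Option[String]): String =
  submitted.getOrElse("local[2]")

// spark-submit --master yarn would supply Some("yarn"):
val onCluster = resolveMaster(Some("yarn"))
// An IDE run with no master supplied falls back to local mode:
val onLaptop = resolveMaster(None)
```

This keeps the cluster deployment in control of the master URL while preserving a convenient local default.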

Upvotes: 3
