Dobob

Reputation: 798

spark-submit using YARN master programmatically not working

I am using Apache Spark 2.1.0. If I do:

$ spark-submit --master yarn main.py

the Spark Python application executes on YARN properly and shows up in the YARN web UI as a finished application.

If I do it programmatically, it doesn't show up in the YARN UI, so I assume it isn't actually using YARN as the master:

from pyspark import SparkContext, SparkConf
import os

from pyspark.sql import *
from pyspark.sql.types import *

def read_cluster_file(file_path, spark, table_name):
    cluster_data = spark.read.csv(file_path, header=True, mode="DROPMALFORMED")    
    cluster_data.createOrReplaceTempView(table_name)

    return cluster_data

def main():
    spark = SparkSession.builder.master("yarn").appName("gene_cluster").getOrCreate()
    dir = os.path.dirname(__file__)
    cluster_data = read_cluster_file("file:" + dir + "/gene_cluster.csv", spark, "cluster")
    result_df = spark.sql("SELECT `subunits(Entrez IDs)` FROM cluster")
    result_df.show()

if __name__ == '__main__':
    main()

How do I make my Spark application run with YARN master programmatically in Python?

I have tried:

Upvotes: 2

Views: 4364

Answers (1)

Srinivas Jill

Reputation: 179

I was facing the same issue on HDP 2.5. I am using the SparkSession API, and even though I set the master to 'yarn', the SparkContext is created in local mode and none of my YARN-related configuration takes effect. I also checked whether there was any issue with the cluster setup by submitting a sample application with spark-submit, using the command below.

spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/hdp/2.5.0.0-1245/spark/lib/spark-examples-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar 10

The job executed fine, and from the Spark Web UI I can clearly see that it ran as a cluster job with the executors distributed among my worker nodes.

After digging a bit into the Spark code, it seems that with pyspark the SparkContext is created without taking the configuration into consideration, and the configuration is applied only after it has been created. Some specific configurations can still take effect after the SparkContext is initialised, but others have to be set properly at initialisation time. I finally managed to run the job on YARN by setting PYSPARK_SUBMIT_ARGS.

export PYSPARK_SUBMIT_ARGS="--master yarn pyspark-shell"

Look into java_gateway.py in the pyspark source for further understanding. We will soon be moving to HDP 3.0 and will update whether this is still necessary in the latest version.
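
For completeness, here is a minimal sketch of applying the same workaround from inside the Python script itself, assuming HADOOP_CONF_DIR/YARN_CONF_DIR are already set in your environment. The variable has to be set before pyspark launches the JVM gateway, i.e. before the SparkSession is created:

import os

# Must be set before the SparkSession/SparkContext is created, otherwise it has
# no effect; the trailing "pyspark-shell" token is required by java_gateway.py.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master yarn --deploy-mode client pyspark-shell"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gene_cluster").getOrCreate()
print(spark.sparkContext.master)  # should report 'yarn' if the args were picked up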

Upvotes: 1
