Reputation: 3564
I'm very new to PySpark.
I am running a script (mainly building a TF-IDF and using it to predict 9 categorical columns) in a Jupyter Notebook. It takes about 5 minutes when I execute all the cells manually, but about 45 minutes when I run the same script with spark-submit. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as
from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.executor.memory', '45G')
        .set('spark.driver.memory', '80G')
        .set('spark.driver.maxResultSize', '20G'))
Any help is appreciated. Thanks in advance.
Upvotes: 1
Views: 2624
Reputation: 11
I had the same problem, but I was initializing my Spark session with this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to run the operations on the local machine using X cores, so you have to tune X to the number of cores available on your machine.
To run on a YARN cluster, you have to put "yarn" instead.
The other possibilities are listed here: https://spark.apache.org/docs/latest/submitting-applications.html
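For example, a minimal sketch that uses every local core instead of just one (the app name is carried over from the line above; swap the master for "yarn" when submitting to a YARN cluster):

from pyspark.sql import SparkSession

# "local[*]" lets Spark use all cores on the local machine
spark = (SparkSession.builder
         .master("local[*]")
         .appName("Test")
         .getOrCreate())

# On a YARN cluster, use the "yarn" master instead:
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()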
Upvotes: 1
Reputation: 3110
There are various ways to run your Spark code, as you mentioned: Jupyter Notebook, the PySpark shell, and spark-submit.
When you run your code in a Jupyter notebook or the PySpark shell, it may already have sensible default values set for executor memory, driver memory, executor cores, etc.
However, when you use spark-submit, these defaults can be different. So the best approach is to pass these values as flags while submitting the PySpark application with the spark-submit utility, and then create the context in your script:
sc = SparkContext(conf=conf)  # apply your SparkConf when creating the context
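For example, a sketch of such a submission (the script name is hypothetical, the memory values simply mirror the ones in the question, and the master should be adjusted to your environment, e.g. "local[*]" or "yarn"):

# your_script.py is a placeholder for your PySpark script
spark-submit \
  --master "local[*]" \
  --executor-memory 45G \
  --driver-memory 80G \
  --conf spark.driver.maxResultSize=20G \
  your_script.py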
Hope this helps.
Regards,
Neeraj
Upvotes: 6