Reputation: 3564
I have a PySpark code/application. What is the best way to run it (to utilize the maximum power of PySpark): using the python interpreter or using spark-submit?
The SO answer here was similar but did not explain it in much detail. I would love to know why.
Any help is appreciated. Thanks in advance.
Upvotes: 1
Views: 2031
Reputation: 780
Running your job in the pyspark shell will always be in client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
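For example (a minimal sketch; the master, script name, and app name below are placeholders), the same script can be submitted in either deploy mode, and the application can read back which mode it was given:

    # Submitted from the command line, e.g.:
    #   spark-submit --master yarn --deploy-mode client  my_app.py   # driver runs on the submitting machine
    #   spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs inside the cluster
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()
    # spark.submit.deployMode is set by spark-submit; fall back to "client" if it is absent.
    print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))
    spark.stop()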
Upvotes: 2
Reputation: 3110
I am assuming that when you say "python interpreter" you are referring to the pyspark shell.
You can run your Spark code either way: using the pySpark interpreter, using spark-submit, or even with one of the available notebooks (Jupyter/Zeppelin).
Generally, when we are learning or doing very basic operations for understanding or exploration purposes, we use the pySpark interpreter.
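For instance (a hypothetical sketch; the file path and column name are placeholders), exploration in the pyspark shell is just a matter of typing commands one at a time, since the shell already provides the spark session:

    # Typed line by line in the pyspark shell; `spark` is created for you.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)  # placeholder path
    df.printSchema()
    df.groupBy("some_column").count().show()  # placeholder column name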
spark-submit, on the other hand, is usually used when you have written your entire application in pySpark and packaged it into .py files, so that you can submit the whole application to the Spark cluster for execution.
A little analogy may help here. Take Unix shell commands: we can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. You can think of the pyspark interpreter and the spark-submit utility in the same way: in the pySpark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole, as in the sketch below.
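To continue the analogy, a packaged application might look like this (the file name, master, and executor count are placeholders, not a prescribed setup):

    # my_app.py -- a self-contained PySpark application (hypothetical example)
    # Submit it with, e.g.:
    #   spark-submit --master yarn --num-executors 4 my_app.py
    from pyspark.sql import SparkSession

    def main():
        # Unlike the pyspark shell, a standalone script must build its own SparkSession.
        spark = SparkSession.builder.appName("my_app").getOrCreate()
        df = spark.range(1000).withColumnRenamed("id", "value")
        print(df.filter("value % 2 == 0").count())  # trivial computation for illustration
        spark.stop()

    if __name__ == "__main__":
        main()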
Hope this helps.
Regards,
Neeraj
Upvotes: 1