lu5er

Reputation: 3564

Running a PySpark code in python vs spark-submit

I have a PySpark code/application. What is the best way to run it (so that I utilize the full power of PySpark): with the Python interpreter or with spark-submit?

The SO answer here is similar, but it does not go into much detail. I would love to know why.

Any help is appreciated. Thanks in advance.

Upvotes: 1

Views: 2031

Answers (2)

srikanth holur

Reputation: 780

Running your job in the pyspark shell will always be in client mode, whereas with spark-submit you can execute it in either mode, i.e. client or cluster.
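
If it helps, you can confirm this at runtime by reading the spark.submit.deployMode property. The sketch below is only an illustration and assumes a standard Spark installation; the my_app.py name in the comment is a placeholder, not a file from the question.

    from pyspark.sql import SparkSession

    # Works both in the pyspark shell (where `spark` already exists) and in a
    # submitted application; getOrCreate() reuses any existing session.
    spark = SparkSession.builder.getOrCreate()

    # Prints "client" when launched from the pyspark shell; prints "cluster"
    # when launched with, e.g., spark-submit --deploy-mode cluster my_app.py
    print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))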

Upvotes: 2

Neeraj Bhadani

Reputation: 3110

I am assuming that by python interpreter you are referring to the pyspark shell.

You can run your Spark code with the pySpark interpreter, with spark-submit, or even from one of the available notebooks (Jupyter/Zeppelin).

  1. When to use PySpark Interpreter.

Generally, the pySpark interpreter is used when you are learning or doing some very basic operations to understand or explore your data; a short interactive sketch is shown after this list.

  2. Spark Submit.

This is usually used when you have written your entire application in pySpark and packaged it into .py files, so that you can submit the whole code base to a Spark cluster for execution; a minimal example follows the analogy below.
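
As a rough illustration of the interpreter workflow, interactive exploration in the pyspark shell looks something like the following; the file path and column name are hypothetical.

    # In the pyspark shell a SparkSession is already available as `spark`,
    # so you can explore the data one command at a time.
    df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)  # hypothetical path
    df.printSchema()
    df.groupBy("country").count().show()  # "country" is a hypothetical column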

A little analogy may help here. Let's take the example of Unix shell commands. We can execute shell commands directly at the command prompt, or we can create a shell script (.sh) to execute a bunch of instructions at once. You can think of the pySpark interpreter and the spark-submit utility in the same way: in the pySpark interpreter you execute individual commands, whereas with spark-submit you package your Spark application into .py files and execute it as a whole.
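
To make the second half of the analogy concrete, a self-contained application might look like the sketch below. The file name, app name, and spark-submit options are illustrative assumptions, not something taken from the question.

    # my_app.py (hypothetical file name). Submit it with something like:
    #   spark-submit --master yarn --deploy-mode cluster my_app.py
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Unlike the shell, a submitted application creates its own SparkSession.
        spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

        df = spark.range(1, 1001)                 # stand-in for real input data
        print(df.selectExpr("sum(id) AS total").first()["total"])

        spark.stop()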

Hope this helps.

Regards,

Neeraj

Upvotes: 1
