Reputation: 2068
The common way of running a spark job appears to be using spark-submit as below (source):
spark-submit --py-files pyfile.py,zipfile.zip main.py --arg1 val1
Being new to Spark, I wanted to know why this first method is preferred over running it from python (example):
python pyfile-that-uses-pyspark.py
The former method yields many more examples when googling the topic, but no explicitly stated reasons for it. In fact, here is another Stack Overflow question where one answer, repeated below, specifically tells the OP not to use the python method, but does not give a reason why.
don't run your py file as: python filename.py; instead use: spark-submit filename.py
Can someone provide insight?
Upvotes: 5
Views: 3910
Reputation: 191844
The slightly longer answer, other than saying the Anaconda docs linked are wrong and that the official documentation never tells you to use python, is that Spark requires a JVM.

spark-submit is a wrapper around a JVM process that sets up the classpath, downloads packages, verifies some configuration, among other things. Running python bypasses this, and all of that would have to be re-built into pyspark/__init__.py so that those processes get run on import.
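As a rough sketch only (the file name and contents below are hypothetical, not taken from the question), this is the kind of script in play:

# minimal_job.py - hypothetical minimal PySpark script for illustration
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Under spark-submit, the JVM, classpath and configuration are already
    # set up before this line runs; under plain `python`, pyspark has to
    # bootstrap its own local JVM gateway when the session is created.
    spark = SparkSession.builder.appName("minimal-example").getOrCreate()

    # A trivial action, just to exercise the Python-to-JVM round trip.
    print(spark.sparkContext.parallelize(range(100)).sum())

    spark.stop()

Running this with python minimal_job.py can work for local experimentation when pyspark is pip-installed, but only because pyspark quietly does part of the setup that spark-submit would otherwise have done up front.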
Upvotes: 2
Reputation: 2525
@mint Your comment is more or less correct.
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.
As I understand it, using python pyfile-that-uses-pyspark.py cannot launch an application on a cluster, or it is at least much harder to do so.
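For example (the master and file names here are only illustrative), submitting the same script to a YARN cluster is just a matter of flags to spark-submit:

spark-submit --master yarn --deploy-mode cluster --py-files deps.zip main.py

whereas python main.py starts the driver only on your local machine, and any cluster-manager wiring would have to be set up by hand inside the script.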
Upvotes: 1