Tom

Reputation: 101

Do I get a speedup running Python libraries on PySpark?

I've tried to read up on exactly where the speedup comes from in Spark when I run Python libraries like pandas or scikit-learn, but I haven't found anything particularly informative. If the speedup doesn't depend on using PySpark DataFrames, can I just deploy code using pandas and expect it to perform roughly the same?

I suppose my question is:

If I have working pandas code, should I translate it to PySpark for efficiency or not?

Upvotes: 0

Views: 391

Answers (1)

user9977315

Reputation: 11

If you're asking whether you get any speedup by running arbitrary Python code on the driver node, the answer is no. The driver is a plain Python interpreter; it doesn't affect your code in any "magic" way.
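As a minimal sketch of what that means (the file name and column names here are hypothetical, just for illustration), the following is plain pandas code; submitting it through spark-submit would still execute it entirely in one interpreter on the driver:

    import pandas as pd

    # Runs in a single Python process, even if launched via spark-submit.
    # "data.csv" and the columns "key"/"value" are made-up examples.
    df = pd.read_csv("data.csv")                  # whole file loaded into driver memory
    result = df.groupby("key")["value"].mean()    # computed by one interpreter, on one machine
    print(result)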

If I have working pandas code, should I translate it to PySpark for efficiency or not?

If you want the benefits of distributed computing, then you have to rewrite your code using distributed primitives, as the sketch below shows. However, it is not a free lunch: distribution adds scheduling, serialization, and shuffle overhead, and you may need more hardware just to compensate for it.
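For contrast, here is a minimal sketch of the same hypothetical aggregation rewritten with PySpark's distributed DataFrame primitives, so the work is actually partitioned across executors:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pandas-to-spark-sketch").getOrCreate()

    # The file is read in partitions spread across the executors.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # The aggregation runs in parallel on the executors; only the
    # small result is collected back to the driver for display.
    result = df.groupBy("key").agg(F.mean("value").alias("mean_value"))
    result.show()

    spark.stop()

Note where the overhead comes from: starting a session, scheduling tasks, and shuffling data for the groupBy all cost time that pandas never pays, which is why small datasets often run faster locally.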

In other words, if your code works just fine with pandas or scikit-learn, there is little chance you'll gain anything from rewriting it in Spark.

Upvotes: 1
