Tom

Reputation: 101

Do I get a speedup running Python libraries on PySpark?

I've tried to read up on exactly where the speedup comes from in Spark when I run Python libraries like pandas or scikit-learn, but I haven't found anything particularly informative. If the speedup doesn't depend on using PySpark DataFrames, can I just deploy code using pandas and expect it to perform roughly the same?

I suppose my question is:

If I have working pandas code, should I translate it to PySpark for efficiency or not?

Upvotes: 0

Views: 391

Answers (1)

user9977315

Reputation: 11

If you're asking whether you get any speedup by running arbitrary Python code on the driver node, the answer is no. The driver is a plain Python interpreter; it doesn't affect your code in any "magic" way.
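As a minimal sketch of what that means (the file name and column names here are hypothetical, just for illustration), the following is plain pandas code; submitting it through spark-submit would still execute it entirely in one interpreter on the driver:

    import pandas as pd

    # Runs in a single Python process, even if launched via spark-submit.
    # "data.csv" and the columns "key"/"value" are made-up examples.
    df = pd.read_csv("data.csv")                  # whole file loaded into driver memory
    result = df.groupby("key")["value"].mean()    # computed by one interpreter, on one machine
    print(result)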

If I have working pandas code, should I translate it to PySpark for efficiency or not?

If you want the benefits of distributed computing, then you have to rewrite your code using distributed primitives, as the sketch below shows. However, it is not a free lunch: distribution adds scheduling, serialization, and shuffle overhead, and you may need more hardware just to compensate for it.
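For contrast, here is a minimal sketch of the same hypothetical aggregation rewritten with PySpark's distributed DataFrame primitives, so the work is actually partitioned across executors:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pandas-to-spark-sketch").getOrCreate()

    # The file is read in partitions spread across the executors.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # The aggregation runs in parallel on the executors; only the
    # small result is collected back to the driver for display.
    result = df.groupBy("key").agg(F.mean("value").alias("mean_value"))
    result.show()

    spark.stop()

Note where the overhead comes from: starting a session, scheduling tasks, and shuffling data for the groupBy all cost time that pandas never pays, which is why small datasets often run faster locally.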

In other words, if your code works just fine with pandas or scikit-learn, there is little chance you'll gain anything from rewriting it in Spark.

Upvotes: 1
