How can I accelerate the caching in Spark (Pyspark)?

I need to cache a DataFrame in PySpark (2.4.4), and the in-memory caching is slow.

I benchmarked Pandas caching against Spark caching by reading the same CSV file. Pandas was 3-4 times faster.

Thanks in advance.

Upvotes: 0

Answers (1)

cronoik

Reputation: 19310

You are comparing apples and oranges. Pandas is a single-machine, single-core data analysis library, whereas PySpark is a distributed (cluster computing) data analysis engine. That means PySpark will never outperform pandas when reading a small file on a single machine, because of its overhead (distributed architecture, JVM...). It also means that PySpark will outperform pandas as soon as your file exceeds a certain size.
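To make the trade-off concrete, here is a rough sketch of a toy cost model in plain Python: each engine pays a fixed startup overhead plus a per-megabyte cost. All the numbers (overheads and throughputs) and the helper names `load_time` and `crossover_size_mb` are assumptions picked for illustration, not measurements of pandas or PySpark; you would have to benchmark your own setup to get real values.

```python
# Toy cost model: every number here is an illustrative assumption,
# not a measurement of pandas or PySpark.

def load_time(size_mb, fixed_overhead_s, throughput_mb_per_s):
    """Estimated seconds to read and cache `size_mb` megabytes."""
    return fixed_overhead_s + size_mb / throughput_mb_per_s


def crossover_size_mb(overhead_a, throughput_a, overhead_b, throughput_b):
    """File size at which engine A and engine B take the same time.

    Returns None when the two cost lines never cross at a positive size.
    """
    if throughput_a == throughput_b:
        return None
    size = (overhead_b - overhead_a) / (1.0 / throughput_a - 1.0 / throughput_b)
    return size if size > 0 else None


# Hypothetical figures: a single-core reader with no startup cost versus
# a distributed engine with a few seconds of startup but higher throughput.
PANDAS = {"fixed_overhead_s": 0.0, "throughput_mb_per_s": 100.0}
SPARK = {"fixed_overhead_s": 5.0, "throughput_mb_per_s": 400.0}

crossover = crossover_size_mb(
    PANDAS["fixed_overhead_s"], PANDAS["throughput_mb_per_s"],
    SPARK["fixed_overhead_s"], SPARK["throughput_mb_per_s"],
)

for size_mb in (100, 2000):
    t_pandas = load_time(size_mb, **PANDAS)
    t_spark = load_time(size_mb, **SPARK)
    print(f"{size_mb} MB: pandas ~{t_pandas:.2f}s, spark ~{t_spark:.2f}s")

print(f"crossover at roughly {crossover:.0f} MB under these assumptions")
```

Under these made-up figures the small file is faster in the single-core model and the large file is faster in the distributed model. Where the crossover actually sits depends entirely on your hardware, cluster, and data.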

As a developer, you have to choose the solution that best fits your requirements. If pandas is faster for your project and you don't expect a huge increase in data in the future, use pandas. Otherwise, use PySpark or Dask or...

Upvotes: 4
