Reputation:
I need to cache a dataframe in Pyspark(2.4.4), and the memory caching is slow.
I benchmark the Pandas caching with Spark caching, by reading the same file(CSV). Specifically, Pandas was 3-4 times faster.
Thanks, In advance
Upvotes: 0
Views: 238
Reputation: 19310
You are comparing apples and oranges. Pandas is a single machine single core data analysis library whereas pyspark is distributed (cluster computing) data analysis engine. That means you will never outperform pandas reading a small file on a single machine with pyspark due to the overhead (distributed architecture, JVM...). That also means that pyspark will outperform pandas as soon as your file exceeds a certain size.
You as a developer has to choose the solution which best fits your requirements. When pandas is faster for your project and you don't expect a huge increase of data in the future, use pandas. Otherwise use pyspark or dask or...
Upvotes: 4