What is the difference between pandas_udf from PySpark and toPandas?

Question

When I clean big data with pandas, I have two options: one is to use @pandas_udf from PySpark 2.3+ to clean the data; the other is to convert the Spark DataFrame (sdf) to a pandas DataFrame (pdf) with toPandas() and then clean it with pandas.
I'm confused about how these two methods differ.

I hope someone can explain the difference in terms of distribution, speed, and other aspects.

akuiper · Accepted Answer

TL;DR: @pandas_udf and toPandas are very different.

@pandas_udf

Creates a vectorized user defined function (UDF).

It leverages the vectorization feature of pandas and serves as a faster alternative to a regular udf, and it works on the distributed dataset. To learn more about pandas_udf performance, see the pandas_udf vs udf performance benchmark.
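As a rough illustration of the point above, here is a minimal sketch of cleaning a string column with a scalar pandas_udf. The names `clean_text`, `add_clean_column`, and the `name` column are made up for this example, and the cleaning rule (trim and lower-case) is only an assumption about what "cleaning" might mean. The pandas function receives one batch of the column at a time as a `pd.Series`, and Spark runs it on the executors, so the data never has to fit on the driver.

```python
import pandas as pd


def clean_text(values: pd.Series) -> pd.Series:
    """Trim whitespace and lower-case one batch of strings (plain pandas)."""
    return values.str.strip().str.lower()


def add_clean_column(sdf, column="name"):
    """Add a cleaned copy of `column` to a Spark DataFrame without collecting it.

    `sdf` and `column` are hypothetical; adapt them to your own data.
    """
    # Imported lazily so clean_text can be tried without Spark installed.
    from pyspark.sql.functions import col, pandas_udf
    from pyspark.sql.types import StringType

    # Wrap the pandas function as a vectorized (scalar) UDF.
    clean_udf = pandas_udf(clean_text, returnType=StringType())
    return sdf.withColumn(column + "_clean", clean_udf(col(column)))
```

Because `clean_text` is an ordinary pandas function, it can be checked on a small `pd.Series` locally before being applied to the full distributed dataset.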

toPandas, by contrast, collects the distributed Spark DataFrame into a pandas DataFrame. A pandas DataFrame is local and resides in the driver's memory, so:

this method should only be used if the resulting Pandas’s DataFrame is expected to be small, as all the data is loaded into the driver’s memory.

So if your data is large, you can't use toPandas; @pandas_udf, udf, or other built-in methods would be your only options.
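For the case where the data really is small, a hedged sketch of the toPandas route might look like the following. The helper names `clean_locally` and `collect_if_small`, the `max_rows` threshold of 100,000, and the cleaning steps (drop nulls and duplicates) are all assumptions chosen for illustration, not recommended values. Everything after `toPandas()` runs on a single machine, the driver.

```python
import pandas as pd


def clean_locally(pdf: pd.DataFrame) -> pd.DataFrame:
    """Clean a local pandas DataFrame: drop null rows and duplicates."""
    return pdf.dropna().drop_duplicates().reset_index(drop=True)


def collect_if_small(sdf, max_rows=100_000):
    """Collect a Spark DataFrame to the driver only when it is small enough.

    `max_rows` is an arbitrary illustrative limit; what actually fits
    depends on the driver's memory and the width of the rows.
    """
    row_count = sdf.count()  # runs a distributed job, returns an int
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows is too many to collect; "
            "use pandas_udf or built-in functions instead"
        )
    # toPandas() pulls every row into the driver's memory.
    return clean_locally(sdf.toPandas())
```

The row-count check is just one way to guard against accidentally pulling a large dataset onto the driver; filtering or aggregating in Spark first and collecting only the reduced result is another.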
