Reputation: 13
When I clean big data with pandas, I have two options: one is to use @pandas_udf (available from pyspark 2.3+) to clean the data; the other is to convert the sdf (Spark DataFrame) to a pdf (pandas DataFrame) with toPandas() and then clean it with pandas.
What is the difference between these two methods?
I hope someone can explain in terms of distribution, speed, and any other relevant aspects.
Upvotes: 1
Views: 3511
Reputation: 215117
TL;DR: @pandas_udf and toPandas are very different.
@pandas_udf creates a vectorized user-defined function (UDF) that leverages the vectorization feature of pandas and serves as a faster alternative to udf. It works on the distributed dataset, so the data stays on the cluster. To learn more about pandas_udf performance, you can read the pandas_udf vs udf performance benchmark here.
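To make the distributed path concrete, here is a minimal sketch, not a definitive recipe. The function names clean_names and clean_with_pandas_udf and the column name "name" are made up for illustration, and it assumes pyspark 2.3+ with pyarrow installed. The cleaning logic is an ordinary pandas function that receives one batch of rows at a time as a pandas Series; Spark calls it on the executors, so the full dataset is never pulled to the driver.

```python
import pandas as pd


def clean_names(names: pd.Series) -> pd.Series:
    """Trim whitespace and lower-case a whole batch of strings at once."""
    return names.str.strip().str.lower()


def clean_with_pandas_udf(sdf, column="name"):
    """Apply clean_names to one column of a Spark DataFrame on the cluster.

    `sdf` is a Spark DataFrame; `column` is a hypothetical string column.
    """
    # Imported lazily so the pandas-only function above can be tried without Spark.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    # Wrap the pandas function as a scalar (Series -> Series) vectorized UDF.
    clean_names_udf = pandas_udf(clean_names, returnType=StringType())
    return sdf.withColumn(column, clean_names_udf(sdf[column]))
```

Because clean_names is plain pandas, you can check it locally on a small Series before handing it to Spark.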
toPandas, by contrast, collects the distributed Spark DataFrame into a pandas DataFrame. A pandas DataFrame is local and resides in the driver's memory, so:
This method should only be used if the resulting Pandas’s DataFrame is expected to be small, as all the data is loaded into the driver’s memory.
So if your data is large, you can't use toPandas; @pandas_udf, udf, or other built-in methods are your only options.
Upvotes: 1