Reputation: 2057
How does PySpark's toPandas work internally?
I am aware that a Spark DataFrame can be converted to a pandas DataFrame with the toPandas method, as in spark_df.toPandas().
After toPandas is triggered, does it pull all the data to the driver and convert it to a pandas DataFrame there, or does the conversion happen on the workers, so that pandas DataFrames are created locally on the worker nodes?
Upvotes: 0
Views: 360
Reputation: 45319
Pandas data frames are not distributed. toPandas() will cause the data frame's rows to be collected to the driver and then converted into a single Pandas data frame, as mentioned in the docs:
toPandas()
Collect all the rows and return a pandas.DataFrame.
So all the warnings regarding collection of data onto a single node (the driver, in this case) apply to toPandas too.
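To make the collect-then-convert behaviour concrete, here is a rough sketch of what happens conceptually. It is not Spark's actual implementation (which may also use Arrow, depending on configuration). The helper names collect_then_convert and bounded_to_pandas are made up for illustration; the only Spark calls assumed are DataFrame.collect(), DataFrame.limit() and the DataFrame.columns attribute.

```python
import pandas as pd


def collect_then_convert(spark_df):
    """Illustrative stand-in for toPandas(): gather rows, then build pandas.

    Hypothetical helper, not Spark's real code path.
    """
    # collect() ships every row from the executors to the driver process,
    # so the whole dataset has to fit in driver memory.
    rows = spark_df.collect()
    # The pandas DataFrame is then built locally on the driver from those rows.
    return pd.DataFrame.from_records(rows, columns=spark_df.columns)


def bounded_to_pandas(spark_df, max_rows):
    """Hypothetical guard: cap the row count before pulling data to the driver."""
    # limit() is applied on the Spark side, so at most max_rows rows are collected.
    return collect_then_convert(spark_df.limit(max_rows))
```

With a real Spark DataFrame, something like bounded_to_pandas(spark_df, 10000) would keep the driver-side result small; in practice the equivalent is simply spark_df.limit(10000).toPandas().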
Upvotes: 1