Nag

Reputation: 2057

How does PySpark's toPandas() method work internally?

How does PySpark's toPandas() work internally? I am aware that a Spark DataFrame can be converted to a pandas DataFrame with spark_df.toPandas().

After toPandas() is called, is all the data pulled to the driver and converted to a pandas DataFrame there, or does the conversion happen on the workers, so that a pandas DataFrame is created locally on each worker node?

Upvotes: 0

Views: 360

Answers (1)

ernest_k

Reputation: 45319

Pandas DataFrames are not distributed. toPandas() collects all rows of the Spark DataFrame to the driver and converts them there into a single pandas DataFrame, as mentioned in the docs:

toPandas()
Collect all the rows and return a pandas.DataFrame.

So all the warnings about collecting data onto a single node (the driver, in this case) apply to toPandas() too. In particular, the whole result has to fit in the driver's memory.
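Since everything ends up on the driver, one way to guard against pulling too much data is to check the size before converting. The following is only a minimal sketch: to_pandas_bounded and its max_rows default are hypothetical names chosen for illustration, not part of PySpark. It relies only on the DataFrame methods limit(), count() and toPandas().

```python
def to_pandas_bounded(spark_df, max_rows=100_000):
    """Convert a Spark DataFrame to pandas, refusing if it has too many rows.

    Hypothetical helper for illustration; not part of the PySpark API.
    The max_rows default is an arbitrary assumption, tune it to driver memory.
    """
    # limit(max_rows + 1) means count() never has to report more than
    # max_rows + 1 rows, which is enough to detect an oversized DataFrame.
    if spark_df.limit(max_rows + 1).count() > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "filter or aggregate it before calling toPandas()"
        )
    # toPandas() collects every row to the driver and builds one
    # pandas DataFrame there.
    return spark_df.toPandas()
```

Assuming an existing SparkSession named spark, it could be called as to_pandas_bounded(spark.range(1000)). The check costs an extra Spark job, so it trades some runtime for protection against an out-of-memory error on the driver.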

Upvotes: 1
