AACosgrove

Reputation: 95

Converting a PySpark data frame to a PySpark.pandas data frame

According to this link, Spark 3.2 lets users work with pandas on top of PySpark. Does it take a long time to convert a PySpark DataFrame to a pyspark.pandas DataFrame?

I know it would take a long time to convert a PySpark DataFrame to a plain pandas DataFrame.

Upvotes: 1

Views: 232

Answers (2)

devesh

Reputation: 668

Spark is developed in Scala and starts a JVM at the underlying layer; PySpark is a Python sub-process started by the PythonRDD object in Scala. Py4J is used for communication between Python and the JVM, and Java objects in the JVM can be accessed dynamically from Python through Py4J over a pipe. RDDs need to be serialized in the underlying JVM and deserialized in Python, so when dealing with large data volumes this will be far less efficient than using Scala directly.

(Image: JVM–Python Py4J interaction)

To transfer data efficiently between the JVM and Python processes, we can configure Apache Arrow with Spark. Here are some links on the same:

Official Apache arrow documentation

Apache arrow and spark configuration

Upvotes: 1

Vaebhav

Reputation: 5032

You can go through the link and examples here.

The above link covers converting pandas DataFrames to Spark and vice versa.

Upvotes: 0
