Reputation: 95
In this link, users can work with pandas on top of PySpark in Spark 3.2. Does it take a long time to convert a PySpark DataFrame to a pandas-on-Spark DataFrame?
I know it takes a long time to convert a PySpark DataFrame to a plain pandas DataFrame.
Upvotes: 1
Views: 232
Reputation: 668
Spark is developed in Scala and runs on the JVM, while PySpark is a Python sub-process started by the PythonRDD object on the Scala side. Py4J is used for communication between Python and the JVM: Java objects living in the JVM can be accessed dynamically from Python through Py4J over a pipe. An RDD therefore has to be serialized in the JVM and deserialized again in Python, so when dealing with large data volumes this path is far less efficient than using Scala directly.
To transfer data efficiently between the JVM and Python processes, we can configure Apache Arrow with Spark. Here are some links for the same:
Official Apache Arrow documentation
Apache Arrow and Spark configuration
Upvotes: 1