Elmauro

Reputation: 45

Is Python UDF still inefficient in Spark?

I'm reading the book *Spark: The Definitive Guide: Big Data Processing Made Simple*, which came out in 2018, and it's now 2023. The book says that UDFs written in Python are inefficient, as is using Python code with RDDs. Is that still true?

Upvotes: 1

Views: 1620

Answers (2)

Manish Sharma

Reputation: 11

Spark starts a Python process inside the worker nodes, and the UDF will run inside that Python environment. At this point, there are two processes running inside each worker:

  1. The JVM process, which executes the other transformations written in Spark SQL or PySpark.

  2. The Python process, which processes the UDF logic.

For the Python process to execute the UDF, the data must be transferred between the JVM process and the Python process. The data is first serialized in the JVM process, sent to the Python process, and deserialized there. After the UDF processes the data row by row, the results are serialized in Python and sent back to the JVM process, where they are deserialized again.

This movement of data between the JVM and Python processes, along with the serialization and deserialization, is the root cause of the slow behavior of Python UDFs in Spark.
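The cost described above can be illustrated without Spark at all. Below is a rough, hypothetical stand-in for the JVM-to-Python round trip: the same trivial function is applied directly, and then again with a pickle serialize/deserialize cycle around each value, mimicking the per-row transfer overhead (real PySpark batches rows and uses its own serializers, so this only sketches the principle):

```python
import pickle
import time

def plus_one(x):
    # The "UDF" body: trivial logic, so any slowdown comes from serialization.
    return x + 1

rows = list(range(100_000))

# Direct call: roughly analogous to a JVM-native Spark SQL expression,
# which never leaves the JVM and pays no transfer cost.
start = time.perf_counter()
direct = [plus_one(x) for x in rows]
direct_time = time.perf_counter() - start

# Simulated Python UDF path: serialize each input, deserialize it, apply the
# function, then serialize and deserialize the result -- a stand-in for the
# JVM <-> Python data movement described above.
start = time.perf_counter()
via_pickle = [
    pickle.loads(pickle.dumps(plus_one(pickle.loads(pickle.dumps(x)))))
    for x in rows
]
pickle_time = time.perf_counter() - start

print(f"direct:          {direct_time:.4f}s")
print(f"with ser/deser:  {pickle_time:.4f}s")
```

Both paths compute identical results; the only difference is the serialization round trip, which is why the same UDF logic expressed as a built-in Spark SQL function is typically much faster.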

For more information, check out the LinkedIn post: Why Python UDFs are slow in Spark

Upvotes: 1

Ged

Reputation: 18013

Old knowledge, but still applicable, as per below:

Upvotes: 2
