Reputation: 45
I'm reading the book Spark: The Definitive Guide: Big Data Processing Made Simple,
which came out in 2018, and it's now 2023. The book mentions that UDFs written in Python aren't efficient, and that the same goes for running Python code on RDDs. Is that still true?
Upvotes: 1
Views: 1620
Reputation: 11
When you use a Python UDF, Spark starts a Python process on the worker nodes, and the UDF runs inside that Python environment. At that point, there are two processes running on each worker:
The JVM process, which executes the other transformations written in Spark SQL or PySpark.
The Python process, which processes the UDF logic.
For the Python process to execute the UDF, the data must be transferred between the JVM process and the Python process. This involves serializing the data in the JVM process before sending it to the Python process, where it is deserialized. After the UDF processes the data row by row, the results are serialized in Python and sent back to the JVM process, where they are deserialized again.
This movement of data between the JVM and Python processes, along with the serialization and deserialization, is the root cause of the slow behavior of Python UDFs in Spark.
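As a minimal sketch of the difference (the column and function names here are illustrative, not from the question), the UDF below forces every row through the JVM-to-Python round trip described above, while the built-in equivalent stays entirely inside the JVM:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-overhead-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized in the JVM, shipped to a Python
# worker, deserialized, processed, and the result is sent back.
to_upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(to_upper_udf("name").alias("upper_name")).show()

# Built-in function: runs inside the JVM, fully visible to Catalyst,
# with no cross-process serialization at all.
df.select(F.upper("name").alias("upper_name")).show()
```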
For more information, check out the LinkedIn post: Why Python UDFs are slow in Spark
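Worth noting for 2023, since the question asks whether this is still true: since Spark 2.3, PySpark also offers Arrow-based pandas UDFs, which move data across the JVM/Python boundary in columnar batches instead of row by row, avoiding much (though not all) of the per-row serialization cost described above. A minimal sketch, with an illustrative column name:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "n")

# Arrow-based pandas UDF: data crosses the JVM/Python boundary in
# columnar Arrow batches, and the function operates on whole pandas
# Series at a time instead of single rows.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one("n")).show(5)
```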
Upvotes: 1
Reputation: 18013
Old knowledge but still applicable as per below:
UDFs are slower in general because Catalyst cannot optimize them - they are black boxes to the optimizer. See https://www.codemotion.com/magazine/ai-ml/big-data/light-up-the-spark-in-catalyst-by-avoiding-udf/
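You can see the black box for yourself by comparing physical plans (a sketch; the DataFrame here is hypothetical). The Python UDF shows up as a separate BatchEvalPython step that Catalyst cannot look inside, whereas the built-in expression is folded into the optimized plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "n")

plus_one_udf = F.udf(lambda x: x + 1, IntegerType())

# The physical plan contains a BatchEvalPython node: Catalyst treats
# the UDF as an opaque function it cannot optimize around.
df.select(plus_one_udf("n")).explain()

# The same logic as a built-in expression stays inside the plan
# and remains fully optimizable.
df.select(F.col("n") + 1).explain()
```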
For Python vs. Scala UDFs: yes, Python is slower. Reading https://medium.com/quantumblack/spark-udf-deep-insights-in-performance-f0a95a4d8c62 gives good insight into why. I use Scala when I assist, so this is not my key area, but summarizing an excellent post:
As for RDDs: they cannot be optimized by Catalyst, whether you use PySpark or Scala. You should be using DataFrames or Datasets, as sketched below.
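To illustrate that last point with a hypothetical aggregation: the RDD version below is built from opaque Python lambdas that Catalyst never sees, while the DataFrame version is a declarative plan the optimizer can rewrite freely:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)], ["key", "value"]
)

# RDD version: the lambdas are opaque Python functions, so Catalyst
# cannot reorder, prune, or generate code for any of this.
rdd_result = (
    df.rdd.map(lambda row: (row["key"], row["value"]))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

# DataFrame version: a declarative plan that Catalyst optimizes
# and executes with generated code.
df_result = df.groupBy("key").agg(F.sum("value").alias("total")).collect()
```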
Upvotes: 2