Reputation: 25366
I have a PySpark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, unlike a Scala or Java Spark program, where the UI shows which line of code each stage corresponds to, I can't find which stage corresponds to which line of my PySpark code.
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Thanks!
Upvotes: 22
Views: 3339
Reputation: 419
When you run a toPandas call, the line in the Python code is shown in the SQL tab. Other actions, such as count or a parquet write, do not show the line number. I'm not sure why this is, but I find it very handy.
Upvotes: 1
Reputation: 1455
Is there a way to figure out which stage corresponds to which line of the PySpark code?
Yes. The Spark UI shows the Scala methods called by the PySpark actions in your Python code. Armed with the PySpark codebase, you can readily identify the calling PySpark method. In your example, cache is self-explanatory, and a quick search for javaToPython reveals that it is called by the PySpark DataFrame.rdd method.
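The "quick search" can also be scripted. The following is only a rough sketch using the Python standard library: find_python_callers is a hypothetical helper name, not part of PySpark, and it assumes you have a checkout (or an installed copy) of the PySpark sources to point it at. It walks the .py files and reports every Python function whose body accesses an attribute with the given JVM method name.

```python
import ast
from pathlib import Path


def find_python_callers(source_root, jvm_method):
    """Return (file name, function name, line) for each Python function
    whose body accesses an attribute named `jvm_method`.

    Hypothetical helper for illustration; not part of PySpark.
    """
    hits = []
    for path in sorted(Path(source_root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for func in ast.walk(tree):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            # Look for attribute accesses such as self._jdf.javaToPython
            for node in ast.walk(func):
                if isinstance(node, ast.Attribute) and node.attr == jvm_method:
                    hits.append((path.name, func.name, node.lineno))
    return hits
```

Assuming pyspark is installed, you could call it as find_python_callers(Path(pyspark.__file__).parent, "javaToPython") and then look for the reported Python method in your own code. Note that a hit inside a nested function is reported once for the inner function and once for each enclosing one.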
Upvotes: 1