Edamame

Reputation: 25366

SparkUI for pyspark - corresponding line of code for each stage?

I have a PySpark program running on an AWS cluster, and I am monitoring the job through the Spark UI (see the attached screenshot). With a Scala or Java Spark program, the UI shows which line of code each stage corresponds to. With PySpark, I can't find which stage corresponds to which line of my code.

Is there a way to figure out which stage corresponds to which line of the PySpark code?

Thanks!

[Screenshot of the Spark UI]

Upvotes: 22

Views: 3339

Answers (2)

Chogg

Reputation: 419

When you call toPandas, the corresponding line of the Python code is shown in the SQL tab. Other actions that trigger a job, such as count or parquet, do not show the line number. I'm not sure why this is, but I find it very handy.

Upvotes: 1

Chris

Reputation: 1455

"Is there a way to figure out which stage corresponds to which line of the PySpark code?"

Yes. The Spark UI shows the Scala methods that the PySpark actions in your Python code call. With the PySpark codebase at hand, you can readily identify the PySpark method that makes each call. In your example, cache is self-explanatory, and a quick search for javaToPython reveals that it is called by the PySpark DataFrame.rdd method.
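If you don't want to clone the repository, the same search can be run against the PySpark package installed on your machine. The following is only a rough sketch: find_in_source is a hypothetical helper name I made up (it is not part of PySpark), it does a plain substring match over .py files, and it assumes the method name shown in the UI appears literally in the Python source.

```python
import os


def find_in_source(root, needle):
    """Return (path, line_number, stripped_line) for every line in a .py
    file under root that contains needle. Hypothetical helper, not a
    PySpark API; it is a plain substring search."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if needle in line:
                        hits.append((path, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed; pass any source directory instead")
    else:
        # Search the installed PySpark package for the method seen in the UI.
        root = os.path.dirname(pyspark.__file__)
        for path, lineno, line in find_in_source(root, "javaToPython"):
            print(f"{os.path.relpath(path, root)}:{lineno}: {line}")
```

Each hit gives you a file and line inside PySpark. From there you still have to read upward to see which public method contains the call, and then look for that method in your own code.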

Upvotes: 1
