Reputation: 336
I wonder if there is a better way to see whether PySpark is making progress (while writing to a PL/SQL DB). Currently, the only output I see while my code is running is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/09/17 16:33:17 WARN JdbcUtils: Requested isolation level 1 is not supported; falling back to default isolation level 2
[Stage 3:=============================> (1 + 1) / 2]
This stays the same for anywhere from 1 minute to 1 hour, depending on the dataframe size. Normally I would use progressbar2 or write a counter myself, but Spark works differently and does not "iterate" in the classic way, so I cannot wrap the UDF with the progressbar2 library.
The problem is that it is difficult to tell whether my program is just working through a large dataframe or someone has forgotten to commit to the SQL DB, because when PySpark is waiting for a commit it looks exactly the same. So, as you may have guessed, I have wasted plenty of time there.
df_c = df_a.withColumn("new_col", my_udf(df_b["some_col"]))
It would be nice to see some sort of progress from PySpark while doing this step.
Upvotes: 3
Views: 1777
Reputation: 14845
You can check in the Spark UI what your Spark cluster is currently doing. There you can see whether Spark tasks are completing or whether everything hangs. The default URL of the Spark UI is http://<driver-node>:4040.
If you need the data in a more structured way (for example, for automated processing), you can use the Spark UI's REST interface.
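As a rough sketch (assuming the driver runs on localhost with the UI on the default port 4040, and that the REST API exposes the stage fields named below, e.g. numCompleteTasks and numTasks), you could poll the active stages from a separate process while the job runs:

import time
import requests

DRIVER_URL = "http://localhost:4040"  # adjust to http://<driver-node>:4040 for your cluster

def print_stage_progress():
    # The monitoring REST API lives under /api/v1 on the same port as the Spark UI
    apps = requests.get(f"{DRIVER_URL}/api/v1/applications").json()
    app_id = apps[0]["id"]  # first (usually the only) running application
    stages = requests.get(
        f"{DRIVER_URL}/api/v1/applications/{app_id}/stages",
        params={"status": "active"},  # only stages that are still running
    ).json()
    for s in stages:
        print(f"Stage {s['stageId']} ({s['name']}): {s['numCompleteTasks']}/{s['numTasks']} tasks")

while True:  # poll every few seconds for the lifetime of the job
    print_stage_progress()
    time.sleep(5)

If the active-stage task counters stop changing for a long time while the JDBC write is supposedly running, that is a hint the job is blocked (for example, waiting on an uncommitted transaction) rather than simply working through a large dataframe.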
Upvotes: 2