Reputation: 336
I wonder if there is a better way to see whether PySpark is making progress (while writing to a PL/SQL DB). Currently, the only output I see while my code is running is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/09/17 16:33:17 WARN JdbcUtils: Requested isolation level 1 is not supported; falling back to default isolation level 2
[Stage 3:=============================> (1 + 1) / 2]
This stays the same for anywhere from 1 minute to 1 hour, depending on the dataframe size. Normally I would use progressbar2 or write a counter myself, but Spark works differently and does not "iterate" in the classic way, so I cannot wrap the UDF with the progressbar2 library.
The problem is that it is difficult to tell whether my program is just working through a large dataframe or someone has forgotten to commit to the SQL DB, because when PySpark is waiting for a commit it looks exactly the same. So, as you may have guessed, I have wasted plenty of time there.
df_c = df_a.withColumn("new_col", my_udf(df_b["some_col"]))
It would be nice to see some sort of progress from PySpark while doing this step.
Upvotes: 3
Views: 1777
Reputation: 14845
You can check in the Spark UI what your Spark cluster is currently doing. There you can see whether Spark tasks are completing or whether everything hangs. The default URL of the Spark UI is http://<driver-node>:4040.
If you need the data in a more structured way (for example, for automated processing), you can use the Spark UI's REST interface.
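As a rough sketch (assuming the driver runs on localhost with the UI on the default port 4040, and that the REST API exposes the stage fields named below, e.g. numCompleteTasks and numTasks), you could poll the active stages from a separate process while the job runs:

import time
import requests

DRIVER_URL = "http://localhost:4040"  # adjust to http://<driver-node>:4040 for your cluster

def print_stage_progress():
    # The monitoring REST API lives under /api/v1 on the same port as the Spark UI
    apps = requests.get(f"{DRIVER_URL}/api/v1/applications").json()
    app_id = apps[0]["id"]  # first (usually the only) running application
    stages = requests.get(
        f"{DRIVER_URL}/api/v1/applications/{app_id}/stages",
        params={"status": "active"},  # only stages that are still running
    ).json()
    for s in stages:
        print(f"Stage {s['stageId']} ({s['name']}): {s['numCompleteTasks']}/{s['numTasks']} tasks")

while True:  # poll every few seconds for the lifetime of the job
    print_stage_progress()
    time.sleep(5)

If the active-stage task counters stop changing for a long time while the JDBC write is supposedly running, that is a hint the job is blocked (for example, waiting on an uncommitted transaction) rather than simply working through a large dataframe.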
Upvotes: 2