Reputation: 1153
I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call to toPandas() on a PySpark DataFrame, followed by various manipulations of the resulting pandas DataFrame, finishing with a call to to_csv() to write the results to a local CSV file.
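The relevant part of the script looks roughly like this (simplified; the manipulation shown is just a placeholder for what I actually do):

    pdf = spark_df.toPandas()                    # log line before this appears
    pdf["value"] = pdf["value"] * 2              # placeholder for my real manipulations
    pdf.to_csv("/tmp/result.csv", index=False)   # this file never shows up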
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before and after this method call, but only the entries from before the call show up in the spark-submit console output. My guess was that the rest of the code runs in a separate executor process spawned by Spark, so its logs don't show on the console. If that is the case, how can I see my application logs for the executors? I have enabled the event log with spark.eventLog.enabled=true, but it seems to contain only internal Spark events, not my application log statements.
Whether or not that assumption about executor logs is correct, I don't see the CSV file written to the path I expect (/tmp). Further, the history server says No completed applications found! when I start it, even though I have configured it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is for my toPandas() call.
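For reference, this is roughly how I create the session and enable the event log; the paths are simplified placeholders, not my exact setup:

    from pyspark.sql import SparkSession

    # Event log directory must exist before the job starts.
    spark = (
        SparkSession.builder
        .appName("my-job")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "file:///tmp/spark-events")
        .getOrCreate()
    )

    # The history server is started separately, with
    # spark.history.fs.logDirectory pointing at the same directory.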
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
Upvotes: 0
Views: 132
Reputation: 2033
When you use toPandas() to convert your Spark DataFrame to a pandas DataFrame, it is actually a heavy action because it pulls all the records to the driver. Remember that Spark is a distributed computing engine doing parallel computing: your data is spread across different nodes, which is completely different from a pandas DataFrame, since pandas works on a single machine while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, it actually covers two questions.

toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the details of jobs that appear in the Spark computation DAG; other, non-Spark output is not saved there. If you really need those log statements, you need to use something like Python's standard logging module to collect the logs on the driver.
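A minimal sketch of that on the driver (the logger name and format are arbitrary, and df stands for your Spark DataFrame):

    import logging

    # Plain stderr handler on the driver, so application messages
    # show up in the spark-submit console output.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("my_app")

    log.info("before toPandas()")
    pdf = df.toPandas()  # blocks until all rows reach the driver
    log.info("after toPandas(): %d rows", len(pdf))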
/tmp dir: As you mentioned, the event log shows an incomplete application rather than a failed one, so I believe your DataFrame is so huge that the collection has not finished and your transformations on the pandas DataFrame have not even started yet. You can try collecting just a few records, say df.limit(20).toPandas(), to see whether it works. If it works, it means the DataFrame you are converting to pandas is very large and simply takes time. If it doesn't work, maybe you can share more of the error traceback.
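Something like this, assuming df is the DataFrame you are converting (the row count of 20 is arbitrary):

    # Collect only a handful of rows first, as a sanity check that the
    # pipeline after toPandas() actually runs end to end.
    small_pdf = df.limit(20).toPandas()
    small_pdf.to_csv("/tmp/sample.csv", index=False)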
Upvotes: 1