Reputation: 1153
I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call to toPandas() on a PySpark DataFrame, followed by various manipulations of the resulting pandas DataFrame, finishing with a call to to_csv() to write the results to a local CSV file.
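The relevant part of the script looks roughly like this (simplified; the manipulation shown is just a placeholder for what I actually do):

    pdf = spark_df.toPandas()                    # log line before this appears
    pdf["value"] = pdf["value"] * 2              # placeholder for my real manipulations
    pdf.to_csv("/tmp/result.csv", index=False)   # this file never shows up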
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before and after this method call, but only the entries from before the call show up in the spark-submit console output. My guess was that the rest of the code runs in a separate executor process spawned by Spark, so its logs don't show on the console. If that is the case, how can I see my application logs for the executors? I have enabled the event log with spark.eventLog.enabled=true, but it seems to contain only internal Spark events, not my application log statements.
Whether or not that assumption about executor logs is correct, I don't see the CSV file written to the path I expect (/tmp). Further, the history server says No completed applications found! when I start it, even though I have configured it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is for my toPandas() call.
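For reference, this is roughly how I create the session and enable the event log; the paths are simplified placeholders, not my exact setup:

    from pyspark.sql import SparkSession

    # Event log directory must exist before the job starts.
    spark = (
        SparkSession.builder
        .appName("my-job")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "file:///tmp/spark-events")
        .getOrCreate()
    )

    # The history server is started separately, with
    # spark.history.fs.logDirectory pointing at the same directory.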
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
Upvotes: 0
Views: 132
Reputation: 2033
When you use toPandas() to convert your Spark DataFrame to a pandas DataFrame, it is actually a heavy action because it pulls all the records to the driver. Remember that Spark is a distributed computing engine doing parallel computing: your data is spread across different nodes, which is completely different from a pandas DataFrame, since pandas works on a single machine while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, it actually covers two questions.

toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the details of jobs that appear in the Spark computation DAG; other, non-Spark output is not saved there. If you really need those log statements, you need to use something like Python's standard logging module to collect the logs on the driver.
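A minimal sketch of that on the driver (the logger name and format are arbitrary, and df stands for your Spark DataFrame):

    import logging

    # Plain stderr handler on the driver, so application messages
    # show up in the spark-submit console output.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("my_app")

    log.info("before toPandas()")
    pdf = df.toPandas()  # blocks until all rows reach the driver
    log.info("after toPandas(): %d rows", len(pdf))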
/tmp dir: As you mentioned, the event log shows an incomplete application rather than a failed one, so I believe your DataFrame is so huge that the collection has not finished and your transformations on the pandas DataFrame have not even started yet. You can try collecting just a few records, say df.limit(20).toPandas(), to see whether it works. If it works, it means the DataFrame you are converting to pandas is very large and simply takes time. If it doesn't work, maybe you can share more of the error traceback.
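Something like this, assuming df is the DataFrame you are converting (the row count of 20 is arbitrary):

    # Collect only a handful of rows first, as a sanity check that the
    # pipeline after toPandas() actually runs end to end.
    small_pdf = df.limit(20).toPandas()
    small_pdf.to_csv("/tmp/sample.csv", index=False)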
Upvotes: 1