Viktor Khristenko

Reputation: 933

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc.) and it works fine: I can intercept the metrics. The question is about using DataFrames within the listener itself, as that assumes using the same SparkContext; however, as of 2.1.x only one context is allowed per JVM.
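For context, a minimal sketch of what such a listener might look like (the fields collected and the measurement class are illustrative, not the asker's actual code):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable.ArrayBuffer

    class CustomListener extends SparkListener {
      case class TaskMeasurement(stageId: Int, runTimeMs: Long, recordsRead: Long)

      val measurements = ArrayBuffer.empty[TaskMeasurement]

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        // taskMetrics can be null for failed tasks, hence the Option wrapper
        Option(taskEnd.taskMetrics).foreach { m =>
          measurements += TaskMeasurement(
            taskEnd.stageId, m.executorRunTime, m.inputMetrics.recordsRead)
        }
    }

The listener is registered either with sparkContext.addSparkListener(new CustomListener) or via the spark.extraListeners configuration property.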

Suppose I want to write some metrics to disk as JSON. Doing it at applicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).

  1. Is that possible/feasible?
  2. I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
  3. I can force the creation of DataFrames upon the jobEnd event, but then a few errors are thrown (basically saying that events could not be propagated to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk (see the sketch after this list).
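On point 3, one way to avoid DataFrames entirely is to serialize the collected measurements yourself and write them with plain file I/O in onJobEnd, so no SparkContext is touched from inside the listener. A minimal sketch under that assumption (the hand-rolled JSON and the output path are illustrative):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}
    import scala.collection.mutable.ArrayBuffer

    class JsonMetricsListener extends SparkListener {
      private case class TaskMeasurement(stageId: Int, runTimeMs: Long)
      private val measurements = ArrayBuffer.empty[TaskMeasurement]

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        Option(taskEnd.taskMetrics).foreach { m =>
          measurements += TaskMeasurement(taskEnd.stageId, m.executorRunTime)
        }

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        // Hand-rolled JSON: no SparkSession/DataFrame needed inside the listener
        val json = measurements
          .map(m => s"""{"stageId":${m.stageId},"runTimeMs":${m.runTimeMs}}""")
          .mkString("[", ",", "]")
        Files.write(
          Paths.get(s"metrics-job-${jobEnd.jobId}.json"),
          json.getBytes(StandardCharsets.UTF_8))
      }
    }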

Upvotes: 2

Views: 393

Answers (1)

Jacek Laskowski

Reputation: 74669

SparkListeners should be as fast as possible, as a slow SparkListener would block other listeners from receiving events. You could use a separate thread to release the main event dispatcher thread, but you're still bound to the limitation of a single SparkContext per JVM.
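A sketch of that offloading idea (the queue, thread, and processing step are illustrative): the callback only enqueues the event, and a daemon thread does the slow work.

    import java.util.concurrent.LinkedBlockingQueue

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    class AsyncMetricsListener extends SparkListener {
      private val queue = new LinkedBlockingQueue[SparkListenerJobEnd]()

      // Daemon worker thread: does the slow work off the dispatcher thread
      private val worker = new Thread(new Runnable {
        override def run(): Unit = while (true) {
          val event = queue.take() // blocks until an event arrives
          // ... slow work here: serialization, disk or network I/O ...
          println(s"processed job ${event.jobId}")
        }
      })
      worker.setDaemon(true)
      worker.start()

      // The callback itself only enqueues and returns immediately
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = queue.put(jobEnd)
    }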

That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
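For example, from inside a listener callback (whether DataFrame operations triggered from there then behave well is a separate matter, given the errors the question mentions):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // Both return the already-running instances rather than creating new ones
    val sc = SparkContext.getOrCreate()
    val spark = SparkSession.builder.getOrCreate()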

I'd however not recommend that architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does anyway for the web UI and/or the Spark History Server).

I'd rather use Kafka or Cassandra or some other persistent storage to store the events in, and have some other processing application consume them (just like the Spark History Server works).
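A sketch of that decoupling with a plain Kafka producer (the topic name and broker address are illustrative, and kafka-clients is assumed to be on the classpath):

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    class KafkaMetricsListener extends SparkListener {
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      private val producer = new KafkaProducer[String, String](props)

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        val payload = s"""{"jobId":${jobEnd.jobId},"time":${jobEnd.time}}"""
        // send() is asynchronous, so the dispatcher thread is not blocked on I/O
        producer.send(new ProducerRecord[String, String]("spark-metrics", payload))
      }
    }

A separate application (a plain consumer, Spark Streaming, etc.) can then read the topic and build whatever DataFrames it needs with its own SparkContext.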

Upvotes: 1
