Reputation: 2735
This might be a stupid question, but I can't seem to find any doc that explains this in plain English (OK, slight exaggeration), and after reading the official docs and some blog posts, I'm still confused about how the driver and executors work.
Here is my current understanding:
1) The driver defines the transformations/computations.
2) Once we call StreamingContext.start(), the driver sends the defined transformations/computations to all the executors, so that each executor knows how to process the incoming RDD stream data.
OK, here are the questions that confuse me:
1) Does the driver send the defined transformation/computation to all executors only ONCE AND FOR ALL?
If this is the case, we wouldn't have any chance to redefine/change the computation, right?
For example, I have a word-count job similar to this one, but my job is a little more complicated: I want to count only the words starting with the letter J for the first 60s, then only the words starting with the letter K for the next 60s, then only words starting with ......, and so on.
So how am I supposed to implement this streaming job in the driver?
2) Or does the driver restart/reschedule all the executors after each batch of data is done?
FOLLOWUP
To solve question 1), I think I could make use of some external storage, like Redis: I could implement a processing function count_fn in the driver, and each time count_fn is called, it would read the starting letter from Redis and then do the counting on the RDD stream. Is this the right way to go? (A rough sketch of what I have in mind is below.)
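Something like the following is what I have in mind (Scala; the Jedis client and the current_letter key are just placeholders for whatever Redis setup I end up with):

```scala
import org.apache.spark.rdd.RDD
import redis.clients.jedis.Jedis

// Read the starting letter from Redis on the driver, then count only
// the words in this batch's RDD that start with that letter.
def count_fn(rdd: RDD[String]): Unit = {
  val jedis  = new Jedis("localhost")        // driver-side Redis client (placeholder)
  val letter = jedis.get("current_letter")   // e.g. "J", "K", ...
  jedis.close()

  val counts = rdd.flatMap(_.split(" "))
                  .filter(_.startsWith(letter))
                  .map(word => (word, 1L))
                  .reduceByKey(_ + _)

  counts.collect().foreach(println)          // bring the (small) result back to the driver
}

// lines.foreachRDD(rdd => count_fn(rdd))    // `lines` would be my DStream[String]
```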
Upvotes: 2
Views: 89
Reputation: 149518
Does the driver send the defined transformation/computation to all executors only ONCE AND FOR ALL?
No, each task is serialized and sent to all the workers once per batch interval. Think about what happens when you have an instance of a class that is used inside a transformation: Spark has to be able to ship that same instance, with all of its state, to each of the executors that operate on it.
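For instance, here's a minimal sketch (assuming lines is a DStream[String]; PrefixMatcher is only illustrative):

```scala
// An instance created on the driver but referenced inside a transformation.
// It must be serializable, because Spark ships it (together with its state)
// to the executors as part of every task.
class PrefixMatcher(prefix: String) extends Serializable {
  def matches(word: String): Boolean = word.startsWith(prefix)
}

val matcher = new PrefixMatcher("J")

// `matcher` lives on the driver; the filter runs on the executors,
// so the instance travels with each serialized task.
val jWords = lines.flatMap(_.split(" ")).filter(matcher.matches)
```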
If this is the case, we wouldn't have any chance to redefine/change the computation, right?
The logic inside the definition of the transformation is constant, but that doesn't mean you can't query a third-party store that holds information which affects the data inside your transformation.
For example, assume you have some external source that indicates which letter you should filter by. You can then call transform on the DStream; the function you pass to transform is evaluated on the driver for every batch, so it can fetch the current letter there and build the filtered RDD accordingly.
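For example, a minimal sketch (again assuming lines is a DStream[String]; currentLetter() stands in for whatever external lookup you use, e.g. a Redis read):

```scala
// The function passed to `transform` runs on the driver once per batch,
// so that's where a per-batch lookup belongs.
def currentLetter(): String = "J"   // placeholder: read from Redis, a DB, etc.

val counts = lines.transform { rdd =>
  val letter = currentLetter()           // driver-side, once per micro-batch
  rdd.flatMap(_.split(" "))
     .filter(_.startsWith(letter))       // executor-side, per partition
     .map(word => (word, 1L))
     .reduceByKey(_ + _)
}
```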
Or does the driver restart/reschedule all the executors after each batch of data is done?
It doesn't restart them; it simply starts a new job per batch interval. If you defined your StreamingContext batch duration to be 60 seconds, then every 60 seconds a new job (micro-batch) will start processing that interval's data.
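For example (app name and socket source are just placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("letter-count")
val ssc  = new StreamingContext(conf, Seconds(60))   // 60-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)  // placeholder input source
// ...define the transformations on `lines` here...

ssc.start()              // from here on, one new job is generated per 60-second batch
ssc.awaitTermination()
```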
As for your follow-up: yes, that is the way I'd do it.
Upvotes: 2