Reputation: 7563
Consider that we want to compute the average reading of a number of temperature sensors over a given period of time, and that this computation will be done in parallel using a stream processing engine (SPE). Usually, this computation is expressed with at least four UDFs:
map -> keyBy -> window -> aggregate
If my keyBy operator is responsible for extracting the ID of each sensor and I have only 2 sensors, a parallelism of 2 is enough for my application (disclaimer: I don't want to consider how large the window is or whether the tuples fit in memory for now).
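For concreteness, here is a minimal sketch of that pipeline in Flink's Java API. The socket source, the "sensorId,temperature" line format, and the 1-minute window are illustrative assumptions, not part of the question:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorAverages {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // matches the 2-sensor scenario above

        // Assumed source: lines such as "sensor-1,21.5" arriving on a local socket.
        env.socketTextStream("localhost", 9999)
            // map: parse the raw line into (sensorId, temperature)
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
            })
            .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
            // keyBy: partition the stream by sensor ID
            .keyBy(reading -> reading.f0)
            // window: group each sensor's readings into 1-minute windows
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            // aggregate: incrementally track (sum, count) and emit the average
            .aggregate(new AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double>() {
                @Override public Tuple2<Double, Long> createAccumulator() {
                    return Tuple2.of(0.0, 0L);
                }
                @Override public Tuple2<Double, Long> add(Tuple2<String, Double> value,
                                                          Tuple2<Double, Long> acc) {
                    return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
                }
                @Override public Double getResult(Tuple2<Double, Long> acc) {
                    return acc.f0 / acc.f1;
                }
                @Override public Tuple2<Double, Long> merge(Tuple2<Double, Long> a,
                                                            Tuple2<Double, Long> b) {
                    return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
                }
            })
            .print();

        env.execute("sensor-averages");
    }
}
```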
If I have 1,000 sensors, it would be very nice to increase the parallelism, say to 100 nodes.
But what if my parallelism is set to 100 and I am processing tuples from only 2 sensors? Will 98 nodes be idle? Do Spark, Flink, or Storm know that they don't have to shuffle data to those 98 nodes?
The motivation for my question is this other question.
Thanks
Upvotes: 0
Views: 36
Reputation: 11831
The whole point of keyBy() is to distribute items with the same key to the same operator. If you have 2 keys, your items are literally being split into 2 groups, and the maximum parallelism for this stream is 2. Items with key A will be sent to one operator and items with key B will be sent to another operator.
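Concretely, Flink hashes each key into a key group and assigns every key group to exactly one parallel subtask. The small sketch below uses Flink's KeyGroupRangeAssignment utility (from flink-runtime) to show where two hypothetical sensor keys land under a parallelism of 100; the key names and the default max parallelism of 128 are assumptions:

```java
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyRouting {
    public static void main(String[] args) {
        int parallelism = 100;
        int maxParallelism = 128; // Flink's default for parallelism <= 128

        for (String key : new String[] {"sensor-A", "sensor-B"}) {
            // Same routing Flink applies at runtime: key -> key group -> subtask index
            int subtask = KeyGroupRangeAssignment
                .assignKeyToParallelOperator(key, maxParallelism, parallelism);
            System.out.println(key + " -> subtask " + subtask);
        }
    }
}
```

Only the (at most two) subtasks printed here ever receive records for these keys; the other 98 stay idle for this keyed stream. And since the routing is hash-based, both keys can even land on the same subtask, in which case the effective parallelism drops to 1.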
Within Flink, if you just want to distribute the processing of your items among all of the parallel operators, you can use DataStream::shuffle().
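For example, the sketch below (assuming a small in-memory source and a parallelism of 4) forwards each record to a random subtask, so every subtask can receive work regardless of key. Note that shuffle() returns a plain DataStream rather than a KeyedStream, so per-key windowed aggregations are no longer possible downstream of it:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ShuffleExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("sensor-A,20.1", "sensor-A,20.7", "sensor-B,19.4")
            .shuffle() // random partitioning: each record goes to a uniformly random subtask
            .map(line -> "processed: " + line)
            .setParallelism(4) // all 4 subtasks can receive work, independent of any key
            .print();

        env.execute("shuffle-example");
    }
}
```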
Upvotes: 1