Reputation: 7563
Consider that we want to compute the average reading of a number of temperature sensors over a given period of time, and that this computation will be done in parallel using a stream processing engine (SPE). Usually, this computation is expressed with at least four UDFs:
map -> keyBy -> window -> aggregate
If my keyBy operator is responsible for extracting the ID of each sensor and I have only 2 sensors, a parallelism of 2 is enough for my application (disclaimer: I don't want to consider how large the window is or whether the tuples fit in memory for now).
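For concreteness, here is a minimal sketch of that pipeline in Flink's Java API. The socket source, the "sensorId,temperature" line format, and the 1-minute window are illustrative assumptions, not part of the question:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorAverages {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // matches the 2-sensor scenario above

        // Assumed source: lines such as "sensor-1,21.5" arriving on a local socket.
        env.socketTextStream("localhost", 9999)
            // map: parse the raw line into (sensorId, temperature)
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
            })
            .returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
            // keyBy: partition the stream by sensor ID
            .keyBy(reading -> reading.f0)
            // window: group each sensor's readings into 1-minute windows
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            // aggregate: incrementally track (sum, count) and emit the average
            .aggregate(new AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double>() {
                @Override public Tuple2<Double, Long> createAccumulator() {
                    return Tuple2.of(0.0, 0L);
                }
                @Override public Tuple2<Double, Long> add(Tuple2<String, Double> value,
                                                          Tuple2<Double, Long> acc) {
                    return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
                }
                @Override public Double getResult(Tuple2<Double, Long> acc) {
                    return acc.f0 / acc.f1;
                }
                @Override public Tuple2<Double, Long> merge(Tuple2<Double, Long> a,
                                                            Tuple2<Double, Long> b) {
                    return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
                }
            })
            .print();

        env.execute("sensor-averages");
    }
}
```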
If I have 1,000 sensors, it would be very nice to increase the parallelism, say to 100 nodes.
But what if my parallelism is set to 100 and I am processing tuples from only 2 sensors? Will 98 nodes be idle? Do Spark, Flink, or Storm know that they don't have to shuffle data to those 98 nodes?
The motivation for my question is this other question.
Thanks
Upvotes: 0
Views: 36
Reputation: 11831
The whole point of keyBy() is to distribute items with the same key to the same operator. If you have 2 keys, your items are literally being split into 2 groups, and the maximum parallelism for this stream is 2. Items with key A will be sent to one operator and items with key B will be sent to another operator.
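Concretely, Flink hashes each key into a key group and assigns every key group to exactly one parallel subtask. The small sketch below uses Flink's KeyGroupRangeAssignment utility (from flink-runtime) to show where two hypothetical sensor keys land under a parallelism of 100; the key names and the default max parallelism of 128 are assumptions:

```java
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyRouting {
    public static void main(String[] args) {
        int parallelism = 100;
        int maxParallelism = 128; // Flink's default for parallelism <= 128

        for (String key : new String[] {"sensor-A", "sensor-B"}) {
            // Same routing Flink applies at runtime: key -> key group -> subtask index
            int subtask = KeyGroupRangeAssignment
                .assignKeyToParallelOperator(key, maxParallelism, parallelism);
            System.out.println(key + " -> subtask " + subtask);
        }
    }
}
```

Only the (at most two) subtasks printed here ever receive records for these keys; the other 98 stay idle for this keyed stream. And since the routing is hash-based, both keys can even land on the same subtask, in which case the effective parallelism drops to 1.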
Within Flink, if you just want to distribute the processing of your items among all of the parallel operators, you can use DataStream::shuffle().
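For example, the sketch below (assuming a small in-memory source and a parallelism of 4) forwards each record to a random subtask, so every subtask can receive work regardless of key. Note that shuffle() returns a plain DataStream rather than a KeyedStream, so per-key windowed aggregations are no longer possible downstream of it:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ShuffleExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("sensor-A,20.1", "sensor-A,20.7", "sensor-B,19.4")
            .shuffle() // random partitioning: each record goes to a uniformly random subtask
            .map(line -> "processed: " + line)
            .setParallelism(4) // all 4 subtasks can receive work, independent of any key
            .print();

        env.execute("shuffle-example");
    }
}
```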
Upvotes: 1