Reputation: 12371
I'm trying to figure out slot sharing and parallelism in Flink with the example WordCount.
Saying that I need to do the word count job with Flink, there are only one data source and only one sink.
In this case, can I make a design just like the image above? I mean, I set two sub-tasks on Source + map()
and two sub-tasks on keyBy()/window()/apply()
, in other words, I have two lines: A --- B --- Sink
and C --- D --- Sink
so that I can get a better performance.
For example, there is a data stream coming: aaa
, bbb
, aaa
. With the design above, I may get such a situation: aaa
and bbb
goes into the A --- B
and the other aaa
goes into the C --- D
. And finally, I can get the result aaa: 2, bbb: 1
at the Sink
. Am I right for now?
If I'm right, I know that subtasks of the same task cannot share a slot, so does it mean that A
and C
can't share a slot, B
and D
can't share a slot? Am I right? How do I assign the slots? Should I put A + B + Sink
into one slot and put C + D
into another slot?
Upvotes: 0
Views: 941
Reputation: 43409
Slot sharing is enabled by default. With slot sharing enabled, the number of slots required is the same as the parallelism of the task with the highest parallelism (which is two in this case).
In this example the scheduler will put A + B + Sink
into one slot, and C + D
into another. This isn't something you normally need to configure or even give much thought to, as the defaults work well in most cases.
If you were to completely disable slot sharing, then this job would require 5 slots, one for each of A, B, C, D, and the sink. But disabling slot sharing is almost never a good idea. Just make sure each slot has sufficient resources to run all of the subtasks concurrently.
Upvotes: 1