Reputation: 115
I am trying flink for a project at work. I have reached the point where I process a stream by applying count windowing, etc. However, I noticed a peculiar behavior, which I cannot explain.
It seems that a stream is processed by two threads, and the output is also separated in two parts.
First I noticed the behavior when printing the stream to standard console by using stream.print()
.
Then, I printed to a file and it is actually printing in two files named 1
and 2
, in the output folder.
SingleOutputStreamOperator<Tuple3<String, String,String>> c = stream_with_no_err.countWindow(4).apply(new CountPerWindowFunction());
// c.print() // this olso prints two streams in the standard console
c.writeAsCsv("output");
Can someone please explain why is this behavior in flink? How can I configure it? Why is it necessary to have the resulting stream split?
Parallelism I understand as being useful for speed (multiple threads), but why having the resulting stream split?
Usually, I would like to have the resulting stream (after the processing) as a single file, or tcp stream ,etc. Is the normal workflow to manually combine the two files and produce a single output?
Thanks!
Upvotes: 2
Views: 1282
Reputation: 18987
Flink is a distributed and parallel stream processor. As you said correctly, parallelization is necessary to achieve high throughput. The throughput of an application is bounded by its slowest operator. So in many cases also the sink needs to be parallelized.
Having said this, it is super simple to reduce the parallelism of your sink to 1:
c.writeAsCsv("output").setParallelism(1);
Now, the sink will run as a single thread and only produce a single file.
Upvotes: 2