Iterator behaviour in Flink reduceGroup

Question

I am building a system that has to handle a huge amount of data, and I need to understand how the reduceGroup operator works.

I have a dataset on which I apply a groupBy followed by a reduceGroup. How does the iterator that is passed to the reduceGroup function behave? Is it a lazy iterator that loads records as they are requested, or an eager one that prepares all the data in memory when it is created?

I am using the Scala API in Flink 0.9-milestone-1.

Fabian Hueske · Accepted Answer

Flink performs the group-by for a groupReduce using a sort operator. The sort operator receives a certain memory budget for sorting. As long as the data fits into this budget, the sort happens in memory. Otherwise, the sort becomes an external merge-sort and spills to disk. Flink then reads the sorted data stream and applies the groupReduce function "on-the-fly", so the iterator is lazy: the data of a group is not completely read into memory before the function is applied. Hence, you can process very large groups as long as the user function does not materialize the group's records itself.
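To illustrate the last point, here is a minimal sketch in plain Scala (no Flink dependency) contrasting a group function that consumes the iterator in a single pass with one that materializes the group. The names `GroupReduceSketch`, `sumStreaming`, and `sumMaterializing` are made up for this example, and the tuple layout `(key, value)` is an assumption.

```scala
object GroupReduceSketch {

  // Streaming-friendly: walks the iterator once and keeps only a running
  // sum and count, so memory use does not grow with the size of the group.
  // Assumes the group is non-empty (otherwise the returned key is null).
  def sumStreaming(records: Iterator[(String, Int)]): (String, Long, Long) = {
    var key: String = null
    var sum = 0L
    var count = 0L
    while (records.hasNext) {
      val (k, v) = records.next()
      key = k
      sum += v
      count += 1
    }
    (key, sum, count)
  }

  // Materializing: toList copies every record of the group onto the heap
  // before computing anything. This is the pattern to avoid for very
  // large groups, because it defeats the lazy iterator.
  def sumMaterializing(records: Iterator[(String, Int)]): (String, Long, Long) = {
    val all = records.toList
    (all.head._1, all.map(_._2.toLong).sum, all.size.toLong)
  }

  def main(args: Array[String]): Unit = {
    // A lazily generated stand-in for one sorted group of records.
    val group = Iterator.tabulate(1000000)(i => ("a", i % 10))
    println(sumStreaming(group))
  }
}
```

In a Flink job, a function like `sumStreaming` could be passed along the lines of `data.groupBy(0).reduceGroup(it => GroupReduceSketch.sumStreaming(it))`, assuming the Scala DataSet API's `reduceGroup` variant that takes an `Iterator[T] => R` function. Both functions return the same result; only their memory footprint differs.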
