Reputation: 5049
Flink's documentation for GroupCombine states:
Note: The GroupCombine on a Grouped DataSet is performed in memory with a greedy strategy which may not process all data at once but in multiple steps. It is also performed on the individual partitions without a data exchange like in a GroupReduce transformation. This may lead to partial results.
It adds the following remark for full (non-grouped) DataSets:
The GroupCombine on a full DataSet works similar to the GroupCombine on a grouped DataSet. The data is partitioned on all nodes and then combined in a greedy fashion (i.e. only data fitting into memory is combined at once).
Does this mean that if my dataset consists of, for example:
1
2
3
and I want to generate all pairwise combinations
(1, 2), (1, 3), (2, 3)
I cannot implement this in a general way with a GroupCombine transformation, because it doesn't guarantee that the whole group will fit in a given partition's memory?
Upvotes: 0
Views: 80
Reputation: 18987
GroupCombine is a non-deterministic operation in Flink. It is typically used to perform partial computations (like aggregations) and is followed by a deterministic operation like GroupReduce that consumes the partial results. Its purpose is to reduce the cost of the deterministic operation by performing less-expensive local, in-memory computations first.
If you need to compute deterministic results on groups of records, you should use GroupReduce.
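To illustrate the difference, here is a minimal plain-Java sketch that does not use the Flink API at all. It only simulates the idea: `pairs` plays the role of a function that sees the whole group (as GroupReduce would), while `pairsPerChunk` and `sumViaPartials` apply a function to fixed-size chunks, which is a hypothetical stand-in for the greedy, memory-bound combine step. The class name, method names, and the `chunkSize` parameter are all assumptions made for this example; how Flink actually splits the data depends on available memory and partitioning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSketch {

    // All unordered pairs of one list: what a function produces
    // when it is handed the complete group at once.
    static List<String> pairs(List<Integer> group) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                out.add("(" + group.get(i) + ", " + group.get(j) + ")");
            }
        }
        return out;
    }

    // Hypothetical stand-in for a greedy combine: the input is cut into
    // chunks of at most chunkSize elements and pairs() runs on each chunk
    // independently, so pairs that span two chunks are never produced.
    static List<String> pairsPerChunk(List<Integer> data, int chunkSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < data.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, data.size());
            out.addAll(pairs(data.subList(start, end)));
        }
        return out;
    }

    // A sum can be split safely: partial sums per chunk (the "combine"),
    // then a final sum over the partial results (the "reduce").
    static int sumViaPartials(List<Integer> data, int chunkSize) {
        List<Integer> partials = new ArrayList<>();
        for (int start = 0; start < data.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, data.size());
            int partial = 0;
            for (int v : data.subList(start, end)) {
                partial += v;
            }
            partials.add(partial);
        }
        int total = 0;
        for (int p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3);
        System.out.println(pairs(data));             // [(1, 2), (1, 3), (2, 3)]
        System.out.println(pairsPerChunk(data, 2));  // [(1, 2)]
        System.out.println(sumViaPartials(data, 2)); // 6
    }
}
```

With a chunk size of 2, the pairs (1, 3) and (2, 3) are lost because 3 ends up in a different chunk, whereas the sum still comes out right. That is why aggregations fit the combine-then-reduce pattern, and a computation that needs the whole group at once belongs in the GroupReduce step.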
Upvotes: 0