Reputation: 447
What is the difference between a JoinFunction
and a CoGroupFunction
in Apache Flink? How do semantics and execution differ?
Upvotes: 10
Views: 4132
Reputation: 18987
Both, Join and CoGroup transformations join two inputs on key fields. The differences is how the user functions are called:
JoinFunction
with pairs of matching records from both inputs that have the same values for key fields. This behavior is very similar to an equality inner join.CoGroupFunction
with iterators over all records of both inputs that have the same values for key fields. If an input has no records for a certain key value an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is hence more generic than the Join transformation.Looking at the execution strategies of Join and CoGroup, Join can be executed using sort- and hash-based join strategies where as CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.
Upvotes: 27