Jary zhen
Jary zhen

Reputation: 447

How do Apache Flink's JoinFunction and CoGroupFunction differ?

What is the difference between a JoinFunction and a CoGroupFunction in Apache Flink? How do semantics and execution differ?

Upvotes: 10

Views: 4132

Answers (1)

Fabian Hueske
Fabian Hueske

Reputation: 18987

Both, Join and CoGroup transformations join two inputs on key fields. The differences is how the user functions are called:

  • the Join transformation calls the JoinFunction with pairs of matching records from both inputs that have the same values for key fields. This behavior is very similar to an equality inner join.
  • the CoGroup transformation calls the CoGroupFunction with iterators over all records of both inputs that have the same values for key fields. If an input has no records for a certain key value an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is hence more generic than the Join transformation.

Looking at the execution strategies of Join and CoGroup, Join can be executed using sort- and hash-based join strategies where as CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.

Upvotes: 27

Related Questions