Reputation: 3180
In Spark's documentation, Aggregator is:
abstract class Aggregator[-IN, BUF, OUT] extends Serializable
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
UserDefinedAggregateFunction is:
abstract class UserDefinedAggregateFunction extends Serializable
The base class for implementing user-defined aggregate functions (UDAF).
According to Dataset Aggregator - Databricks, "an Aggregator is similar to a UDAF, but the interface is expressed in terms of JVM objects instead of as a Row."
It seems these two classes are very similar; apart from the types in the interface, what are the other differences?
A similar question is: Performance of UDAF versus Aggregator in Spark
Upvotes: 6
Views: 1802
Reputation: 96
A fundamental difference, apart from types, is the external interface:

- Aggregator takes a complete record (it is intended for the "strongly" typed API).
- UserDefinedAggregateFunction takes a set of Columns.

This makes Aggregator less flexible, although the overall API is far more user-friendly.
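To make the typed interface concrete, here is a minimal sketch of the Aggregator contract (zero / reduce / merge / finish, as in org.apache.spark.sql.expressions.Aggregator). The class and example names are mine, and the sketch is deliberately Spark-free so it runs standalone; a real Spark Aggregator would also override bufferEncoder and outputEncoder.

```scala
// Sketch of Spark's Aggregator[-IN, BUF, OUT] contract, reproduced
// without Spark dependencies so the shape of the typed API is visible.
abstract class AggregatorSketch[-IN, BUF, OUT] extends Serializable {
  def zero: BUF                       // initial buffer value
  def reduce(b: BUF, a: IN): BUF      // fold one input element into the buffer
  def merge(b1: BUF, b2: BUF): BUF    // combine two partial buffers
  def finish(reduction: BUF): OUT     // produce the final result
}

// The aggregation works on whole JVM objects (a case class), not on Rows.
case class Employee(name: String, salary: Double)
case class AvgBuffer(sum: Double, count: Long)

object SalaryAvg extends AggregatorSketch[Employee, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, e: Employee): AvgBuffer =
    AvgBuffer(b.sum + e.salary, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(r: AvgBuffer): Double = r.sum / r.count
}
```

With the real class you would apply this to a typed Dataset via `ds.select(SalaryAvg.toColumn)`; the point is that every method sees `Employee` and `AvgBuffer` directly rather than a `Row`.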
There is also a difference in how state is handled:

- Aggregator is stateful: it depends on the mutable internal state of its buffer field.
- UserDefinedAggregateFunction is stateless: the state of the buffer is external.

Upvotes: 8
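The state contrast above can be sketched as follows. This is not Spark code: the object names are hypothetical, and a plain ArrayBuffer stands in for Spark's MutableAggregationBuffer, just to show who owns the buffer in each style.

```scala
import scala.collection.mutable.ArrayBuffer

// Aggregator style: the aggregation logic produces and returns its own
// buffer, so the accumulation state lives inside the aggregation.
object AggregatorStyleSum {
  def zero: Long = 0L
  def reduce(buffer: Long, in: Long): Long = buffer + in
}

// UDAF style: the engine owns the buffer (in Spark, a
// MutableAggregationBuffer, i.e. a Row) and passes it in on every call;
// the function itself holds no state and only mutates what it is given.
object UdafStyleSum {
  def initialize(buffer: ArrayBuffer[Any]): Unit =
    buffer(0) = 0L
  def update(buffer: ArrayBuffer[Any], input: Long): Unit =
    buffer(0) = buffer(0).asInstanceOf[Long] + input
  def evaluate(buffer: ArrayBuffer[Any]): Long =
    buffer(0).asInstanceOf[Long]
}
```

In the UDAF style the caller can create, serialize, or swap buffers freely because the function never keeps one; in the Aggregator style the buffer's lifecycle is tied to the aggregation itself.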