turtlemonvh

Reputation: 9759

How to use a custom type-safe aggregator in Spark SQL

The Spark documentation describes how to create both an untyped user-defined aggregate function (code) (aka a UDAF) and a strongly-typed aggregator (code) (i.e. a subclass of org.apache.spark.sql.expressions.Aggregator).

I know you can register a UDAF for use in SQL via spark.udf.register("udafName", udafInstance), and then use it like spark.sql("SELECT udafName(V) AS aggV FROM data").
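For context, a minimal sketch of that registration flow; MyAverage stands in for the UDAF class from the linked docs, and the data view is assumed to already exist:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("udaf-demo").getOrCreate()

    // MyAverage is a UserDefinedAggregateFunction subclass, e.g. the one
    // from the Spark docs example linked above.
    spark.udf.register("udafName", new MyAverage)

    // "data" is assumed to be a registered table or temp view with a column V.
    val aggregated = spark.sql("SELECT udafName(V) AS aggV FROM data")
    aggregated.show()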

Is there a way to use an Aggregator in SQL too?

Upvotes: 2

Views: 599

Answers (1)

user10010023

Reputation: 26

Not really. The Aggregator API is designed specifically with strongly-typed Datasets in mind. You'll notice that it doesn't take Columns but always operates on whole record objects.

This doesn't really fit into SQL processing model:

  • In SQL you always operate on Dataset[Row], where an Aggregator is of little use.
  • Operations are applied to columns, while an Aggregator consumes a complete record object, as the sketch below illustrates.
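
To make the contrast concrete, here is a minimal sketch of a typed Aggregator (the Record case class and the SumValues name are illustrative):

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    case class Record(key: String, value: Long)

    // The IN type parameter is the whole record, not a column.
    object SumValues extends Aggregator[Record, Long, Long] {
      def zero: Long = 0L
      def reduce(buf: Long, r: Record): Long = buf + r.value
      def merge(b1: Long, b2: Long): Long = b1 + b2
      def finish(buf: Long): Long = buf
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    // Only usable through the typed Dataset API:
    // val ds: Dataset[Record] = ...
    // ds.select(SumValues.toColumn)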

For use with the SQL API you can create a UserDefinedAggregateFunction, which can be registered using the standard methods, as in the sketch below.
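
For example, a minimal sum UDAF might look like this (the LongSum name and the column names are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // A minimal long-sum UDAF; it operates column-wise, so it fits SQL.
    class LongSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
      def dataType: DataType = LongType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
      def evaluate(buffer: Row): Long = buffer.getLong(0)
    }

    // Register and call from SQL:
    // spark.udf.register("long_sum", new LongSum)
    // spark.sql("SELECT long_sum(V) AS aggV FROM data")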

Upvotes: 1
