Amir Tuval

Reputation: 317

Implementing a custom Apache Pig Algebraic UDF

Everyone

I implemented a custom aggregate Pig UDF. The UDF implements the Algebraic interface, with three classes (Initial, Intermed and Final) that do the work at the different phases. It works correctly, but somewhat inefficiently.

The UDF uses an algorithm that is a bit heavy, especially when it runs on a single value. It would be much more efficient running on bigger groups of data, say 100 values at a time. What I observed is that the Initial class is always invoked with a single value, and its outputs are later combined by the Intermed and Final classes.

I am aware that there is the Accumulator interface for such cases, but I could not find documentation on how to use it with an Algebraic UDF.

So my question is: is there a way for me to "force" Pig to pass more values to the Initial calculation, either using the Accumulator interface or in some other way?

An explanation, a pointer to documentation, or a sample would be much appreciated.

Thanks, Amir

Upvotes: 1

Views: 672

Answers (1)

Amir Tuval

Reputation: 317

It seems that Pig's Algebraic Initial function will always receive a single value in its tuple (at least according to this blog post).

To solve my issue, I ended up simply returning the single value from the Initial function without processing it at all. The Intermed and Final functions perform the algorithm.

According to the docs, the Intermed function may receive outputs from either the Initial function or another Intermed function. I did not see this in practice: in my tests, the Intermed function always received values from the Initial function. To handle both cases, my Initial and Intermed functions now return a Tuple of two values. The first value in the tuple is a string telling me the source of the value, either "Initial" or "Intermed". The second value in the tuple is the actual result.
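The tagging scheme described above can be sketched roughly as follows. This is a plain-Java simulation that does not use the Pig API, so it runs without Pig on the classpath. The names `Tagged`, `heavyBatch`, `merge` and `combine` are hypothetical, and a simple sum stands in for the real (heavy) algorithm, which the post does not show. In an actual UDF the three phases would be `EvalFunc<Tuple>` classes and `Tagged` would be a two-field Pig `Tuple`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedAlgebraicSketch {

    /** Stand-in for the two-field Tuple: a source tag plus a payload. */
    static final class Tagged {
        final String source; // "Initial" or "Intermed"
        final long value;    // raw value (Initial) or partial result (Intermed)

        Tagged(String source, long value) {
            this.source = source;
            this.value = value;
        }
    }

    /** Hypothetical placeholder for the expensive batch algorithm (here: a sum). */
    static long heavyBatch(List<Long> raw) {
        long total = 0;
        for (long v : raw) {
            total += v;
        }
        return total;
    }

    /** Hypothetical merge of two partial results (here: addition). */
    static long merge(long a, long b) {
        return a + b;
    }

    /** Initial phase: no processing, just tag the single raw value. */
    static Tagged initial(long value) {
        return new Tagged("Initial", value);
    }

    /** Shared logic: batch the raw values, merge the already-computed partials. */
    static long combine(List<Tagged> inputs) {
        List<Long> raw = new ArrayList<>();
        long partial = 0; // assumption: 0 is the identity for merge in this sketch
        for (Tagged t : inputs) {
            if ("Initial".equals(t.source)) {
                raw.add(t.value);                 // untouched value from Initial
            } else {
                partial = merge(partial, t.value); // result of an earlier Intermed
            }
        }
        if (!raw.isEmpty()) {
            partial = merge(partial, heavyBatch(raw)); // run the algorithm once per batch
        }
        return partial;
    }

    /** Intermed phase: may see Initial outputs, Intermed outputs, or both. */
    static Tagged intermed(List<Tagged> inputs) {
        return new Tagged("Intermed", combine(inputs));
    }

    /** Final phase: same logic, but returns the bare result. */
    static long fin(List<Tagged> inputs) {
        return combine(inputs);
    }

    public static void main(String[] args) {
        Tagged partial = intermed(Arrays.asList(initial(1), initial(2), initial(3)));
        // Final receives a mix of an Intermed result and raw Initial values.
        System.out.println(fin(Arrays.asList(partial, initial(4), initial(5)))); // prints 15
    }
}
```

The point of the tag is that Intermed and Final can tell raw values (which still need the algorithm) apart from partial results (which only need merging), so the heavy step runs once per batch instead of once per value.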

Upvotes: 1
