Reputation: 139
I have written an udf (extends EvalFunc<Tuple>
) which has as output tuples with inner tuples (nested).
For example the dump looks like:
(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))
Now I want to process each of this terms, distinct it and give it an id (RANK
). To do this, i need to get rid of the brackets. A simple FLATTEN
does not help in this case.
The final output should be like:
1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....
My code (not the udf part and not the raw parsing):
tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;
Upvotes: 0
Views: 393
Reputation: 1691
I think your UDF is return is not optimal for your workflow. Instead of returning a tuple with variable number of fields (which are tuples), it would be a lot more convenient to return a bag of tuples.
Instead of
(((wedg,wedge),(audusd,audusd)))
you will have
({(wedg,wedge),(audusd,audusd)})
and you will be able to FLATTEN that bag to: 1. make the DISTINCT 2. RANK the tags
To do so, update your UDF like this :
class MyUDF extends EvalFunc <DataBag> {
@Override
public DataBag exec(Tuple input) throws IOException {
// create DataBag
}
}
Upvotes: 1