Reputation: 373
I am currently trying to create a concatenating string for each group. This string should be the concatenation of all the occurrences of the field.
For the moment my code looks like :
grouped = GROUP a by group_field;
b = FOREACH grouped {
unique_field = DISTINCT myfield;
tupl = TOTUPLE(unique_field) ;
FOREACH tupl GENERATE group as id, CONCAT( ? ) as my_new_string;
}
The thing is I absolutely do not know for each group the number of distinct fields or what they contains. I don't know how what to do to replace the ?
and make it work.
Upvotes: 2
Views: 2694
Reputation: 5186
TOTUPLE
is not doing what you are expecting, it is making a one element tuple where that one element is the bag of unique_field
.
Also, CONCAT
only takes two things to concat and they must be explicitly defined. Let's say that you have a schema like A: {A1: chararray, A2: chararray, A3: chararray}
and you want to concatinate all fields together. You will have to do this (which is obviously not ideal): CONCAT(CONCAT(A1, A2), A3)
.
Anyways, this problem can be easily solved with a python UDF.
myudfs.py
#!/usr/bin/python
@outputSchema('concated: string')
def concat_bag(BAG):
return ''.join(BAG)
This UDF would be used in your script like:
Register 'myudfs.py' using jython as myfuncs;
grouped = GROUP a by group_field;
b = FOREACH grouped {
unique_field = DISTINCT myfield;
GENERATE group as id, myfuncs.concat_bag(unique_field);
}
I just noticed the FOREACH tupl GENERATE ...
line. That is not valid syntax. The last statement in a nested FOREACH
should be a GENERATE
.
Upvotes: 3