concat every field in pig?

Question

I am trying to build a concatenated string for each group. The string should be the concatenation of all occurrences of the field within that group.

For the moment, my code looks like this:

grouped = GROUP a by group_field;

b = FOREACH grouped {
    unique_field = DISTINCT myfield;
    tupl = TOTUPLE(unique_field) ; 
    FOREACH tupl GENERATE group as id, CONCAT( ? ) as my_new_string;
}

The thing is, I have no way of knowing in advance, for each group, how many distinct values there are or what they contain. I don't know what to put in place of the ? to make it work.

mr2ert · Accepted Answer

TOTUPLE is not doing what you expect: it makes a one-element tuple whose single element is the bag unique_field.

Also, CONCAT only takes two arguments, and they must be named explicitly. Say you have a schema like A: {A1: chararray, A2: chararray, A3: chararray} and you want to concatenate all the fields together. You will have to do this (which is obviously not ideal): CONCAT(CONCAT(A1, A2), A3).

Anyway, this problem can be easily solved with a Python UDF.
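The two points above can be modelled in plain Python. This is only a sketch of the data shapes, not Pig itself: `concat2` is a hypothetical stand-in for a two-argument CONCAT, and the bag is assumed to look like a list of one-field tuples.

```python
from functools import reduce

# Hypothetical stand-in for a two-argument CONCAT: joins exactly two strings.
def concat2(left, right):
    return left + right

# Assumed shape of the bag of distinct values for one group:
# a list of one-field tuples (sample data, not from the question).
unique_field = [("a",), ("b",), ("c",)]

# Model of TOTUPLE(unique_field): one tuple whose only element is the whole bag.
tupl = (unique_field,)

# With a known, fixed number of fields, the calls have to be nested by hand.
nested = concat2(concat2("a", "b"), "c")

# With an unknown number of values, the same pairwise step has to be folded
# over the bag, which is what a UDF can do and a fixed CONCAT expression cannot.
folded = reduce(concat2, (row[0] for row in unique_field), "")

print(len(tupl), nested, folded)
```

The tuple has length 1 no matter how many values the bag holds, which is why there is nothing useful to pass to CONCAT in the original script.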

myudfs.py

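#!/usr/bin/python

@outputSchema('concated: chararray')
def concat_bag(BAG):
    # Pig hands a bag to a Jython UDF as a list of tuples, so take the
    # first field of each tuple (assumed to be a chararray) before joining.
    return ''.join([t[0] for t in BAG])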

This UDF would be used in your script like:

Register 'myudfs.py' using jython as myfuncs;

grouped = GROUP a by group_field;

b = FOREACH grouped {
    unique_field = DISTINCT myfield;
    GENERATE group as id, myfuncs.concat_bag(unique_field);
}
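To check the UDF's logic without running Pig, something like the following can be run as an ordinary Python script. It is a minimal sketch under two assumptions: the bag reaches the function as a list of one-field tuples, and the `outputSchema` defined here is only a local stand-in for the decorator that Pig makes available when the file is registered with `using jython`. The sample values are made up.

```python
# Local stand-in for the decorator Pig supplies to registered Jython scripts.
# It only records the schema string so the UDF file can be imported as-is.
def outputSchema(schema):
    def decorator(func):
        func.outputSchema = schema
        return func
    return decorator

@outputSchema('concated: chararray')
def concat_bag(bag):
    # Each element of the bag is a tuple; its first field is the string to join.
    return ''.join(row[0] for row in bag)

# Hypothetical bag for one group, shaped like the output of DISTINCT myfield.
sample_bag = [('foo',), ('bar',), ('baz',)]
result = concat_bag(sample_bag)
print(result)
```

Keep in mind that a Pig bag is unordered, so the order of the pieces in the concatenated string is not guaranteed unless the bag is sorted inside the nested FOREACH first.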

I just noticed the FOREACH tupl GENERATE ... line. That is not valid syntax. The last statement in a nested FOREACH should be a GENERATE.
