Reputation: 10291
Playing with Pig, my input file is:
1, 4, 6
1, 2, 7, 9
2, 5, 1
1, 3, 5, 1
2, 6, 2, 8
The first value in each row is the ID; the rest of the row consists of unique values (each row can have a different number of columns).
I want to transpose the above into:
1, 2, 4, 6, 7, 9, 3, 5, 1
2, 5, 1, 6, 2, 8
So basically GROUP by ID, then flatten the rest of the columns and output that as each row.
Is Pig even the right approach here? I have a way to do this in M/R, but thought Pig might be ideal for this sort of thing.
Many thanks for any hints provided.
Duncan
PS I do not care about the order of the values.
Upvotes: 0
Views: 1857
Reputation: 9073
Untested, but here's the general approach I'd take: split each line into the ID and a bag of values, flatten that so you get one row per (ID, value) pair, take the distinct rows, then group by the ID. This gives you a bag of values for each ID, which you can convert to a string if you want to output it.
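A = LOAD 'input' USING TextLoader() AS (line:chararray);
-- Split each line once, into the ID and the rest of the row; FLATTEN unpacks the tuple into two fields.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ',', 2)) AS (id:chararray, values:chararray);
-- TOKENIZE returns a bag of tokens (its default delimiters include commas and spaces); FLATTEN gives one row per value.
C = FOREACH B GENERATE id, FLATTEN(TOKENIZE(values)) AS value:chararray;
D = DISTINCT C; -- I'm assuming you actually want distinct values; it wasn't clear.
E = GROUP D BY id;
-- BagToString joins with '_' by default, so pass the delimiter explicitly.
F = FOREACH E GENERATE group AS id, BagToString(D.value, ', ') AS valueString:chararray;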
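If you'd rather not have a separate DISTINCT step, the de-duplication can probably be done inside the grouping with a nested FOREACH instead. This is only a sketch and just as untested: it assumes the relation C with fields id and value from above, and E2, F2, vals and uniq are names I made up for illustration.

```pig
-- Sketch only: assumes relation C with fields (id, value) from the script above.
E2 = GROUP C BY id;
F2 = FOREACH E2 {
    vals = C.value;        -- project just the value column of each group's bag
    uniq = DISTINCT vals;  -- de-duplicate within the group
    GENERATE group AS id, BagToString(uniq, ', ') AS valueString;
};
```

Either way you end up with one row per ID. If duplicates are fine, drop the DISTINCT and pass the bag straight to BagToString.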
Upvotes: 2