Duncan

Reputation: 10291

PIG (Hadoop) - rows with variable columns

Playing with Pig, my input file is:

1, 4, 6

1, 2, 7, 9

2, 5, 1

1, 3, 5, 1

2, 6, 2, 8

The first value in each row is the ID; the rest of the row is simply a set of unique values (each row can have a different number of columns).

I want to transpose the above into:

1, 2, 4, 6, 7, 9, 3, 5, 1

2, 5, 1, 6, 2, 8

So basically GROUP by ID, then flatten the rest of the columns and output that as each row.

Is PIG even the right approach here? I have a way to do this in M/R, but thought Pig might be ideal for this sort of thing.

Many thanks for any hints provided

Duncan

PS I do not care about the order of the values.

Upvotes: 0

Views: 1857

Answers (1)

DMulligan

Reputation: 9073

Untested, but here's the general approach I'd take: split each row into the ID and a bag of values, flatten it so you get rows of just an ID and a single value, take the distinct rows, then group by the ID. This gives you a bag of values for each ID, which you can convert to a string if you want to output it.

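A = LOAD 'input' USING TextLoader() AS (line:chararray);
-- STRSPLIT returns a tuple, so FLATTEN it to get two separate fields.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ',', 2)) AS (id:chararray, values:chararray);
-- TOKENIZE returns a bag of tokens; its default separators include comma and space.
C = FOREACH B GENERATE id, FLATTEN(TOKENIZE(values)) AS value:chararray;
D = DISTINCT C; -- I'm assuming you actually want distinct values, wasn't clear.
E = GROUP D BY id;
-- BagToString joins with '_' unless a delimiter is given.
F = FOREACH E GENERATE group AS id, BagToString(D.value, ',') AS valueString:chararray;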
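
A variant of the same idea, also untested, moves the de-duplication into a nested FOREACH so it happens per ID after grouping, and writes the result out. It is a sketch under a few assumptions: a Pig version where BagToString is a built-in, the relation names are made up for illustration, and 'output' is a placeholder path.

```pig
-- Sketch only: relation names and the 'output' path are hypothetical.
raw    = LOAD 'input' USING TextLoader() AS (line:chararray);
-- Split off the ID; everything after the first comma stays in one string.
parts  = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, ',', 2)) AS (id:chararray, rest:chararray);
-- One (id, val) row per value; TOKENIZE splits on comma and space by default.
pairs  = FOREACH parts GENERATE id, FLATTEN(TOKENIZE(rest)) AS val:chararray;
byId   = GROUP pairs BY id;
result = FOREACH byId {
    uniq = DISTINCT pairs.val;   -- de-duplicate within each ID's bag
    GENERATE group AS id, BagToString(uniq, ',') AS vals;
};
STORE result INTO 'output' USING PigStorage(',');
```

If duplicates should be kept, drop the nested DISTINCT and pass pairs.val straight to BagToString. Either way the order of values within a row is not guaranteed, which matches the "do not care about the order" note in the question.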

Upvotes: 2
